We are super excited to share a trove of amazing features in 23.10.
And we already have a date for the 23.11 release: please register now to join the community call on December 5th at 9:00 AM (PDT) / 6:00 PM (CET).
23 new features. 26 performance optimisations. 60 bug fixes.
A small subset of highlighted features is below… But the release also covers new SHOW MERGES and SHOW SETTINGS commands; new byteSwap, arrayRandomSample, jsonMergePatch, formatQuery, and formatQuerySingleLine functions; argMin and argMax as combinators; a parameterized ALTER command with partitions; the untuple function with better names; enforcing projections; allowing tables without a primary key; and so…much…more.
As always, we send a special welcome to all the new contributors in 23.10! ClickHouse’s popularity is, in large part, due to the efforts of the community that contributes. Seeing that community grow is always humbling.
If you see your name here, please reach out to us…but we will be finding you on Twitter and elsewhere as well.
AN, Aleksa Cukovic, Alexander Nikolaev, Avery Fischer, Daniel Byta, Dorota Szeremeta, Ethan Shea, FFish, Gabriel Archer, Itay Israelov, Jens Hoevenaars, Jihyuk Bok, Joey Wang, Johnny, Joris Clement, Lirikl, Max K, Priyansh Agrawal, Sinan, Srikanth Chekuri, Stas Morozov, Vlad Seliverstov, bhavuk2002, guoxiaolong, huzhicheng, monchickey, pdy, wxybear, yokofly
Largest Triangle Three Buckets is an algorithm for downsampling data to make it easier to visualize. It tries to retain the visual similarity of the initial data while reducing the number of points. In particular, it seems to be very good at retaining local minima and maxima, which are often lost with other downsampling methods.
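To make the idea concrete, here is a simplified Python sketch of the LTTB algorithm. The function and variable names are our own, and this is an illustrative implementation rather than the exact code ClickHouse uses: the first and last points are always kept, the rest are split into buckets, and each bucket contributes the point forming the largest triangle with the previously selected point and the next bucket's average.

```python
def lttb(points, threshold):
    """Downsample a list of (x, y) points to `threshold` points using
    Largest Triangle Three Buckets. Keeps the first and last points."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)

    sampled = [points[0]]                  # always keep the first point
    bucket_size = (n - 2) / (threshold - 2)

    a = 0                                  # index of the last selected point
    for i in range(threshold - 2):
        # Boundaries of the current bucket
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        # Average point of the *next* bucket (the third triangle vertex)
        nxt_start = end
        nxt_end = min(int((i + 2) * bucket_size) + 1, n)
        count = nxt_end - nxt_start
        avg_x = sum(p[0] for p in points[nxt_start:nxt_end]) / count
        avg_y = sum(p[1] for p in points[nxt_start:nxt_end]) / count

        # Pick the point in this bucket that maximizes the triangle area
        # formed with the previous selection and the next bucket's average
        ax, ay = points[a]
        best_area, best_idx = -1.0, start
        for j in range(start, end):
            bx, by = points[j]
            area = abs((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay)) / 2
            if area > best_area:
                best_area, best_idx = area, j
        sampled.append(points[best_idx])
        a = best_idx

    sampled.append(points[-1])             # always keep the last point
    return sampled
```

Because the triangle-area criterion rewards points far from their neighbours, an isolated spike in an otherwise flat series survives the downsampling, which is exactly the property that preserves local minima and maxima.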
We’re going to see how it works with help from the Kaggle SF Bay Area Bike Share dataset, which contains one CSV file that tracks the number of docks available per station on a minute-by-minute basis.
Let’s create a database:
CREATE DATABASE BikeShare;
USE BikeShare;
And then create a table, status, populated by the status.csv file:
CREATE TABLE status ENGINE = MergeTree ORDER BY (station_id, time) AS
FROM file('Bay Area Bikes.zip :: status.csv', CSVWithNames)
SELECT *
SETTINGS schema_inference_make_columns_nullable=0;
SELECT formatReadableQuantity(count(*))
FROM status
┌─formatReadableQuantity(count())─┐
│ 71.98 million │
└─────────────────────────────────┘
Raw data
Let’s first have a look at the raw data for one of the stations over a period of a few days. There are 4,537 points returned by the following query, which is stored in the file raw.sql:
FROM BikeShare.status
SELECT toUnixTimestamp64Milli(time), docks_available
WHERE toDate(time) >= '2013-08-29' AND toDate(time) <= '2013-09-01'
  AND station_id = 70
FORMAT CSV
We can visualize the docks available over time by running the following query:
clickhouse local --path bikeshare.chdb < raw.sql |
uplot line -d, -w 100 -t "Raw Data"
Next, we’re going to see what happens if we reduce the number of points by roughly 10x, which we can do by averaging the points in buckets of 10 minutes. This query will be stored in the file avg.sql and is shown below:
WITH buckets AS (
SELECT
toStartOfInterval(time, INTERVAL 10 minute) AS bucket,
AVG(docks_available) AS average_docks_available,
AVG(toUnixTimestamp64Milli(time)) AS average_bucket_time
FROM BikeShare.status
    WHERE toDate(time) >= '2013-08-29' AND toDate(time) <= '2013-09-01'
      AND station_id = 70
GROUP BY bucket
ORDER BY bucket
)
SELECT average_bucket_time, average_docks_available
FROM buckets
FORMAT CSV
We can generate the visualization like this:
clickhouse local --path bikeshare.chdb < avg.sql |
uplot line -d, -w 100 -t "Average every 10 mins"
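For reference, the bucket-averaging that avg.sql performs can be sketched in a few lines of Python. The names here are illustrative (this is not a ClickHouse API): points are grouped into fixed-width time buckets, and each bucket is reduced to the average of its timestamps and values.

```python
from collections import defaultdict

def average_buckets(points, bucket_ms=10 * 60 * 1000):
    """Group (timestamp_ms, value) points into fixed-width time buckets
    and return one (average timestamp, average value) pair per bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // bucket_ms].append((ts, value))
    result = []
    for key in sorted(buckets):
        group = buckets[key]
        avg_ts = sum(t for t, _ in group) / len(group)
        avg_val = sum(v for _, v in group) / len(group)
        result.append((avg_ts, avg_val))
    return result
```

Averaging smooths the series, which is why short-lived dips and peaks inside a bucket can disappear entirely from the downsampled curve.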
This downsampling isn’t too bad, but it has lost some of the more subtle changes in the shape of the curve. The missing changes are circled in red on the raw data visualization:
Let’s see how the Largest Triangle Three Buckets algorithm does. The query (lttb.sql) is shown below:
FROM BikeShare.status
SELECT untuple(arrayJoin(
    largestTriangleThreeBuckets(50)(
        toUnixTimestamp64Milli(time), docks_available
    )))
WHERE toDate(time) >= '2013-08-29' AND toDate(time) <= '2013-09-01'
  AND station_id = 70
FORMAT CSV
And we can generate the visualization like this:
clickhouse local --path bikeshare.chdb < lttb.sql |
uplot line -d, -w 100 -t "Largest Triangle Three Buckets"
From a visual inspection, this version of the visualization is only missing the following local minima:
ClickHouse provides SQL with many extensions and powerful improvements that make it more friendly for analytical tasks. One example of this ClickHouse superset of SQL is extensive support for arrays. Arrays are well known to users of other programming languages like Python and JavaScript, and are generally useful for modeling and solving a wide range of problems in an elegant and simple way. ClickHouse has over 70 functions for processing arrays, many of them higher-order functions that provide a high level of abstraction, allowing you to express complex operations on arrays in a concise and declarative manner. We proudly announce that this family of array functions now has a new, long-awaited, and most powerful member: arrayFold.
arrayFold is equivalent to the Array.reduce function in JavaScript and is used to fold or reduce the elements in an array from left to right by applying a lambda function to the array elements in a cumulative manner, starting from the leftmost element and accumulating a result as it processes each element. This cumulative process can be thought of as folding the elements of the array together.
The following is a simple example where we use arrayFold for calculating the sum of all elements of the array [10, 20, 30]:
SELECT arrayFold((acc, v) -> (acc + v), [10, 20, 30], 0::UInt64) AS sum
┌─sum─┐
│ 60 │
└─────┘
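Since arrayFold is a classic left fold, we can reproduce the same computation with Python's functools.reduce, its direct analogue. The small tracing helper below is our own addition, purely to make each folding step visible:

```python
from functools import reduce

def step(acc, v):
    # Print each folding step: acc accumulates from left to right
    print(f"acc={acc}, v={v} -> {acc + v}")
    return acc + v

result = reduce(step, [10, 20, 30], 0)
# acc=0, v=10 -> 10
# acc=10, v=20 -> 30
# acc=30, v=30 -> 60
print(result)  # 60, matching the arrayFold query above
```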
Note that we are passing both a lambda function (acc, v) -> (acc + v) and an initial accumulator value 0 in the example call of arrayFold above. The lambda function is first called with acc set to the initial accumulator value 0 and v set to the first (leftmost) array element 10. Next, the lambda function is called with acc set to the result of the previous step and v set to the second array element 20. This process continues, iteratively folding the array elements from left to right until the end of the array is reached, producing the final result, 60.
This diagram visualizes how the + operator from the body of our lambda function is cumulatively applied to the initial accumulator and all array elements from left to right:
We used the example above just as an introduction. We could have used arraySum or arrayReduce('sum') for calculating the sum of all array elements, but arrayFold is far more powerful.