We are super excited to share a trove of amazing features in 23.10.
And we already have a date for the 23.11 release: please register now to join the community call on December 5th at 9:00 AM (PDT) / 6:00 PM (CET).
23 new features. 26 performance optimisations. 60 bug fixes.
A small subset of highlighted features is below… But the release also covers new SHOW MERGES and SHOW SETTINGS commands; new byteSwap, arrayRandomSample, jsonMergePatch, formatQuery, and formatQuerySingleLine functions; argMin and argMax as combinators; a parameterized ALTER command with partitions; the untuple function with better names; enforcing projections; allowing tables without a primary key; and so…much…more.
As always, we send a special welcome to all the new contributors in 23.10! ClickHouse’s popularity is, in large part, due to the efforts of the community that contributes. Seeing that community grow is always humbling.
If you see your name here, please reach out to us…but we will be finding you on Twitter and elsewhere as well.
AN, Aleksa Cukovic, Alexander Nikolaev, Avery Fischer, Daniel Byta, Dorota Szeremeta, Ethan Shea, FFish, Gabriel Archer, Itay Israelov, Jens Hoevenaars, Jihyuk Bok, Joey Wang, Johnny, Joris Clement, Lirikl, Max K, Priyansh Agrawal, Sinan, Srikanth Chekuri, Stas Morozov, Vlad Seliverstov, bhavuk2002, guoxiaolong, huzhicheng, monchickey, pdy, wxybear, yokofly
Largest Triangle Three Buckets is an algorithm for downsampling data to make it easier to visualize. It tries to retain the visual similarity of the initial data while reducing the number of points. In particular, it seems to be very good at retaining local minima and maxima, which are often lost with other downsampling methods.
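To make the idea concrete, here is a simplified Python sketch of the LTTB algorithm. The function and variable names are our own, and this is an illustrative implementation rather than the exact code ClickHouse uses: the first and last points are always kept, the rest are split into buckets, and each bucket contributes the point forming the largest triangle with the previously selected point and the next bucket's average.

```python
def lttb(points, threshold):
    """Downsample a list of (x, y) points to `threshold` points using
    Largest Triangle Three Buckets. Keeps the first and last points."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)

    sampled = [points[0]]                  # always keep the first point
    bucket_size = (n - 2) / (threshold - 2)

    a = 0                                  # index of the last selected point
    for i in range(threshold - 2):
        # Boundaries of the current bucket
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        # Average point of the *next* bucket (the third triangle vertex)
        nxt_start = end
        nxt_end = min(int((i + 2) * bucket_size) + 1, n)
        count = nxt_end - nxt_start
        avg_x = sum(p[0] for p in points[nxt_start:nxt_end]) / count
        avg_y = sum(p[1] for p in points[nxt_start:nxt_end]) / count

        # Pick the point in this bucket that maximizes the triangle area
        # formed with the previous selection and the next bucket's average
        ax, ay = points[a]
        best_area, best_idx = -1.0, start
        for j in range(start, end):
            bx, by = points[j]
            area = abs((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay)) / 2
            if area > best_area:
                best_area, best_idx = area, j
        sampled.append(points[best_idx])
        a = best_idx

    sampled.append(points[-1])             # always keep the last point
    return sampled
```

Because the triangle-area criterion rewards points far from their neighbours, an isolated spike in an otherwise flat series survives the downsampling, which is exactly the property that preserves local minima and maxima.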
We’re going to see how it works with help from the Kaggle SF Bay Area Bike Share dataset, which contains one CSV file that tracks the number of docks available per station on a minute-by-minute basis.
Let’s create a database:
CREATE DATABASE BikeShare;
USE BikeShare;
And then create a table, status, populated by the status.csv file:
CREATE TABLE status ENGINE = MergeTree ORDER BY (station_id, time) AS
FROM file('Bay Area Bikes.zip :: status.csv', CSVWithNames)
SELECT *
SETTINGS schema_inference_make_columns_nullable=0;
SELECT formatReadableQuantity(count(*))
FROM status
┌─formatReadableQuantity(count())─┐
│ 71.98 million │
└─────────────────────────────────┘
Raw data
Let’s first have a look at the raw data for one of the stations over a period of a few days. There are 4,537 points returned by the following query, which is stored in the file raw.sql:
FROM BikeShare.status
SELECT toUnixTimestamp64Milli(time), docks_available
WHERE toDate(time) >= '2013-08-29' AND toDate(time) <= '2013-09-01'
  AND station_id = 70
FORMAT CSV
We can visualize the docks available over time by running the following query:
clickhouse local --path bikeshare.chdb < raw.sql |
uplot line -d, -w 100 -t "Raw Data"
Next, we’re going to see what happens if we reduce the number of points by roughly 10x, which we can do by averaging the points in buckets of 10 minutes. This query will be stored in the file avg.sql and is shown below:
WITH buckets AS (
SELECT
toStartOfInterval(time, INTERVAL 10 minute) AS bucket,
AVG(docks_available) AS average_docks_available,
AVG(toUnixTimestamp64Milli(time)) AS average_bucket_time
FROM BikeShare.status
    WHERE toDate(time) >= '2013-08-29' AND toDate(time) <= '2013-09-01'
      AND station_id = 70
GROUP BY bucket
ORDER BY bucket
)
SELECT average_bucket_time, average_docks_available
FROM buckets
FORMAT CSV
We can generate the visualization like this:
clickhouse local --path bikeshare.chdb < avg.sql |
uplot line -d, -w 100 -t "Average every 10 mins"
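For reference, the bucket-averaging that avg.sql performs can be sketched in a few lines of Python. The names here are illustrative (this is not a ClickHouse API): points are grouped into fixed-width time buckets, and each bucket is reduced to the average of its timestamps and values.

```python
from collections import defaultdict

def average_buckets(points, bucket_ms=10 * 60 * 1000):
    """Group (timestamp_ms, value) points into fixed-width time buckets
    and return one (average timestamp, average value) pair per bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // bucket_ms].append((ts, value))
    result = []
    for key in sorted(buckets):
        group = buckets[key]
        avg_ts = sum(t for t, _ in group) / len(group)
        avg_val = sum(v for _, v in group) / len(group)
        result.append((avg_ts, avg_val))
    return result
```

Averaging smooths the series, which is why short-lived dips and peaks inside a bucket can disappear entirely from the downsampled curve.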
This downsampling isn’t too bad, but it has lost some of the more subtle changes in the shape of the curve. The missing changes are circled in red on the raw data visualization:
Let’s see how the Largest Triangle Three Buckets algorithm does. The query (lttb.sql) is shown below:
FROM BikeShare.status
SELECT untuple(arrayJoin(
    largestTriangleThreeBuckets(50)(
        toUnixTimestamp64Milli(time), docks_available
    )))
WHERE toDate(time) >= '2013-08-29' AND toDate(time) <= '2013-09-01'
  AND station_id = 70
FORMAT CSV
And we can generate the visualization like this:
clickhouse local --path bikeshare.chdb < lttb.sql |
uplot line -d, -w 100 -t "Largest Triangle Three Buckets"
From a visual inspection, this version of the visualization is only missing the following local minima:
ClickHouse provides SQL with many extensions and powerful improvements that make it more friendly for analytical tasks. One example of this ClickHouse superset of SQL is extensive support for arrays. Arrays are well known to users of other programming languages like Python and JavaScript, and are generally useful for modeling and solving a wide range of problems in an elegant and simple way. ClickHouse has over 70 functions for processing arrays, many of them higher-order functions that provide a high level of abstraction, allowing you to express complex operations on arrays in a concise and declarative manner. We proudly announce that this family of array functions now has a new, long-awaited, and most powerful member: arrayFold.
arrayFold is equivalent to the Array.reduce function in JavaScript and is used to fold or reduce the elements in an array from left to right by applying a lambda function to the array elements in a cumulative manner, starting from the leftmost element and accumulating a result as it processes each element. This cumulative process can be thought of as folding the elements of the array together.
The following is a simple example where we use arrayFold for calculating the sum of all elements of the array [10, 20, 30]:
SELECT arrayFold((acc, v) -> (acc + v), [10, 20, 30], 0::UInt64) AS sum
┌─sum─┐
│ 60 │
└─────┘
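Since arrayFold is a classic left fold, we can reproduce the same computation with Python's functools.reduce, its direct analogue. The small tracing helper below is our own addition, purely to make each folding step visible:

```python
from functools import reduce

def step(acc, v):
    # Print each folding step: acc accumulates from left to right
    print(f"acc={acc}, v={v} -> {acc + v}")
    return acc + v

result = reduce(step, [10, 20, 30], 0)
# acc=0, v=10 -> 10
# acc=10, v=20 -> 30
# acc=30, v=30 -> 60
print(result)  # 60, matching the arrayFold query above
```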
Note that we are passing both a lambda function (acc, v) -> (acc + v) and an initial accumulator value 0 in the example call of arrayFold above. The lambda function is first called with acc set to the initial accumulator value 0 and v set to the first (leftmost) array element 10. Next, the lambda function is called with acc set to the result of the previous step and v set to the second array element 20. This process continues, iteratively folding the array elements from left to right until the end of the array is reached, producing the final result, 60.
This diagram visualizes how the + operator from the body of our lambda function is cumulatively applied to the initial accumulator and all array elements from left to right:
We used the example above just as an introduction. We could have used arraySum or arrayReduce('sum') for calculating the sum of all array elements, but arrayFold is far more powerful.