ClickHouse is the fastest and most resource-efficient open-source database for real-time applications and analytics. One of its components, ClickHouse Keeper, is a faster, more resource-efficient, and feature-rich alternative to ZooKeeper. This open-source component provides a highly reliable metadata store, as well as coordination and synchronization mechanisms. It was originally developed for use with ClickHouse when it is deployed as a distributed system, whether in a self-managed setup or a hosted offering like ClickHouse Cloud. However, we believe the broader community can benefit from this project in additional use cases.
In this post, we describe the motivation, advantages, and development of ClickHouse Keeper and preview our next planned improvements. Moreover, we introduce a reusable benchmark suite, which allows us to simulate and benchmark typical ClickHouse Keeper usage patterns easily. Based on this, we present benchmark results highlighting that ClickHouse Keeper uses up to 46 times less memory than ZooKeeper for the same volume of data while maintaining performance close to ZooKeeper.
Modern distributed systems require a shared and reliable information repository and consensus system for coordinating and synchronizing distributed operations. For ClickHouse, ZooKeeper was initially chosen for this role. It was reliable thanks to its wide adoption, provided a simple and powerful API, and offered reasonable performance.
However, not only performance but also resource efficiency and scalability have always been top priorities for ClickHouse. ZooKeeper, being a Java ecosystem project, did not fit very elegantly into our primarily C++ codebase, and as we used it at ever-higher scale, we ran into resource usage and operational challenges. To overcome these shortcomings of ZooKeeper, we built ClickHouse Keeper from scratch, taking into account additional requirements and goals our project needed to address.
ClickHouse Keeper is a drop-in replacement for ZooKeeper, with a fully compatible client protocol and the same data model. Beyond that, it offers the following benefits:
- Easier setup and operation: ClickHouse Keeper is implemented in C++ instead of Java and, therefore, can run embedded in ClickHouse or standalone
- Snapshots and logs consume much less disk space due to better compression
- No limit on the default packet and node data size (it is 1 MB in ZooKeeper)
- No ZXID overflow issue (it forces a restart for every 2B transactions in ZooKeeper)
- Faster recovery after network partitions due to the use of a better distributed consensus protocol
- Additional consistency guarantees: ClickHouse Keeper provides the same consistency guarantees as ZooKeeper – linearizable writes plus strict ordering of operations inside the same session. Additionally, and optionally (via a quorum_reads setting), ClickHouse Keeper provides linearizable reads.
- ClickHouse Keeper is more resource efficient and uses less memory for the same volume of data (we will demonstrate this later in this blog)
The development of ClickHouse Keeper started as an embedded service in the ClickHouse server in February 2021. In the same year, a standalone mode was introduced, and Jepsen tests were added – every 6 hours, we run automated tests with several different workflows and failure scenarios to validate the correctness of the consensus mechanism.
At the time of writing this blog, ClickHouse Keeper has been production-ready for more than one and a half years and has been deployed at scale in our own ClickHouse Cloud since its first private preview launch in May 2022.
In the rest of the blog, we sometimes refer to ClickHouse Keeper as simply “Keeper,” as we often call it internally.
Generally, anything requiring consistency between multiple ClickHouse servers relies on Keeper:
- Keeper provides the coordination system for data replication in self-managed shared-nothing ClickHouse clusters
- Automatic insert deduplication for replicated tables of the MergeTree engine family is based on block hash sums stored in Keeper
- Keeper provides consensus for part names (based on sequential block numbers) and for assigning part merges and mutations to specific cluster nodes
- Keeper is used under the hood of the KeeperMap table engine, which allows you to use Keeper as a consistent key-value store with linearizable writes and sequentially consistent reads (see the sketch after this list)
  - Read about an application that uses this to implement a task scheduling queue on top of ClickHouse
  - The Kafka Connect Sink uses this table engine as a reliable state store for implementing exactly-once delivery guarantees
- Keeper keeps track of consumed files in the S3Queue table engine
- Replicated Database engine stores all metadata in Keeper
- Keeper is used for coordinating Backups with the ON CLUSTER clause
- User-defined functions can be stored in Keeper
- Access control information can be stored in Keeper
- Keeper is used as a shared central store for all metadata in ClickHouse Cloud
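To illustrate the KeeperMap usage mentioned in the list above, here is a minimal sketch; the table name, columns, and Keeper path are hypothetical, and the engine additionally requires the keeper_map_path_prefix server setting to be configured:

```sql
-- Minimal KeeperMap sketch: a small, consistent key-value store backed by Keeper.
-- Table name, columns, and path are hypothetical; the engine can only be used if
-- keeper_map_path_prefix is defined in the server configuration.
CREATE TABLE task_queue
(
    task_id UInt64,
    payload String
)
ENGINE = KeeperMap('/task_queue')
PRIMARY KEY task_id;

-- Writes are linearizable; reads are sequentially consistent.
INSERT INTO task_queue VALUES (1, 'process-file-0001');

SELECT payload FROM task_queue WHERE task_id = 1;
```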
In the following sections, in order to observe (and later model in a benchmark) some of ClickHouse Cloud’s interaction with Keeper, we load a month of data from the WikiStat data set into a table in a ClickHouse Cloud service with 3 nodes. Each node has 30 CPU cores and 120 GB RAM. The service uses its own dedicated ClickHouse Keeper service consisting of 3 servers, with 3 CPU cores and 2 GB RAM per Keeper server.
The following diagram illustrates this data-loading scenario:
Via a data load query, we load ~4.64 billion rows from ~740 compressed files (each file represents one specific hour of one specific day) in parallel with all three ClickHouse servers in ~100 seconds. The peak main memory usage on a single ClickHouse server was ~107 GB:
0 rows in set. Elapsed: 101.208 sec. Processed 4.64 billion rows, 40.58 GB (45.86 million rows/s., 400.93 MB/s.)
Peak memory usage: 107.75 GiB.
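The statement behind such a load looks roughly like the following sketch; the target table name, cluster name, and URL pattern are illustrative assumptions rather than the exact query we ran:

```sql
-- Illustrative sketch of the parallel data load; target table, cluster name,
-- and URL pattern are assumptions, not the exact statement behind the numbers above.
INSERT INTO wikistat
SELECT *
FROM s3Cluster(
    'default',
    'https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/partitioned/wikistat*.native.zst');
```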
For storing the data, the 3 ClickHouse servers together created 240 initial parts in object storage. The average number of rows per initial part was ~19 million, the average size was ~100 MiB, and the total number of inserted rows was 4.64 billion:
┌─parts──┬─rows_avg──────┬─size_avg───┬─rows_total───┐
│ 240.00 │ 19.34 million │ 108.89 MiB │ 4.64 billion │
└────────┴───────────────┴────────────┴──────────────┘
Because our data load query utilizes the s3Cluster table function, the creation of the initial parts is evenly distributed over the 3 ClickHouse servers of our ClickHouse Cloud service:
┌─n─┬─parts─┬─rows_total───┐
│ 1 │ 86.00 │ 1.61 billion │
│ 2 │ 76.00 │ 1.52 billion │
│ 3 │ 78.00 │ 1.51 billion │
└───┴───────┴──────────────┘
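Statistics like the ones in the two tables above can be derived from the part log; a rough sketch, assuming the target table is called wikistat and the cluster is named default:

```sql
-- Rough sketch of deriving initial-part statistics per node from the part log;
-- the table name 'wikistat' and cluster name 'default' are assumptions.
SELECT
    hostName() AS node,
    count() AS parts,
    formatReadableQuantity(avg(rows)) AS rows_avg,
    formatReadableSize(avg(size_in_bytes)) AS size_avg,
    formatReadableQuantity(sum(rows)) AS rows_total
FROM clusterAllReplicas('default', system.part_log)
WHERE table = 'wikistat' AND event_type = 'NewPart'
GROUP BY node;
```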
During the data loading, ClickHouse executed 1706 part merges in the background:
┌─merges─┐
│   1706 │
└────────┘
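The merge count can be pulled from the same log; again a sketch under the same naming assumptions:

```sql
-- Count of background part merges, from the same part log
-- (same assumptions about table and cluster names as above).
SELECT count() AS merges
FROM clusterAllReplicas('default', system.part_log)
WHERE table = 'wikistat' AND event_type = 'MergeParts';
```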
ClickHouse Cloud completely separates the storage of data and metadata from the servers. All data parts are stored in shared object storage, and all metadata is stored in Keeper. When a ClickHouse server has written a new part to object storage (see ② above) or merged some parts into a new larger part (see ③ above), it uses a multi-write transaction request to update the metadata about the new part in Keeper. This information includes the name of the part, which files belong to the part, and where the blobs corresponding to those files reside in object storage. Each server has a local cache with a subset of the metadata and is automatically informed about data changes by a Keeper instance through a watch-based subscription mechanism.
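For a glimpse of the metadata Keeper stores, the system.zookeeper table exposes Keeper's contents to SQL; the path below is purely illustrative, as the actual layout depends on the table and cloud configuration:

```sql
-- Peek at the znodes Keeper stores for a replicated table; the path is
-- illustrative only, the real layout depends on the table definition.
SELECT name, dataLength, numChildren
FROM system.zookeeper
WHERE path = '/clickhouse/tables/wikistat'
ORDER BY name;
```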
For our aforementioned initial part creations and background part merges, a total of ~18k Keeper requests were executed. This includes ~12k multi-write transaction requests (containing only write sub-requests). All other requests are a mix of read and write requests. Additionally, the ClickHouse servers received ~800 watch notifications from Keeper:
total_requests: 17705
multi_requests: 11642
watch_notifications: 822
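Counts like these can be reconstructed from the client-side Keeper request log; a rough sketch, with the caveat that the exact filter conditions behind the numbers above may differ:

```sql
-- Rough sketch of counting Keeper traffic from the client-side request log
-- (system.zookeeper_log); the exact filters behind the numbers above may differ.
SELECT
    countIf(type = 'Request' AND request_idx = 0)                       AS total_requests,
    countIf(type = 'Request' AND request_idx = 0 AND op_num = 'Multi')  AS multi_requests,
    countIf(type = 'Response' AND watch_type IS NOT NULL)               AS watch_notifications
FROM clusterAllReplicas('default', system.zookeeper_log)
WHERE event_date = today();
```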
We can see that these requests were sent, and the watch notifications received, quite evenly across all three ClickHouse nodes:
┌─n─┬─total_requests─┬─multi_requests─┬─watch_notifications─┐
│ 1 │           5741 │           3671 │                 278 │
│ 2 │           5593 │           3685 │                 269 │
│ 3 │           6371 │           4286 │                 275 │
└───┴────────────────┴────────────────┴─────────────────────┘
The following two charts visualize these Keeper requests during the data-loading process:
We can see that roughly two-thirds of the Keeper requests are multi-write transactions.
Note that the amount of Keeper requests can vary based on the ClickHouse cluster size, ingest settings, and data size. We briefly demonstrate how these three factors influence the number of generated Keeper requests.
If we load the data with 10 instead of 3 servers in parallel, we ingest the data more than 3 times faster (with the SharedMergeTree):
0 rows in set. Elapsed: 33.634 sec. Processed 4.64 billion rows, 40.58 GB (138.01 million rows/s., 1.21 GB/s.)
Peak memory usage: 57.09 GiB.
The higher number of servers generates more than 3 times as many Keeper requests:
total_requests: 60925
multi_requests: 41767
watch_notifications: 3468
For our original data load with 3 ClickHouse servers, we configured a maximum size of ~25 million rows per initial part to speed up the ingest at the expense of higher memory usage. If, instead, we run the same data load with the default value of ~1 million rows per initial part, the data load is slower but uses ~9 times less main memory per ClickHouse server:
0 rows in set. Elapsed: 121.421 sec. Processed 4.64 billion rows, 40.58 GB (38.23 million rows/s., 334.19 MB/s.)
Peak memory usage: 12.02 GiB.
And ~4 thousand initial parts are created instead of 240.
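One way to influence the size of initial parts is via the block-size settings applied to the insert; the sketch below illustrates the idea, though the exact settings and values used for the runs above are not reproduced here, and the table, cluster, and URL are the same assumptions as before:

```sql
-- Hedged sketch: forming larger blocks on insert leads to fewer, larger initial
-- parts. The exact settings and values used for the runs discussed above are not
-- reproduced here; table, cluster, and URL are illustrative assumptions.
INSERT INTO wikistat
SELECT *
FROM s3Cluster(
    'default',
    'https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/partitioned/wikistat*.native.zst')
SETTINGS
    min_insert_block_size_rows = 25000000,
    min_insert_block_size_bytes = 0;  -- disable the byte-based threshold for illustration
```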