At Zerodha, we run a multitude of internal and public-facing services that generate large volumes of logs. While developers use these logs to debug or troubleshoot incidents, some services also emit logs that must be persisted for prolonged periods to comply with regulatory requirements. In this post, I will cover our experiences with the ELK stack, why it didn’t fit our needs, and our migration to ClickHouse.
Why ELK wasn’t the right fit for us
In 2018, we adopted the ELK stack as our de facto stack for storing application logs. Using Filebeat, we shipped logs from EC2 instances to a central Logstash server. Logstash served as the aggregation layer, where we applied multiple filters to add metadata, scrub sensitive data, parse the logs, and write them to various indices in Elasticsearch. Kibana was the default query layer for writing queries and fetching results.
At the time, we handled less than a tenth of our current traffic, so a single-node Elasticsearch setup worked well for us. Although it still required a fairly large EC2 instance because of the JVM’s high memory requirements, we hosted it without many challenges. However, as our traffic volumes and user base grew sharply after 2020, the ELK stack began to struggle with the rising load.
A breakdown of our pain points with the ELK stack:
- The Logstash pipeline experienced frequent slowdowns during peak traffic, resulting in slower ingestion into Elasticsearch.
- Storage costs skyrocketed because every incoming field was indexed for searching, inflating the storage required for each log event to at least 3-4x its original size.
- The user experience for searching logs was subpar. The syntax for querying logs using Lucene or KQL was complex and not intuitive.
- Elasticsearch was slow for aggregate queries spanning long time ranges.
Choosing the right stack
As our traffic volumes continued to rise, it became apparent that we needed to migrate away from the ELK stack. We required a new stack that met the following criteria:
- Easy to operate: Log management is a complex issue, and the new stack should not exacerbate the problem.
- Cost-effective: Given that we are required to retain logs for multiple years, cheaper storage costs are desirable.
- Scalable: The new stack should scale easily to handle our increasing traffic volumes.
- Robust RBAC system: The new system must support creating ACL groups, allowing developers and business teams to query the underlying storage with varying levels of permission controls.
Log Events
To give you some context, we need to collect a variety of logs from our system, which runs on a mix of standalone EC2 instances and Nomad clusters.
Here’s a breakdown of the different types of logs and how we manage them:
App logs: For our application logs, we’ve developed a fixed schema that we use internally. Initially, we wrote a small wrapper around onelog to make its output conform to our schema. However, we found that reading JSON logs while developing locally was tedious and not human-friendly. That’s when we discovered logfmt, which struck the right balance between human readability and machine parsability. To emit these logs, and to let us gradually migrate our applications over, we built and open-sourced a highly performant, zero-allocation logging library in Go.
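As a rough illustration of the readability difference between JSON and logfmt, the following Python sketch renders the same event both ways. The field names (`level`, `msg`, `latency_ms`) and the `to_logfmt` helper are hypothetical, chosen only for this example; they are neither the internal schema nor the API of the Go library mentioned above, and a production encoder handles more edge cases.

```python
import json


def to_logfmt(fields: dict) -> str:
    """Render a flat dict as a logfmt line: space-separated key=value pairs."""
    parts = []
    for key, value in fields.items():
        text = str(value)
        # Quote values that are empty or contain spaces, quotes, or '='
        # so that the line can still be split unambiguously.
        if text == "" or any(ch in text for ch in ' "='):
            text = '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


# Hypothetical event, for illustration only.
event = {"level": "info", "msg": "order placed", "latency_ms": 12}

print(json.dumps(event))  # {"level": "info", "msg": "order placed", "latency_ms": 12}
print(to_logfmt(event))   # level=info msg="order placed" latency_ms=12
```

The logfmt line drops the braces and most of the quoting, which is what makes it easier to scan in a terminal while remaining simple to parse.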
HTTP logs: We rely heavily on HAProxy and NGINX as the proxy layer between ALBs and our apps. To capture the required fields for our desired HTTP schema, we’ve written custom log templates for both proxies.
SMTP logs: For a long time, we used self-hosted Postal SMTP servers for our emailing requirements. However, Postal was quite slow at our scale, so we started looking for faster alternatives. We found Haraka, which gave us a 10x improvement in resource usage and throughput, thanks to its architecture and the absence of an external DB bottleneck. However, Haraka doesn’t provide good logging out of the box. To address this, we built and open-sourced a Haraka plugin that emits JSON logs containing details such as the SMTP response, recipient, and subject.
The hunt for a good solution
Up until 2021, we dealt with ELK’s scaling issues by throwing more hardware at the problem. We also did extensive R&D on alternative options.
Here’s a rundown of what we found:
Grafana Loki – first impressions
Back in late 2019, we took a look at Grafana Loki 0.4. At the time, it was a relatively new project with limited storage options and a complicated microservice-oriented deployment setup. We ultimately decided against it due to the complexity of self-hosting it. But we kept an eye on it, hoping things would improve in the future.
Loki’s evolution and performance testing
Fast forward to 2021, and Loki had come a long way. Version 2.4 had simplified its deployment setup and addressed many past pain points. We were excited to see its new features, such as out-of-order writes in the index and BoltDB-based index storage for the filesystem. However, when we benchmarked its performance by importing a month’s worth of logs from our busiest application, Kite, the results were disappointing. Query times were painfully slow, and most queries timed out. Additionally, we found that Loki lacked good RBAC support in its open-source version.
Exploring cLoki and ClickHouse
In our search for a better solution, we discovered cLoki, a project that implemented Loki’s API with ClickHouse for storage. While it initially looked promising, we quickly realized that it didn’t use native ClickHouse compression codecs and didn’t provide the flexibility to design a custom schema for our app logs. Query performance for non-trivial queries involving partial matching was also unimpressive. Moreover, we couldn’t configure partitioning per table or set parameters such as TTLs and custom ordering keys or indices.
Welcome to Loj
Loj is the internal name we came up with while deciding on a name for our logging project. It was originally meant to be pronounced as log where g is pronounced as jee, but all of us pronounce it as loj. No one remembers why exactly it was named so, but clearly, we excel at naming things creatively.
While the DevOps team was exploring different stacks for storing logs, the back office (Console) team started exploring ClickHouse for “warehouse” storage. This warehouse stores the holdings and breakdown of each trade done by every user on our platforms. That experience gave us enough confidence to use it for logging as well. We’d also been keeping an eye on it since Cloudflare posted about their adoption of ClickHouse for storing HTTP analytics logs. Since ClickHouse is an excellent choice for large OLAP workloads such as this warehouse and can handle large volumes of immutable data, we decided to try using it for storing our logs.
Loj is essentially a mix of 3 different components for our logging pipeline:
- Vector: For shipping logs to central storage.
- ClickHouse: Storage layer for all the logs.
- Metabase: Query/UI layer for visualizing/performing queries.
Why ClickHouse is an excellent fit for logging:
ClickHouse, a column-oriented DBMS, is well suited for OLAP workloads. Logging workloads share many attributes with OLAP workloads: they are read-heavy, they insert data in large batches with rare mutations, and the data is mostly immutable.
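To make the column-oriented idea concrete, here is a minimal Python sketch that answers the same question ("how many error events?") against a row layout and a column layout. The events and field names are hypothetical, and plain Python lists are only a stand-in for how a columnar database lays data out on disk.

```python
# Hypothetical log events, for illustration only.
rows = [
    {"timestamp": 1, "level": "info", "status": 200, "msg": "ok"},
    {"timestamp": 2, "level": "error", "status": 500, "msg": "upstream timeout"},
    {"timestamp": 3, "level": "info", "status": 200, "msg": "ok"},
]

# Row-oriented: each event is stored together, so the scan walks
# whole rows even though only one field is needed.
errors_row_store = sum(1 for row in rows if row["level"] == "error")

# Column-oriented: each field is stored as its own array, so the same
# query only has to scan the "level" column.
columns = {name: [row[name] for row in rows] for name in rows[0]}
errors_column_store = columns["level"].count("error")

print(errors_row_store, errors_column_store)  # 1 1
```

Both layouts return the same answer; the difference is how much data has to be touched to get it, which matters when an aggregate query reads a few fields across billions of log lines.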
ClickHouse’s compression codecs:
ClickHouse supports several compression codecs out of the box, including LZ4 and ZSTD, applied at the block level. Compressed data means fewer bytes to read from disk, which improves query performance. ClickHouse also supports dictionary encoding, which is useful for low-cardinality columns.
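The dictionary approach for low-cardinality columns can be sketched in a few lines of Python. This is a conceptual illustration only: the `dictionary_encode` and `dictionary_decode` helpers and the sample `levels` column are hypothetical, and the sketch does not reflect the database’s actual on-disk format.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []  # unique values, in first-seen order
    index_of = {}    # value -> position in `dictionary`
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical low-cardinality column: a log level per event.
levels = ["info", "info", "error", "info", "warn", "error"]
dictionary, codes = dictionary_encode(levels)

print(dictionary)  # ['info', 'error', 'warn']
print(codes)       # [0, 0, 1, 0, 2, 1]
```

A column such as a log level or a service name has only a handful of distinct values, so storing small integers plus one short dictionary takes far less space than repeating the full strings on every row.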
Scalability with ClickHouse:
ClickHouse is relatively easy to scale and comes bundled with clickhouse-keeper, which makes it possible to set up a distributed ClickHouse cluster with multiple nodes that replicate data between each other. At Zerodha, we use a distributed ClickHouse cluster setup for storing all our logs. The cluster has two shards and two active replicas for each shard, for a total of four nodes. If you want to learn how to configure clickhouse-keeper, you can refer to this post.
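The two-shard, two-replica layout can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the node names are hypothetical, and hashing a key with CRC32 is an arbitrary stand-in, since the real routing depends on the sharding key configured for the cluster.

```python
import zlib

# Hypothetical node names; the layout mirrors two shards with two replicas each.
CLUSTER = {
    0: ["node-1a", "node-1b"],
    1: ["node-2a", "node-2b"],
}


def shard_for(key: str) -> int:
    """Pick a shard from a stable hash of the key (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % len(CLUSTER)


def replicas_for(key: str) -> list:
    """Every replica in the chosen shard holds a copy of the row."""
    return CLUSTER[shard_for(key)]


total_nodes = sum(len(replicas) for replicas in CLUSTER.values())
print(total_nodes)           # 4
print(replicas_for("kite"))  # the two nodes of whichever shard the key maps to
```

Sharding spreads the data and the query load across nodes, while the replicas within a shard keep the data available if one node goes down.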
Schema for storing logs:
Designing a proper schema is a crucial step in ensuring optimal query performance in ClickHouse. While most of our apps follow a standard schema, we create tables with a specific schema for exceptional cases such as storing SMTP logs and AWS Pinpoint logs.
Here’s an example of our app schema: