While centralized logging previously existed at Twitter, it was limited by low ingestion capacity and limited query capabilities, which resulted in poor adoption and failed to deliver the value we hoped for. To address this, we adopted Splunk Enterprise and migrated centralized logging to it. We now ingest four times more logging data and have a better query engine and better user adoption. Our migration to Splunk Enterprise has given us a much stronger logging platform overall, but the process was not without its challenges and learnings, which we’ll share in greater detail in this blog.
Legacy logging at Twitter
Before we adopted Splunk Enterprise at Twitter, our central logging platform was a home-grown system called Loglens. This initial system addressed frustrations around the ephemerality of logs living inside containers on our compute platform, and the difficulty of browsing log files across many instances of many services while investigating an ongoing incident. This logging system was designed with the following goals:
- Ease of onboarding
- Low cost
- Very little time investment from developers for ongoing maintenance and improvements
Loglens focused on ingesting logs from the services that make up Twitter. Given the cost and time constraints it operated under, it was a fairly successful solution to the problems set before it.
Unsurprisingly, a logging system with such limited resource investment ultimately fell short of users’ expectations, and it saw suboptimal adoption by developers at Twitter. More information about Loglens can be found in our previous post. Below is a simplified diagram of the Loglens ingestion pipeline: logs were written to local Scribe daemons, forwarded on to Kafka, and then ingested into the Loglens indexing system.
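To make that flow concrete, here is a minimal Scala sketch of the hop onto Kafka, using the standard Kafka producer client. The broker address, topic name, and payload are hypothetical, and in the real pipeline the local Scribe daemon (not application code) batched log lines per category and forwarded them; this only illustrates the shape of the handoff.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogRelaySketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Hypothetical broker address and serializers for a plain-text log event.
    props.put("bootstrap.servers", "kafka.local:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Publish one illustrative event onto a shared log topic, keyed by service name.
    val record = new ProducerRecord[String, String](
      "service-logs", "my-service", """{"level":"INFO","msg":"request handled"}""")
    producer.send(record)
    producer.close()
  }
}
```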
While running Loglens as our logging backend, we ingested around 600K events per second per datacenter. However, only around 10% of the logs submitted were indexed; the remaining 90% were discarded by the rate limiter.
Transitioning to Splunk Enterprise at Twitter
Given the title of this blog post, it’s not exactly a spoiler to say we replaced Loglens with Splunk Enterprise. Our adoption of Splunk Enterprise grew out of an effort to find a central logging system that could handle a wider set of use cases, including system logs from the hundreds of thousands of servers in the Twitter fleet and device logs from networking equipment. We ultimately chose Splunk because it can scale to meet the capacity requirements and offers flexible tooling that can satisfy the majority of our users’ log analysis needs.
Much of our transition to Splunk Enterprise was straightforward. Because system logs and network device logs were new categories for us to ingest, we were able to start fresh: we installed the Splunk Universal Forwarder on every server in the fleet, including some new dedicated servers running rsyslog to relay logs from network equipment to the universal forwarder.
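As a rough illustration of that relay setup, the snippet below sketches a dedicated relay host: rsyslog accepts syslog from network devices and writes it to a local spool file, and the co-located universal forwarder monitors that file. The paths, port, and index name are illustrative assumptions, not our production configuration.

```
# /etc/rsyslog.d/network-relay.conf on a dedicated relay host (illustrative)
# Accept syslog from network devices over UDP 514 and spool it to a local file.
module(load="imudp")
input(type="imudp" port="514")
*.* /var/spool/network-syslog/devices.log

# inputs.conf for the local Splunk Universal Forwarder: monitor the spool file.
[monitor:///var/spool/network-syslog/devices.log]
sourcetype = syslog
index = network
```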
The most complicated part of the transition was migrating the existing application logs from Loglens to Splunk Enterprise. Luckily, thanks to the loosely coupled design of the Loglens pipeline, most of our team’s effort went into the relatively simple task of creating a new service that subscribes to the Kafka topic already in use for Loglens and forwards those logs on to Splunk Enterprise.
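For a sense of what that new service involves, here is a minimal Scala sketch of a consumer that reads from a Kafka topic and forwards each event to Splunk’s HTTP Event Collector. The broker address, topic name, consumer group, HEC endpoint, and token handling are assumptions for illustration; the submission mechanism isn’t specified here, and a production forwarder would batch events, retry failures, and manage offsets far more carefully.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object LogForwarderSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka.local:9092") // hypothetical broker address
    props.put("group.id", "log-forwarder")             // forwarder's own consumer group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("service-logs")) // topic already used for logs

    val http = HttpClient.newHttpClient()
    val hecToken = sys.env.getOrElse("SPLUNK_HEC_TOKEN", "") // hypothetical token source

    while (true) {
      val records = consumer.poll(Duration.ofSeconds(1))
      for (record <- records.asScala) {
        // Wrap each log event in the HEC envelope and POST it to Splunk.
        val body = s"""{"event": ${record.value()}, "sourcetype": "_json"}"""
        val request = HttpRequest.newBuilder()
          .uri(URI.create("https://splunk-hec.local:8088/services/collector/event"))
          .header("Authorization", s"Splunk $hecToken")
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build()
        http.send(request, HttpResponse.BodyHandlers.ofString())
        // A real forwarder would batch, retry on failure, and only advance offsets
        // once Splunk acknowledges the events.
      }
    }
  }
}
```

Because the forwarder is simply another consumer group on the existing topic, it can run without any change to the services producing logs.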
Perhaps a little unimaginatively, we named the new service the Application Log Forwarder (ALF). Our main goals when designing ALF were as follows:
- ALF should be resilient. That is, event submission should be resilient when faced with intermittent errors.