Twitter supposedly lost around 80% of its workforce. Whatever the real number is, there are whole teams without any engineers on them now. Yet the website goes on and the tweets keep coming. This left a lot of people wondering what exactly all those engineers were doing, and made it seem like it was all just bloat. I’d like to explain my little corner of Twitter (though it wasn’t so little) and some of the work that went on to keep this thing running.
For five years I was a Site Reliability Engineer (SRE) at Twitter. For four of those years I was the sole SRE for the Cache team. There were a few before me, and a bunch of people came and went on the team I worked with. But for four years I was the one responsible for automation, reliability and operations in the team. I designed and implemented most of the tools that are keeping it running, so I think I’m qualified to talk about it. (There might be only one or two other people who are.)
A cache can be used to make things faster or to take requests off something that is more expensive to run. If you have a server that takes 1 second to respond, but it’s the same response every time, you can store that response in a cache server where it can be served in milliseconds. Or, if you have a cluster of servers where serving 1000 requests a second might cost $1000, you can instead store the responses in a cache and serve them from there. Then you would have a small cluster for $100 and a cheap, large cluster of cache servers for maybe another $100. The numbers are just examples to illustrate the point.
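To make the idea concrete, here is a minimal sketch in Python of a cache sitting in front of a slow server. It is only an illustration: the names `SlowBackend` and `CachedBackend` are made up for this example, the "cache" is just an in-memory dictionary, and none of it reflects how Twitter's cache servers were actually built.

```python
import time


class SlowBackend:
    """Hypothetical stand-in for an expensive server (assumed for illustration)."""

    def __init__(self, delay_seconds=1.0):
        self.delay_seconds = delay_seconds
        self.calls = 0  # how many requests actually reached the backend

    def fetch(self, key):
        self.calls += 1
        time.sleep(self.delay_seconds)  # simulate the slow response
        return f"response for {key}"


class CachedBackend:
    """Checks the cache first and only falls through to the backend on a miss."""

    def __init__(self, backend):
        self.backend = backend
        self.store = {}  # in-memory dict standing in for a cache server

    def fetch(self, key):
        if key in self.store:
            return self.store[key]  # cache hit: no backend call
        value = self.backend.fetch(key)  # cache miss: pay the slow cost once
        self.store[key] = value
        return value


if __name__ == "__main__":
    backend = SlowBackend(delay_seconds=1.0)
    cached = CachedBackend(backend)
    for _ in range(3):
        start = time.perf_counter()
        cached.fetch("timeline:123")
        print(f"took {time.perf_counter() - start:.3f}s, backend calls so far: {backend.calls}")
```

Only the first request pays the one-second cost; the repeats are answered from the dictionary and never reach the backend. A real cache also has to deal with expiry, eviction and invalidation, which this sketch leaves out.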
The caches took on most of the traffic the site saw. Tweets, all of the timelines, direct messages, advertisements, authentication: all were served out of the Cache team’s servers. If something went wrong with Cache, you as a user would know; the problems would be visible.
When I joined the team, the first project I had was to swap old machines that were being retired for new machines. There were no tools or automation to do this; I was given a spreadsheet with server names. I am happy to say operations on that team are not like that anymore!
T