Summary
In recent years, cellular architectures have become increasingly popular for large online services as a way to increase redundancy and limit the blast radius of site failures. In pursuit of these goals, we have migrated the most critical user-facing services at Slack from a monolithic to a cell-based architecture over the last 1.5 years. In this series of blog posts, we’ll discuss our reasons for embarking on this massive migration, illustrate the design of our cellular topology along with the engineering trade-offs we made along the way, and talk about our strategies for successfully shipping deep changes across many connected services.
Background: the incident

At Slack, we conduct an incident review after each notable service outage. Below is an excerpt from our internal report summarizing one such incident and our findings:
At 11:45am PDT on 2021-06-30, our cloud provider experienced a network disruption in one of several availability zones in our U.S. East Coast region, where the majority of Slack is hosted. A network link that connects one availability zone with several other availability zones containing Slack servers experienced intermittent faults, causing slowness and degraded connections between Slack servers, and degrading service for Slack customers.
At 12:33pm PDT on 2021-06-30, the network link was automatically removed from service by our cloud provider, restoring full service to Slack customers. After a series of automated checks by our cloud provider, the network link entered service again.
At 5:22pm PDT on 2021-06-30, the same network link experienced the same intermittent faults. At 5:31pm PDT on 2021-06-30, the cloud provider permanently removed the network link from service, restoring full service to our customers.
At first glance, this appears to be pretty unremarkable; a piece of physical hardware upon which we were reliant failed, so we served some errors until it was removed from service. However, as we went through the reflective process of incident review, we were led to wonder why, in fact, this outage was visible to our users at all.
Slack operates a global, multi-regional edge network, but most of our core computational infrastructure resides in multiple Availability Zones within a single region, us-east-1. Availability Zones (AZs) are isolated datacenters within a single region; in addition to the physical isolation they offer, components of cloud services upon which we rely (virtualization, storage, networking, etc.) are blast-radius limited such that they should not fail simultaneously across multiple AZs. This enables builders of services hosted in the cloud (such as Slack) to architect services in such a way that the availability of the entire service in a region is greater than the availability of any one underlying AZ. So to restate the question above — why didn’t this strategy work out for us on June 30? Why did one failed AZ result in user-visible errors?
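To make that availability claim concrete, here is a minimal back-of-the-envelope sketch. It assumes AZ failures are independent and that any single healthy AZ can serve a request; both are simplifications, and the 99.9% figure is purely illustrative.

```python
# A minimal sketch of the multi-AZ availability argument, assuming independent
# AZ failures and that any one healthy AZ can serve traffic (illustrative only).
def region_availability(az_availability: float, num_azs: int) -> float:
    """Probability that at least one AZ is healthy."""
    return 1 - (1 - az_availability) ** num_azs

# Example: three AZs, each available 99.9% of the time.
print(region_availability(0.999, 1))  # 0.999
print(region_availability(0.999, 3))  # ~0.999999999 -- far better than any single AZ
```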
As it turns out, detecting failure in distributed systems is a hard problem. A single Slack API request from a user (for example, loading messages in a channel) may fan out into hundreds of RPCs to service backends, each of which must complete to return a correct response to the user. Our service frontends are continuously attempting to detect and exclude failed backends, but we’ve got to record some failures before we can exclude a backend from service.
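To see why that fan-out makes even a small per-backend error rate visible to users, here is a rough sketch; the fan-out of 100 and the 99.9% per-RPC success rate are illustrative assumptions, not measured figures from our fleet.

```python
# A rough sketch of why wide fan-out amplifies partial failures, assuming
# independent RPCs and an illustrative per-RPC success rate -- not a model
# of Slack's actual traffic.
def request_success_rate(per_rpc_success: float, fan_out: int) -> float:
    """Probability that every RPC in the fan-out succeeds."""
    return per_rpc_success ** fan_out

# A fleet that looks "99.9% healthy" per RPC still fails nearly 10% of
# requests once each request depends on 100 of those RPCs completing.
print(request_success_rate(0.999, 1))    # 0.999
print(request_success_rate(0.999, 100))  # ~0.905
```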