In our previous discussion, we covered the initial five lessons from the eleven lessons shared by Google’s Site Reliability Engineering (SRE) team, reflecting on their two decades of expertise.
Today, we will explore the remaining six lessons from their post. Our aim is to expand upon each lesson, offering our insights and perspectives to enhance understanding.
Additionally, we aim to connect theory with practice, offering a more grounded approach to addressing SRE challenges. We will demonstrate how these lessons can be applied to maintain the health of your systems, using real-world examples to deepen your understanding of these principles and help you improve how you ensure uptime and operational efficiency.
Communication Backup
It’s crucial to have multiple reliable channels of communication, especially during incidents and outages. This was evident when AWS’s us-east-1 region experienced downtime, affecting Slack and, with it, the many companies that rely on Slack as their primary chat platform. In such scenarios, we face not only the outage itself but also the inability to communicate as a team to address or resolve the issue effectively.
COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR THOSE BACKUP CHANNELS!!!
Here are some tips on communication backups:
- Have both in-band (email, chat) and out-of-band (SMS, phone) communication channels set up for critical teams. Don’t rely solely on one channel.
- Ensure contact information is kept up-to-date across all systems. Stale contacts amplify chaos.
- Document and test backup contacts and escalation procedures. Know who to reach if the first contact doesn’t respond.
- Geo-distribute and fail-over communication systems so there is no single point of regional failure.
- Regularly test backup communication channels to ensure they are functioning, not forgotten. Rotate testing.
- Audit access permissions and multi-factor authentication to avoid getting locked out of channels.
- Support communication across multiple functions (technical, leadership, PR, customer support) and coordinate how information flows between them.
- Have a common lexicon for major incidents so terminology is consistent. Helps reduce confusion.
- Quickly establish incident chat channels but also document decisions in durable ticket systems.
Redundant comms are indispensable for coordination during chaos. Invest in multiple channels, rigorous testing, and distribution to keep conversations flowing when systems fail.
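As one concrete way to act on the "test your backups" tips above, here is a minimal Python sketch of a notifier that tries a primary chat webhook and falls back to a secondary channel, reporting which path actually delivered the message. The webhook URLs and the notify_oncall helper are hypothetical placeholders, not any particular product's API.

```python
import requests

# Hypothetical endpoints; in practice these would come from a secrets store.
PRIMARY_WEBHOOK = "https://chat.example.com/hooks/incident-room"
BACKUP_WEBHOOK = "https://backup-chat.example.com/hooks/incident-room"


def notify_oncall(message: str, timeout_s: float = 5.0) -> str:
    """Send an incident notification, falling back to the backup channel.

    Returns the name of the channel that accepted the message, so the
    caller can record which path was actually exercised.
    """
    for name, url in (("primary", PRIMARY_WEBHOOK), ("backup", BACKUP_WEBHOOK)):
        try:
            resp = requests.post(url, json={"text": message}, timeout=timeout_s)
            resp.raise_for_status()
            return name
        except requests.RequestException:
            # Swallow the error and keep going; trying the next channel
            # is the whole point of having a backup.
            continue
    raise RuntimeError("All chat channels failed; escalate via SMS/phone.")


if __name__ == "__main__":
    # Run this on a schedule so the backup path is exercised, not forgotten.
    used = notify_oncall("Weekly backup-channel test: please ack in thread.")
    print(f"Delivered via {used} channel")
```

Run from a scheduled job, a check like this exercises the backup path regularly and can alert whenever only the fallback succeeds, which is exactly when stale contacts and forgotten channels get caught.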
Degradation Mode In Production
If your customers or systems are used to a specific level of performance, they come to rely on it, and it becomes a hidden service level objective, distinct from the SLO you have actually promised them. This is known as Hyrum’s Law:
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
Therefore, it’s important to operate not only in various performance modes in production but also to run in a degraded mode when possible. Doing so helps you understand the limits of your systems and challenges your assumptions about potential outcomes under different conditions.
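To make this concrete, below is a minimal Python sketch of one common pattern for a deliberate degraded mode: a feature flag (an approach the next list also mentions) that swaps an expensive personalized path for cheap, pre-computed results, plus a fallback when the dependency fails outright. All names here (DEGRADED_MODE, personalized_results, and so on) are hypothetical illustrations, not anything from Google's post.

```python
import time

# Illustrative in-process flag; a real service would read this from a
# feature-flag or config system that can be flipped at runtime.
DEGRADED_MODE = False

# Pre-computed, non-personalized results, refreshed out of band.
FALLBACK_RESULTS = ["top-seller-1", "top-seller-2", "top-seller-3"]


def personalized_results(user_id: str) -> list[str]:
    """Stand-in for an expensive call to a recommendation backend."""
    time.sleep(0.05)  # simulate backend latency
    return [f"reco-{user_id}-{i}" for i in range(3)]


def get_results(user_id: str) -> list[str]:
    """Serve the full experience normally, but degrade gracefully on demand."""
    if DEGRADED_MODE:
        # An operator (or automation) has flipped the flag: skip the
        # expensive path entirely and serve the cheaper, cached results.
        return FALLBACK_RESULTS
    try:
        return personalized_results(user_id)
    except Exception:
        # Dependency failure: degrade instead of failing the whole request.
        return FALLBACK_RESULTS
```

Flipping the flag on in production periodically, even briefly, is what turns this from a theoretical escape hatch into a tested one, and it surfaces which "extra" performance your users have quietly started to depend on.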
Intentionally degrade performance modes
Here are some ways to intentionally build in degraded performance modes to add resilience:
- Feature flags to quickly disable or throttle non-critical functions to preserve core services.
- Config parameters to ra