It was 2 AM when my phone started buzzing. Half-asleep, I grabbed it and saw an alert:
High error rate detected. Immediate action required.
I rushed to my laptop and opened the dashboard. Red. Everything was red. Our system, which had been running smoothly for months, was suddenly failing.
I scanned the logs. They made no sense… just a mess of errors and unhelpful messages. Users were flooding support with complaints.
After hours of debugging, we found the problem… One bug had taken down an entire service. We fixed it, deployed it and everything was up and running again.
That night, I realised that we didn’t have good enough observability into our system.
If you ask most engineers whether they have good monitoring, they’ll say yes.
They have dashboards. They have alerts. They have logs.
But then something breaks, and suddenly, they’re playing detective.
- Digging through logs.
- Restarting services.
- Guessing.
Most teams don’t see problems until they explode. They don’t catch small failures before they snowball into disasters.
And that’s the real problem. Software is messy. Small failures happen all the time. But if you can’t see them, you can’t stop them.