Co-authored by Kelly Shortridge and Ryan Petrich
We hear about all the ways to make your deploys so glorious that your pipelines poop rainbows and services saunter off into the sunset together. But what we don’t see as much is folklore of how to make your deploys suffer.1
Where are the nightmarish tales of our brave little deploy quivering in their worn, unpatched boots – trembling in a realm gory and grim where pipelines rumble towards the thorny, howling woods of production? Such tales are swept aside so we can pretend the world is nice (it is not).
To address this poignant market painpoint, this post is a cursed compendium of 69 ways to fuck up your deploy. Some are ways to fuck up your deploy now and some are ways to fuck it up for Future You. Some of the fuckups may already have happened and are waiting to pounce on you, the unsuspecting prey. Some of them desecrate your performance. Some of them leak data. Some of them thunderstrike and flabbergast, shattering our mental models.
All of them make for a very bad time.
We’ve structured this post into 10 themes of fuckups plus the singularly horrible fuckup of manual deploys. For your convenience, these themes are linked in the Table of Turmoil below so you can browse between soul-butchering meetings or existential crises. We are not liable for any excess anxiety provoked by reading these dastardly deeds… but we like to think this post will help many mortals avoid pain and pandemonium in the future.
The Table of Turmoil:
- Identity Crisis
- Loggers and Monitaurs
- Playing with Deployment Mismatches
- Configuration Tarnation
- Statefulness is Hard
- Net-not-working
- Rolls and Reboots
- Disorganized Organization
- Business Illogic
- The Audacity of Spacetime
- Manual Deploys
Identity Crisis
Permissions are perhaps the final boss of Deployment Dark Souls; they are fiddly, easily forgotten, and never forgiven by the universe.
1. Allow all access.
“Allow all access” is simple and makes deployment easy. You’ll never get a permission failure! It makes for infinite possibilities! Even Sonic would wonder at our speed!
And indeed, dear reader, what wonder allow * inspires… like a wonder for what services the app actually talks to and what we might need to monitor; a wonder for what data the app actually reads and modifies; a wonder for how many other services could go down if the app misbehaved; and a wonder for exactly how many other teams we might inconvenience during an incident.
Whether for quality’s sake or security’s, we should not trade simplicity today for torment tomorrow.
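If it helps to see the trade-off in the shape of code, here is a minimal sketch using a hypothetical, provider-agnostic policy structure (the action and resource names are illustrative, not anyone's real API):

```python
# A hypothetical policy structure; fields and names are illustrative only.
ALLOW_ALL = {
    "effect": "allow",
    "actions": ["*"],     # every action on...
    "resources": ["*"],   # ...every resource: easy today, unauditable tomorrow
}

SCOPED = {
    "effect": "allow",
    "actions": ["queue:Publish", "table:GetItem"],  # only what the app actually does
    "resources": [
        "queue/orders-events",                      # only what it actually touches
        "table/orders",
    ],
}

def blast_radius(policy: dict) -> str:
    """Rough gut-check: what can go wrong if this identity is misused?"""
    if "*" in policy["actions"] or "*" in policy["resources"]:
        return "everything this account can reach"
    return f"{len(policy['resources'])} named resources"

if __name__ == "__main__":
    print(blast_radius(ALLOW_ALL))  # -> everything this account can reach
    print(blast_radius(SCOPED))     # -> 2 named resources
```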
2. Keys in plaintext.
Key management systems (KMS) are complex and can be ornery. Instead of taming these complex beasts – requiring persistence and perhaps the assistance of Athena herself to ride the beast onward to glory – it can be tempting to store keys in plaintext where they are easily understandable by engineers and operators.
If anything goes wrong, they can simply examine the text with their eyeballs. Unfortunately, attackers also have eyeballs and will be grateful that you have saved them a lot of steps in pwning your prod. And if engineers write the keys down somewhere for manual use in an “emergency” or after they’ve left the company… thoughts and prayers.
3. Keys aren’t in plaintext, they’re just accessible to everyone through the management system.
You’ve already realized storing keys in plaintext is unwise (F.U. #2) and upgraded to a key management system to coordinate the dance of the keys. Now you can rotate keys with ease and have knowledge of when they were used! Alas, no one set up any roles or permissions and so every engineer and operator has access to all of the keys.
At least you now have logs of who accessed which keys so you can see who possibly leaked or misused a key when it happens, right? But how useful are those logs when they are simply a list of employees that are trusted to make deploys or respond to incidents?
4. No authorization on production builds.
The logical conclusion of fully automated deployments is being able to push to production via SCM operations (aka “GitOps”). Someone pushes a branch, automation decides it was a release, and now you have a “fun” incident response surprise party to resolve the accidental deploy.
One option is to enforce sufficient restrictions on who can push to which branches and in what circumstances. Or, you can go on a yearlong silent meditation retreat to cultivate the inner peace necessary to be comfortable with surprise deployments.
The common “mitigation” “plan” is to only hire devs who have a full understanding of how git works, train them properly2 on your precise GitOops workflow, and trust that they’ll never make a mistake… but we all know that’s just your lizard brain’s reckless optimism telling logic to stop harshing its vibes. Make it go sun on a rock somewhere instead.
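One hedged sketch of that "sufficient restrictions" option, assuming a Python wrapper sits in front of whatever actually pushes to production (the ref pattern, environment variables, and releaser list are made up for illustration):

```python
# A minimal sketch of a release gate; names and patterns are illustrative.
import os
import re
import sys

RELEASE_REF = re.compile(r"^refs/tags/v\d+\.\d+\.\d+$")  # only semver tags trigger a release
ALLOWED_RELEASERS = {"alice", "bob"}                      # hypothetical release managers

def should_deploy(ref: str, actor: str) -> bool:
    """Treat a push as a release only if both the ref and the actor qualify."""
    return bool(RELEASE_REF.match(ref)) and actor in ALLOWED_RELEASERS

if __name__ == "__main__":
    ref = os.environ.get("CI_REF", "")
    actor = os.environ.get("CI_ACTOR", "")
    if not should_deploy(ref, actor):
        print(f"refusing to deploy: ref={ref!r} actor={actor!r}")
        sys.exit(1)
    print("proceeding with production deploy")
```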
5. Keys in the deploy script… which is checked into GitHub.
Sometimes build tooling is janky and deployment tooling even jankier. After all, you don’t ship this code to users, so it’s okay if it’s less tidy (or so we tell ourselves). Because working with key management systems can be frustrating, it’s tempting to include the keys in the script itself.
Now anyone who has your source code can deploy it as your normal workflow would. Good luck maintaining an accurate history of who deployed what and when, especially when the “who” is the intern who git clone’d the codebase and started “experimenting” with it.
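A minimal sketch of the alternative, assuming your CI can inject the secret at runtime (DEPLOY_KEY is a hypothetical variable name, not a prescription):

```python
# Read the deploy key from the environment instead of the script itself.
import os
import sys

# Tempting, and now everyone with read access to the repo can deploy as "you":
# DEPLOY_KEY = "sk_live_abc123..."   # <- do not do this

DEPLOY_KEY = os.environ.get("DEPLOY_KEY")
if not DEPLOY_KEY:
    sys.exit("DEPLOY_KEY not set; refusing to deploy with no identity attached")
```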
6. New build uses assets or libraries on a dev’s personal account.
You’ve decided that developers should be free to choose the best libraries and tools necessary to get the job done, and why shouldn’t they? For many, this will be a homegrown monstrosity that has no tests or documentation and is written in their Own Personal Style™. The dev who chose it is the only one who knows how to use it, but it’s convenient for them.
But is it the most convenient choice for everyone else? What about when the employee leaves and shutters their GitHub account? The supply chain attack3 is coming from inside the house!
7. Temporary access granted for the deployment isn’t revoked.
As Milton Friedman quipped years ago, “Nothing is so permanent as a temporary access token.”
The deployment is complicated and automating all of the steps is a lot of work, so the logical path is to deploy the service manually just this once. The next quarter, there’s an incident and to get the system operational again, it’s quickest to let the team lead log in and manually repair it.
But after the access is added, it’s all too easy to overlook removing the access. Employees would never take shortcuts or abuse their access, right? And their accounts or devices could never be compromised by attackers, right?
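If you want the “just this once” access to actually be temporary, one illustrative option is to record an expiry next to every grant so something can enforce it later (the Grant structure below is hypothetical):

```python
# A minimal sketch of time-boxed access; the Grant structure is illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Grant:
    who: str
    why: str
    expires_at: datetime

    def is_valid(self) -> bool:
        return datetime.now(timezone.utc) < self.expires_at

# "Just this once" gets a deadline instead of living forever:
break_glass = Grant(
    who="team-lead",
    why="manual repair during INC-1234",
    expires_at=datetime.now(timezone.utc) + timedelta(hours=4),
)
assert break_glass.is_valid()
```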
8. Former employees can still deploy.
Leadership claims your onboarding and offboarding checklists are exhaustive and followed perfectly every time. And, indeed, your resilience and security goals rely on them being followed perfectly. A safety job well done! No one will be able to deploy your application after they’ve put in notice!
What’s that? That wasn’t part of your checklist after all? Or did you skip over that item because it’s too hard to rotate the keys when certain employees quit – the keys are too essential and baked into too many systems?
You’ve replaced those keys, but the old ones aren’t destroyed, revoked, or set to expire, so your only hope now is that the org didn’t piss off the departing employees enough for them to YOLO rage around in prod. Sure, former employees have always expressed goodwill towards your company and no one has ever left disgruntled… but would you bet on that staying true?
9. Your app uses credentials associated with the account of the employee you just fired.
Engineers and operators don’t just share credentials between themselves. If you’re extra lucky, they’ll bake them into the software or services, and then when they leave or transfer to a new department, the system will fail when their permissions are revoked. Maybe sharing isn’t caring.
10. Login tokens are reset and users get frustrated and churn as a result.
Some businesses run on engagement. The more users interact with the platform, the more they induce others to interact, which means more advertising messages you can show them with a more precise understanding of what they might buy. Teams track engagement metrics closely and every little design change is justified or rescinded by how it performs on these metrics. It’s a merry-go-round of incentives and dark patterns.
But one day you migrate to a new login token format or seed, forcing everyone to log in again and the metrics are fucked because many users don’t want to go to the trouble. Those fantastic growth numbers you hoped would bolster your company’s next VC round no longer exist because you broke the cycle of engagement addiction.
Loggers and Monitaurs
Logging and monitoring are essential, which is why getting them wrong wounds us like a Minotaur’s horn through the heart.
11. Logs on full blast.
Systems are hard to analyze without breadcrumbs describing what happened, so logging is an essential quality of an observable system.
Ever-lurking in engineering teams is the natural temptation to log more things. You might need some information in a scenario you haven’t thought of yet, so why not log it? It will be behind the debug level anyway, so it does no harm in production…
…until someone needs to debug a live instance and turns the logging up to 11. Now the system is bogged down by a deluge of logging messages full of references to internal operations, data structures, and other minutiae. The poor soul tasked with understanding the system is looking for hay in a needlestack.
Worse, someone could enable debugging in pre-production where traffic isn’t as high4 and not notice before deploying to the live environment. Now all your production machines are printing logs with CVS receipt-levels of waste, potentially flooding your logging system. If you’re extra unlucky, some of your shared logging infrastructure is taken offline and multiple teams must declare an incident.
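A minimal sketch of one mitigation, assuming a Python service: let the environment choose the log level and default to something boring, so a forgotten debug flag doesn’t follow the build into production (the LOG_LEVEL variable and logger names are illustrative):

```python
# Environment-driven log level with a safe default; names are illustrative.
import logging
import os

level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

log = logging.getLogger("orders")
log.debug("internal minutiae nobody needs at 3am")  # dropped unless LOG_LEVEL=DEBUG
log.info("order accepted")
```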
12. Logs on no blast.
Who doesn’t want peace and quiet? But when logs are quiet, the peace is potentially artificial.
Logs could be configured to the wrong endpoint or fail to write for whatever reason; you wouldn’t even be aware of it because the error message is in the logs that you aren’t receiving. Logs could also be turned off; maybe that’s an option for performance testing5.
Either way, you better hope that the system is performing properly and that you planned adequate capacity. Because if the system ever runs hot or hits a bottleneck, it has no way of telling you.
13. Logs being sent nowhere.
Your log pipelines were set up years ago by employees long gone. Also long gone is the SIEM to which logs were being sent. Years go by, an incident happens, and during investigation you realize this fatal mistake. Your only recourse is locally-saved logs, which, for capacity reasons, are woefully itsy bitsy and you are the spider stuck in a spout, awash in your own tears.
14. Canary is dead, but you didn’t realize it so you deployed to all the servers anyway and caused downtime.
You’ve been doing this DevOps thing awhile and have a mature process that involves canary deployments to ensure even failed updates won’t incur downtime for users. Deployments are routine and refined to a science. Uptime clearly matters to you. Only this time, the canary fails in a way that your process fails to notice.
An alternative scenario is that some part of the process wasn’t followed and a dead canary is overlooked. You miss the corpse that is your new version and kill the entire flock.
Having a process and system in place to prevent failure and then completely ignoring it and failing anyway likely deserves its own achievement award. Do you need a better process, or do you need to fix the tools? How can you avoid this in the future? This will be furiously debated in the post-mortem, which, if blameful rather than blameless, will likely result in this failure repeating within the next year.
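For the “fix the tools” camp, a hedged sketch of a rollout gate that refuses to touch the rest of the fleet until the canary has stayed healthy for a while (the health URL, thresholds, and deploy_to() helper are illustrative stand-ins for your actual tooling):

```python
# A minimal rollout gate; URL, thresholds, and helpers are illustrative.
import sys
import time
import urllib.request

CANARY_HEALTH_URL = "http://canary.internal:8080/healthz"  # hypothetical

def canary_is_healthy(checks: int = 10, interval_s: float = 30.0) -> bool:
    """Poll the canary several times; one green check is not a living bird."""
    for _ in range(checks):
        try:
            with urllib.request.urlopen(CANARY_HEALTH_URL, timeout=5) as resp:
                if resp.status != 200:
                    return False
        except OSError:
            return False
        time.sleep(interval_s)
    return True

if __name__ == "__main__":
    if not canary_is_healthy():
        sys.exit("canary is dead; halting rollout instead of killing the flock")
    # deploy_to(rest_of_fleet)  # only now touch the other servers
```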
15. System fails silently.
A system is crying out for help. Its calls are sent into the cold, uncaring void. Its lamentable fate is only discovered months later when a downstream system goes haywire or a customer complains about suspiciously missing data.
“How could it be failing for so long?” you wonder as you stand it back up before adding a “please monitor this” ticket to the team’s backlog that they’ll definitely, totes for sure get to in the next sprint.
16. New version puts sensitive data in logs.
Yay, the new version of the service writes more log data to make it easier to operate, monitor, and debug the service should something ever go wrong! But, there’s a catch: some of the new log messages include sensitive data such as passwords or credit card details. This may not even be purposeful. Perhaps it logs the contents of the incoming request when a particular logging mode is enabled.
Unfortunately, there are very specific rules that businesses of your type must follow when handling certain types of data and your logging pipeline doesn’t follow any of them. Now your near-term plans are decimated by the effort to clean up or redact logs – effort you could have skipped had the engineer who added that logging known about the data handling requirements. By the way, the IPO is in a few months. XOXO.
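One defense-in-depth sketch, assuming you can’t review every log call by hand: a logging filter that scrubs obvious secrets before they leave the app (the patterns below are illustrative and nowhere near exhaustive – they don’t replace actually knowing your data handling requirements):

```python
# A redacting log filter; the regex patterns are illustrative, not exhaustive.
import logging
import re

SENSITIVE = re.compile(r"(password|card_number|authorization)=\S+", re.IGNORECASE)

class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = SENSITIVE.sub(r"\1=[REDACTED]", str(record.msg))
        return True

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments")
log.addFilter(RedactingFilter())
log.info("charge failed for card_number=4111111111111111")  # emitted as card_number=[REDACTED]
```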
Playing with Deployment Mismatches
There were assumptions about what you deployed and those assumptions were wrong.
17. What you deployed wasn’t actually what you tested.
Builds are automated and we tested the output of the previous build, so what’s the harm of rebuilding as part of the deployment process? Not so fast.
Unless your build is reproducible, the results you receive may be somewhat different. Dependencies may have been updated. Docker caching may give you a newer (or older, surprisingly!) base image6. Even something as simple as changing the order of linked libraries7 could result in software that differs from what was tested.
Configurations fall prey to this, too. “Well, it works with allow-all!” Right, but it doesn’t work in production because the security policy is different in pre-prod. Or, the new version requires additional permissions or resources which were configured manually in the test environment… but, cranial working memory is terribly finite, and thus they were forgotten in prod.
There are numerous solutions to this problem (like reproducible builds or asset archiving), but you may not bother to employ them until a broken production deploy prompts you to. And some of the solutions descend into a stupid sort of fatalism: “If we don’t have fully reproducible builds, we don’t have anything, there’s no point to any of this.” And then Nietzsche rolls in his grave.
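A minimal sketch of “deploy the bytes you tested”, assuming you can archive the artifact: record its digest when tests pass and refuse to ship anything whose digest differs (paths and function names are illustrative):

```python
# Record and verify an artifact digest between test and deploy; names are illustrative.
import hashlib
import sys
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_tested(artifact: Path, manifest: Path) -> None:
    """Called when the test suite passes against this exact artifact."""
    manifest.write_text(digest(artifact))

def verify_before_deploy(artifact: Path, manifest: Path) -> None:
    """Called at deploy time; bail if the bytes don't match what was tested."""
    if digest(artifact) != manifest.read_text().strip():
        sys.exit("artifact differs from the build that passed tests; refusing to deploy")
```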
18. Not testing important error paths.
We have to move fast. New features. Tickets. Story points. Ship, ship, ship. Developers with the velocity of a projectile. Errors? Bah, log them and move on.
If something is incorrect, surely it will be noticed in test or be reported by users – spoken by someone who has never faced an angry customer because their data was leaked or discovered their lowest rung employee fuming with resentment when they see the company’s fat margins.
Alas, too often we see a new version that forgets to check auth cookies, roles, groups, and so forth because devs test it as admin on the premium enterprise plan, but forget that lowly regular users on the free tier can’t do and see everything.
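A hedged sketch of what testing the unhappy path can look like, using a hypothetical can_view() helper – the point is that the denial case gets a test, not just the admin-on-enterprise happy path:

```python
# Hypothetical access check plus pytest-style tests for the denial path.
from dataclasses import dataclass

@dataclass
class User:
    role: str
    plan: str

def can_view(user: User, resource: str) -> bool:
    if resource == "billing_export":
        return user.role == "admin" and user.plan == "enterprise"
    return True

def test_free_tier_user_cannot_view_billing_export():
    assert not can_view(User(role="member", plan="free"), "billing_export")

def test_admin_on_enterprise_can_view_billing_export():
    assert can_view(User(role="admin", plan="enterprise"), "billing_export")
```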
19. Untested upgrade path.
Your infrastructure is declarative, but the world is not. The app works in isolation, but doesn’t accept the data from the previous version or behaves weirdly when faced with it.
Possibly the schema has changed, but the migration path for existing data (like user records) was never tested. You didn’t test it because you recreated your environment each time. The new version no longer preserves the same invariants as the old version and you watch in horror as other components in the system topple one by one.
Possibly you’re using a NoSQL database or some other data store for which there isn’t a schema and now the work of data migration falls on the application… but no one designed or tested for that.
Or, maybe you’re pushing a large number of updates to a rarely used part of your networking stack. For those that are all-in on infrastructure as code (IaC), supporting old schema, data, and user sessions can be a thorny problem.
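One illustrative upgrade-path test, assuming the new version reshapes user records: seed the store with data shaped the way the old version wrote it, run the migration, and assert the new invariants (the record shape and migrate() function are made up):

```python
# An upgrade-path test against data written by the previous version; shapes are illustrative.
def migrate(user: dict) -> dict:
    """New version splits 'name' into 'first_name'/'last_name'."""
    if "name" in user:
        first, _, last = user.pop("name").partition(" ")
        user["first_name"], user["last_name"] = first, last
    return user

def test_migrates_record_written_by_previous_version():
    legacy = {"id": 42, "name": "Ada Lovelace"}  # as v1 stored it
    upgraded = migrate(legacy)
    assert upgraded == {"id": 42, "first_name": "Ada", "last_name": "Lovelace"}
```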
20. “It’s a configuration change, so there’s no need to test.”
A shocking number of outages spawn from what is, in theory, a simple little configuration change. “How much could one configuration change break, Michael?”
Many teams overlook just how much damage configuration changes can engender. Configuration is the essential connective tissue between services and just about anything that can be configured can cause breakage when misconfigured.
21. Deployment was untested because the fix is urgent.
The clock is ticking and sweat is sopping your brow. Something must be done to avoid an outage or data loss or some other negative consequence. This fix is at least something and this something seems like it should work8. You deploy it now because time is of the essence. It fails and you now have less time or have caused more mess to clean up.
Only in hindsight do you realize a better option was available. Or, maybe the option you chose was the best one, but you made a small mistake. Was the haste worth it?
Urgency changes your decision-making. It’s a well-intentioned evolutionary design of your brain that causes unfortunate side effects when dealing with computer systems. In fact, “urgency” could probably be its own macro class of deploy fails given its prevalence as a factor in them.
22. App wasn’t tested on realistic data.
“Everything works in staging! How could it have failed when we pushed it live? I thought we did everything right by testing the schema migration with our test data and load testing the new version.”
Narrator: The software engineer is in their natural habitat. Observe how they pull at their own hair, a hallmark of their species to signal that something has distressed them. It is very difficult to replicate everything that’s happening in production in an artificial test environment without some sort of replication or replay system. This vexes our otherwise clever engineer.
“It causes a crash!? What kind of deranged mortal would have an apostrophe in their name? Oh, it’s common in some cultures? Hmmm…”
If you keep your service online as you deploy, you should really test your upgrade path under simulated load. If you don’t, you can’t be sure if your planned upgrade process will work or how long it will take.
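A minimal sketch of the cheap version of “realistic data”: a fixture of strings real humans actually have, pushed through the same storage path production uses (the save_user() stub and SQLite table are illustrative):

```python
# Exercise storage with names real humans have; the table and stub are illustrative.
import sqlite3

REALISTIC_NAMES = ["Conan O'Brien", "Zoë", "李雷", "Robert'); DROP TABLE users;--"]

def save_user(conn: sqlite3.Connection, name: str) -> None:
    # Parameterized query: apostrophes are data, not syntax.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

def test_saves_names_with_apostrophes_and_unicode():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    for name in REALISTIC_NAMES:
        save_user(conn, name)
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == len(REALISTIC_NAMES)
```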
23. Deploying to the wrong environment accidentally.
When you make deployments easy, it is possible to make deploying to prod too easy. And easy to use doesn’t necessarily mean easy to understand. When a slip of the finger results in code going live, you may want to consider just how far you’ve taken automation and if other parts of your process need to catch up.
Because one day, a sleep-deprived Future You is going to run a deploy script where you have to pass in an environment name and you will type dve instead of dev. Once it dawns on you that the deploy system falls back to “prod” as the default, adrenaline shocks you awake with the force of 9000 espressos and you will never sleep again.
The regrettable reality is that internal tools often offer terrible UX because engineers refuse to give themselves nice things (including therapy). These tools, akin to a rusty sword with no hilt, make these sorts of failures tragically common. The rise of platform engineering is hopefully an antidote to this phenomenon, treating software engineers as user personas worthy of UX investments, too.
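A hedged sketch of a friendlier entry point: no default environment, an explicit allowlist so dve is an error instead of prod, and a speed bump before production (environment names are illustrative):

```python
# A deploy entry point with no default environment; names are illustrative.
import sys

ENVIRONMENTS = {"dev", "staging", "prod"}

def resolve_environment(argv: list[str]) -> str:
    if len(argv) < 2:
        sys.exit(f"usage: deploy <env>; one of {sorted(ENVIRONMENTS)}")
    env = argv[1]
    if env not in ENVIRONMENTS:
        sys.exit(f"unknown environment {env!r}; did you mean one of {sorted(ENVIRONMENTS)}?")
    if env == "prod" and input("type 'prod' to confirm production deploy: ") != "prod":
        sys.exit("aborted")
    return env

if __name__ == "__main__":
    env = resolve_environment(sys.argv)
    print(f"deploying to {env}")
```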
24. No real pre-production environment.
You have a staging environment (congrats!), but it’s an ancient clone from production which has seen so many failed builds, bizarre testing runs, and manual configs that it bears only a pale resemblance to the system it’s supposed to epitomize. It gives you confidence that your software could deploy successfully, but not much else.
You wish you could tear it down and rebuild it anew, but everyone’s busy and it’s never quite important enough for someone to start working on it rather than some other task. Thus you’re doomed to clean up small messes that could be caught by a true staging environment.
At the next DevOps conference you attend, every keynote speaker refers to the “fact” that “everyone” has a “high-fidelity” staging environment (“obviously”) as you weep in silence.
25. Production OS has a different version than pre-prod OS and the app fails to start.
Production systems are incredibly important and we must patch frequently to keep them in compliance. But the same diligence isn’t applied to pre-prod, development, build and other environments.
The systems in these environments may therefore be wildly out of date and the software they produce may be incompatible with the up-to-date, patched production system. Systems will drift so far from the standard that QA systems look like an alternate reality from production and make you a believer in the multiverse hypothesis.
26. Backup botch-ups.
A production deploy requires a backup because hot damn have we fucked it up so many times and a backup makes everyone feel more confident. The administrator responsible for performing the backup writes the backup over the live system, causing an outage. Furthermore, because the data was overwritten by the botched backup, any existing backups are not recent.
Backup fuckups happe