The familiar blue and gold intro graphic fills the screen every evening at six o’clock on the dot. The jabbing staccato string music conjures up vague secondhand memories of what a teletype machine might have sounded like. A high-angle view of the studio floor with the large Lexan-clad desk in the middle, then a cross dissolve to a two-shot of the presenters for this newscast. The music fades, each person introduces themselves, then they jump straight into the top story for the evening. It’s been this way for as long as anybody can remember. They’ve never failed to get this show on the air.
They’ve never failed.
Who you gonna call?
Everything fails all the time.
Producing any sort of live television show is a complex ballet. The studio’s cameras and microphones route their signals into video switchers and audio mixers, pre-taped packages come from the video server, field reporters are connected bidirectionally through a satellite link, and with a sprinkling of pizazz from the motion graphics machine, the final product is sent off to master control and ultimately to all the kitchen counters and family rooms across the city.
But there are ancillary systems outside of this direct pipeline. The studio lighting is quite important, as most professional broadcast cameras tend to produce underwhelming images under inadequate light. The teleprompters feeding the anchors their scripts are obviously important. The weather reporting segments use an entirely separate system of graphics rendering equipment that must be linked through a chroma keyer to place the meteorologist in front of the computer-generated forecast images. And of course, all of this equipment requires a handful of human operators.
Studio-grade equipment is obscenely expensive, but it is also incredibly reliable. It is rare for things to outright fail, but anything can eventually wear out after enough daily use. If a camera fails, perhaps they can wheel the one from the sports desk over to cover this part of the broadcast. If the teleprompters fail, the anchors have a copy of the script at their desk that they can look down at. If one of the anchors calls out sick, they can sub in talent from the morning news team.
Each of these is an example of either a redundant backup system or spare capacity that can be reallocated if needed. The broadcast technically does not need any of these contingencies to function under normal circumstances, but in cases where things go wrong it can mean the difference between success and total failure.
Not everything can be made completely redundant. A failure in the power system for the lights will most likely plunge the entire studio into darkness, and that’s no way to run a news program. Similarly, if the $50,000 video switcher dies, it’s highly unlikely that they’ll have a spare stashed in the supply closet. To insure against every possible thing that could ever go wrong, they would have to build a second studio on a separate part of the city’s electric grid, with redundant copies of all the equipment and broadcast content, along with a full crew of understudies ready to take over at a moment’s notice. This is a degree of redundancy that can’t reasonably be achieved by any budget-conscious station.
There is a hybrid between these two extremes, allowing the station to maintain only a single instance of anything expensive while having some assurance that the equipment they do have will work when needed: They can find an expert of some sort who is capable of fixing anything that breaks well enough to get the broadcast out. We’ll name this person Alex. If the microphone battery dies, Alex will swap it out. If the video server acts up, Alex knows how to get it working again. If the tire pressure light in the Chevrolet Weather Beast comes on, or the studio’s air conditioning fails, or the technical director breaks both their hands and needs somebody to push the buttons on their behalf, it’s Alex’s time to shine.
Now, naturally, most of the time everything is going fine and Alex has nothing to do. So Alex has some other regular job in the studio—say, running the audio mixer. In fact, running the audio mixer is their official job and their primary responsibility at the station; they only jump into universal-problem-solving mode when something goes wrong. As soon as the problem is resolved, it’s back to the audio mixer.
The other thing about all this is, well, it’s very difficult to find and train people like Alex. So since they are at the station all evening anyway, why not also have them stick around in case anything goes wrong during the 7:00 news, and the 11:00? And if anything happens during the 4:30–7 a.m. news, the station can call Alex at home and have them bop over and fix the problem. Oh, and also the news at noon, and the 4 p.m. block. Apparently this station broadcasts six hours of live news programming most days. At least it’s only four hours on Sunday. In the station’s view, there is no need for anybody to relieve Alex because—most of the time—they don’t need Alex’s emergency response skills at all. There should be no need to hire and train somebody else to do this stuff because they barely use the services of the person they already have.
There is, of course, another option that the station has never seriously entertained: Don’t hold Alex to any of those responsibilities at all, and if things really go to hell they can just throw on an old rerun of The Price Is Right and hope for better luck during the next scheduled newscast.
Grandpa, what’s a beeper?
1-800-759-7243
But if you ain’t got that pin number, dummy, you can’t call me
To hook up with Mix you gotta call that number
Then sit by the phone and wonder
Will he call? If you’re fine I might
If you’re a duck, good night
There was a time—not that long ago, really—when people couldn’t contact you if they didn’t know where you were. Telephones were literally screwed into the walls of houses and businesses. Portable two-way radios existed, but they were a massive pain to carry around and operate. If somebody wished to contact you, they would not call you specifically but rather your house or your workplace, places where you might or might not have been at the time. If you were not there, maybe they’d try to call your brother’s house, your favorite bar, the Kiwanis club, or another location that was significant to you. If they still couldn’t find you, eventually they’d give up. People used to be more chill in that way.
In a more structured environment—say a hospital where doctors moved from room to room but stayed inside one building—it was important to be able to get in touch with a specific person without knowing which room they were in. To accomplish this, a phone operator would page the desired person via an announcement over the building’s public address speakers: “Paging Dr. Johnson, Dr. Johnson, please call fourth floor nurse station.” (This verb form of the word “page” uses the same sense as the noun “page,” an old-timey word meaning roughly “servant boy.” I page you in the sense that I am asking Kenneth, the NBC page from 30 Rock, to send for you.) Assuming Dr. Johnson was in the building to hear this, they would find a phone and call the station as instructed.
This worked fine, but it generated a lot of “useless” noise because most of the staff were uninvolved in most of the pages they overheard. Thanks to incremental improvements in technology, the voice announcements were phased out to make way for unidirectional radio broadcasts that covered the entire building. The content of the radio message remained the same as the audible announcement: who the page was for, and who that person needed to contact in response. Each person who needed to receive pages was given a pager, a radio receiver that was pre-programmed to only activate in response to pages specifically addressed to it. Each pager contained a small numeric display where the information about who to call could be shown. These were colloquially called beepers because, well, they made a beeping sound to announce each incoming page.
To send a page, a person would pick up one of the building’s telephones and dial the number for the paging system. They would be prompted to enter the recipient’s PIN or unique identification code along with a callback number. If the sender wanted the recipient to call them directly, the callback number would be a phone that the sender was ready to pick up. It didn’t have to be, though. For example, the sender and recipient could have a prearranged system in which a code like “505” could be interpreted as the distress signal SOS with some mutually understood meaning. These codes were more common between senders and recipients who knew each other well, representing messages they frequently needed to exchange. To a building maintenance worker, “234” could indicate an emergency at 234 Maple Avenue while “5300” could mean 5300 Elm Street. The codes meant what the sender and recipient agreed they meant.
Technology got better. Things got smaller and faster. The unidirectional pager networks started becoming overshadowed by mobile phone networks, which soon gained the ability to send bidirectional SMS messages. Microprocessors advanced to the point where a battery-operated handheld device could serve as a phone that could also send and receive text messages. These advances made it possible to send longer messages using a more expressive character set on a device that also did other things. My very first mobile phone could run a game of Snake that objectively blew. But the capabilities were there. Phones continued to gain features, and the networks they ran on continued to get faster with wider coverage, but the central thread of “I need to get this message to that device” is as clear today as it was when Sir Mix-A-Lot was courting his lady friends in the 1980s.
The systems described up to this point also had one thing in common: The person sending the page was a human being.
Getting on the same page
Dude: They gave Dude a beeper, so whenever these guys call—
Walter: What if it’s during a game?
Dude: Oh, I told them if it was during league play…
Donny: What’s during league play?
Walter: Life does not stop and start at your convenience, you miserable piece of shit.
As with a disheartening number of things in the tech industry, there are no real standards around what on-call responsibilities look like. Each organization (and each team within!) is free to set things up in whichever way suits their tastes, and practices vary widely as a result. In order to ground this article in something concrete, I will describe Alex’s on-call arrangement, which seems to be typical for US companies whose business model is “Have a website and/or mobile app, and either put ads all over it or convince the users to enter their credit card information somewhere to use it.” The prevailing attitude of these organizations is that the product must work at all times, otherwise it results in failure to show an ad or collect a payment. Both of these negatively affect revenue.
Alex’s company uses the SEV system, which might mean “severity,” “site event,” “significant event,” “serious event,” or anything else you might care to contrive that matches the pattern. (Again, no standards. Somebody copied part of the philosophy from Amazon or Facebook or someplace but never bothered to codify exactly what the abbreviation meant to them.) SEVs are further divided into numbered classes depending on their impact on the product experience; a SEV 1 means that the business is currently failing to be a business because it is unable to perform its core functions and/or collect its revenue. The lesser SEV 3 might represent degraded performance on some non-critical portion of the application. An example of a SEV 3 might be a situation where users can still change their profile pictures, but those changes are not showing up promptly in the app due to some kind of processing delay. This will probably not impact the quarterly financial statement in a measurable way. An instance of a SEV 1, on the other hand, might entail the mobile app showing a perpetual loading spinner on every request to every user at once. That type of thing tends to get noticed.
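None of this is standardized, but purely as an illustration, a team might encode its severity ladder as simply as the following sketch. The level names and descriptions here are hypothetical, not anyone’s official definitions:

```python
from enum import IntEnum

class Sev(IntEnum):
    """Hypothetical severity levels; every organization defines its own."""
    SEV1 = 1  # core functions down: the business cannot collect its revenue
    SEV2 = 2  # a major feature is broken or severely degraded
    SEV3 = 3  # non-critical degradation, e.g. profile photo updates lagging
```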
Below the SEV system, there is a bubbling churn of things that are subtly broken, or are well on the way to someday being definitely broken, but are fine for the time being. A good example of this would be a disk that is 98% full. In its current state, nothing is actually wrong. But once it finally becomes 100% full and cannot accept any more data, something else in the system is going to respond poorly, and this will likely cascade into some kind of SEV. Most systems in most organizations have monitoring in place for this sort of thing, and it is common for an on-call engineer to receive a page about (e.g.) high disk usage specifically so they can investigate and head off a potential SEV. Practically all pages of this nature are generated and sent through automated means, and they can sometimes resolve themselves without outside intervention if (e.g.) the disk usage abates naturally.
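As a rough illustration of where these automated pages come from, here is a minimal sketch of a threshold check in Python. Everything in it is a hypothetical stand-in (the 90% threshold, the polling interval, and the send_page and resolve_page functions); a real monitoring stack does the equivalent with far more machinery.

```python
import shutil
import time

WARN_THRESHOLD = 0.90  # hypothetical policy: page when a disk crosses 90% usage
CHECK_INTERVAL = 300   # seconds between checks

def send_page(message: str) -> None:
    # Stand-in for a real paging integration (PagerDuty, Opsgenie, etc.)
    print(f"PAGE: {message}")

def resolve_page(message: str) -> None:
    # Stand-in for auto-resolving the alert once the condition clears
    print(f"RESOLVED: {message}")

def monitor_disk(path: str = "/") -> None:
    """Page when usage crosses the threshold; auto-resolve if it abates."""
    paged = False
    while True:
        usage = shutil.disk_usage(path)
        fraction = usage.used / usage.total
        if fraction >= WARN_THRESHOLD and not paged:
            send_page(f"disk usage on {path} at {fraction:.0%}")
            paged = True
        elif fraction < WARN_THRESHOLD and paged:
            # The disk drained on its own; nobody needs to do anything.
            resolve_page(f"disk usage on {path} back to {fraction:.0%}")
            paged = False
        time.sleep(CHECK_INTERVAL)
```

Note how the check auto-resolves if usage drops back under the threshold, which is exactly the “resolve themselves without outside intervention” case described above.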
The on-call engineer in Alex’s department is selected out of a rotation of all the team members. The on-call shift is seven consecutive days of 24-hour support, or 168 solid hours (±1 hour depending on how daylight saving time shakes out). The on-call engineer does not need to stay awake for seven straight days; the idea is that they’re supposed to work on typical tasks during business hours and go about their non-work lives as usual, but be able to jump into handling an issue quickly after receiving any page at any time. The “quickly” part is formally defined as time to acknowledge, and durations from 5 to 30 minutes are fairly typical. Alex’s team expects pages to be acknowledged within 15 minutes.
If a page is not acknowledged by the on-call engineer, a system of escalation begins. The escalation policy usually follows one of these patterns (a rough sketch of the mechanics follows the list):
- If there is only a single on-call engineer, the page may escalate to them again. This re-raises the original alert in case it was somehow missed the first time.
- In a “primary/secondary” type of arrangement, there are actually two people on-call at any given moment. All pages go to the primary, and only unacknowledged pages escalate to the secondary. If the secondary doesn’t acknowledge the page either, it may escalate further as described by the other bullet points here.
- In a “hunt group” configuration, an unacknowledged page is sent to every member of the team—none of whom are officially on-call at the moment—in the hopes that one of them is free to acknowledge and handle the issue. This arrangement has a strong tendency to break down into one of two degenerate states:
  - One or a few people naturally become highly responsive to all pages, acknowledging them before most of their teammates have the opportunity to do so. Over time, most of the team members stop paying attention to pages and leave their highly responsive peers to handle everything that comes in.
  - Something very close to the bystander effect occurs, where everybody in the group assumes somebody else will acknowledge the page, and so nobody does.
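Putting those patterns together into one illustrative chain, here is a minimal sketch of the mechanics in Python. The notify function, the polling loop, and the is_acked callback are all hypothetical stand-ins; commercial paging services implement this logic server-side, but the shape is the same: page, wait out the time to acknowledge, escalate.

```python
import time
from typing import Callable

ACK_TIMEOUT = 15 * 60  # time to acknowledge, in seconds (Alex's team uses 15 minutes)

def notify(recipient: str, alert: str) -> None:
    # Stand-in for whatever actually delivers the page (app push, SMS, phone call)
    print(f"paging {recipient}: {alert}")

def wait_for_ack(is_acked: Callable[[], bool], timeout: float, poll: float = 30) -> bool:
    """Poll until the alert is acknowledged or the ack window expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_acked():
            return True
        time.sleep(poll)
    return False

def page_with_escalation(alert: str, primary: str, secondary: str,
                         team: list[str], is_acked: Callable[[], bool]) -> None:
    """Primary/secondary escalation, with a hunt-group blast as the last resort."""
    for recipient in (primary, secondary):
        notify(recipient, alert)
        if wait_for_ack(is_acked, ACK_TIMEOUT):
            return  # somebody acknowledged; stop escalating
    # Still unacknowledged: page the entire team and hope somebody is free.
    for member in team:
        notify(member, alert)
```

In the single-engineer pattern from the first bullet, primary and secondary would simply be the same person, which amounts to re-raising the alert after each timeout.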
16 Comments
dylan604
" insure against every possible thing that could ever go wrong, they would have to build a second studio on a separate part of the city’s electric grid, with redundant copies of all the equipment and broadcast content, along with a full crew of understudies ready to take over at a moment’s notice."
WTH?? I guess this person has never heard of backup generators? Every broadcast TV station has them.
boznz
What they don't tell you about working for yourself is the fact that you can be effectively on-call 24×7 every day. I am currently supporting four wineries that are processing thousands of tonnes of receivals 24×7. It happens for two months of the year, and I am expected to be available from 06:00 to 22:00 during that time; there is no phoning in sick or having a lazy day. I work alone and only have one reputation. I don't want to be that contractor forever known for destroying a client's business.
You can only do this for so long though; when two or three problems come in simultaneously it can cause issues, as you drop something halfway through when something more important comes in. I once executed an SQL update query without a where clause under this kind of pressure, and ended up working until the next morning to recover, only to start again at 6AM. I have even had land-line calls at 2AM to bypass my mobile restrictions. The rewards are great, but don't let anyone tell you it is always easy.
My current system is 16 years old now and I know all the ins and outs, so it has been pretty easy to keep on top of things the last several years. However, I am glad the replacement system is nearly written and it will be somebody else's problem in 2026.
flerchin
Jeez I guess what we do is industry standard best practice, and it sucks.
yodsanklai
Excellent article. I can relate to a lot of it. The sad part is that we can't even control the quality of the systems we're oncall for. We're pushed by management for new features, not for robustness of the tools. Also some systems have no clear ownership, so nobody has an incentive to fix them. It'll be next oncall's business. Oncall is really the worst part of my job. I can stand long hours but this is something else.
Kwpolska
That seems like a very long-winded way to say you hate on-call, which is a completely normal thing to do. That said, is on-call effectively mandatory or very popular in the US startup world? Because here, in the European established company world, I can’t really recall seeing a job posting with on-call listed.
slt2021
being oncall forces the quality of software to improve.
if you want fewer incidents: ensure better QA, monitoring, smaller rollouts
usually developers start becoming more conservative after they do a few oncall shifts and suddenly prioritize important reliability improvements, instead of shiny new features nobody will use
Animats
For "non-exempt" employees, that's paid "stand-by time" California.[1] Also see this case involving on-call coroners.[2]
The way this works in most unionized jobs is that there's a stand-by rate paid for on-call hours, plus a minimum number of hours at full or overtime pay, usually four, when someone is called to duty. This is useful to management – if the call frequency is too high, it becomes cheaper to hire an additional person.
[1] https://www.dir.ca.gov/dlse/CallBackAndStandbyTime.pdf
[2] https://casetext.com/case/berry-v-county-of-sonoma
mjcarden
This article gave me unpleasant flashbacks to the first half of 2023. I resigned from planet.com in mid 2023 due to the stress caused by being on-call every second week. It took me six months to get my head into a healthy state again. Now I have a much better job, better paid and no possibility of on-call, ever.
purplejacket
Here's an idea: Compensate any on-call work received during off hours at 10X the normal hourly rate. E.g., if my salary is $150K per year, then my hourly pay rate is about $75 per hour, so compensate my on call work at a rate of $750 per hour. Thus if I get a call at 10pm, log in to my laptop and work for 30 minutes to resolve the issue to a satisfactory level, then I pocket $375. That puts a financial incentive on companies to structure their on call protocols so that only the most important calls are handled. And I can envision variations on this theme. Different sorts of on-call disasters could offer bids for how much they're worth to fix based on some automated rubric, and anyone on the ENG team could pick these up on a first-come, first-served basis. Or various combinations of the above for a guaranteed backup person. But the companies should offer enough incentive to make it worthwhile. And this is in the companies' own best interest. To maintain a workforce that can think clearly during the normal work, to have a good reputation in the industry, to get good reviews on Glassdoor, etc.
dadkins
I just want to point out that the answer is shift work. Here's an example of an SRE job at a national lab:
https://lbl.referrals.selectminds.com/jobs/site-reliability-…
"Work 5 shifts per week to monitor the NERSC HPC Facility, which includes 2 – 3 OWL (midnight – 8am) shifts. Some days may be onsite, some may be offsite. The schedule will be determined by staffing needs."
40 hours per week, full salary, full disclosure about the night shifts, but none of this 24×7 wake up in the middle of the night on top of your regular job bullshit that the tech industry insists on.
smackeyacky
That article made me shudder with echoes of having what we used to call “beeper madness” back in the 1990s. After a while of being on a roster of on-call weeks, anything that beeped would make you reach for that pager on your belt.
As a kid the first few weeks were kind of exciting as it felt like you had been elevated to a new level of responsibility. Once that wore off it was obvious what a cage it was.
I don’t miss pagers.
chris_wot
There is another option. The on-call person just does a deliberately piss-poor job of resolving the problem. I mean, they resolve it, but they make sure it takes an hour longer than necessary.
What are they going to do, fire you? If they make life hard for you, then get another job. The shoddier your work outside of your normal hours, the better. You can have quality, speed and cheapness, but you can only pick two.
denkmoon
I was on call in a 4-man startup for a 1-week rotation for about 9 months, 6 years ago. I still have an anxiety reaction when my phone rings. Can very much relate to the author's thoughts about PTSD.
nhumrich
The difference with dev on-call vs. doctor on-call is that it is self-inflicted.
Why are you getting paged? Because you built the system.
Either your system isn't resilient enough, or you have noisy alerts. Both are problems you should be motivated to fix.
I have been on call 24/7/52 in SRE roles most of my career. It has either sucked hard, or not at all. And the time it sucked the most was because every single practice was bad. And now, I build better things because of it. Paying me more for on call wouldn't have changed how much it sucked. It wouldn't have made any material impact on my actual quality of life. But it would have done two things:
1) made me feel like I can't complain
2) given me less motivation to fix it
Paying for on call doesn't seem like a win. I want happy employees, not disgruntled but silent ones.
lopatin
The OP needs to write with some more focus. Most of this reads like a very long rant by someone who was woken up too many times recently.
> We need to talk about Kafka
No we don't, that entire section was irrelevant.
nullorempty
Yeah, I can relate to people saying they nearly got PTSD, I sure did get it. Paging apps use seriously offensive alarm sounds. I hated every sound they had in the options. It made me instantly sick. Fuck that!