An epic treatise on scheduling, bug tracking, and triage
- [Note 2017-12-29: the news.ycombinator.com
discussion of this post is unusually useful. You may want to read it
first.]
- [Note 2018-09-01: I presented an
updated and shortened version of this talk at SREcon EMEA 2018. There’s a
recording and some slides that you might enjoy. The text below is much more
detailed though, if you want to actually implement the advice.]
I did a talk at work about the various project management troubles I’ve seen
through the years. People seemed to enjoy it a lot more than I expected,
and it went a bit viral among co-workers. It turns out that most of it was not
particularly specific to just our company, so I’ve taken the slides,
annotated them heavily, and removed the proprietary bits. Maybe it’ll help
you too.
Sorry it got… incredibly long. I guess I had a lot to say!
[Obligatory side note: everything I post to this site is my personal
opinion, not necessarily the opinion of my or probably any other employer.]
Scaling in and out of control
Way back in 2016, we were having some trouble getting our team’s releases out the door. We also had out-of-control
bug/task queues and no idea when we’d be done. It was kind of a mess, so I
did what I do best, which is plotting it in a spreadsheet. (If what I did best were actually solving the problem, I assume
they’d have to pay me more.)
Before 2016 though, we had been doing really quite well. As you can see,
our average time between releases was about two months, probably due
primarily to my manager’s iron fist. You might notice that we had some
periodic slower releases, but when I looked into them, it turned out that
they crossed over the December holiday seasons when nobody did any work. So
they were a little longer in wall clock time, but maybe not CPU time. No
big deal. Pretty low standard deviation. Deming
would be proud. Or as my old manager would say, “Cadence trumps mechanism.”
What went wrong in 2016? Well, I decided to find out. And now we’re finally getting to the point of this presentation.
But before we introspect, let’s extrospect!
Here’s a trend from a randomly selected company. I wanted to see how other
companies scale up, so I put together a little analysis of Tesla’s
production (from entirely public sources, by the way; beware that I might
have done it completely wrong). The especially interesting part of this
graph (other than how long they went while producing only a tiny number of
Roadsters and nothing else) is when they released the Model X. Before that,
they roughly scaled employees at almost the same rate as they scaled
quarterly vehicle production. With the introduction of the Model X, units
shipped suddenly went up rapidly compared to number of employees, and
sustained at that level.
I don’t know for sure why that’s the case, but it says something about improved automation, which they claim to have further improved for the Model 3 (not shown because I made this graph earlier in the year, and anyway, they probably haven’t gotten the Model 3 assembly straightened out yet).
Tesla is a neat example of working incrementally. When they had a small number of people, they built a car that was economical to produce in small volumes. Then they scaled up slowly, progressively fixing their scaling problems as they went toward slightly-less-luxury car models. They didn’t try to build a Model 3 on day 1, because they would have failed. They didn’t scale any faster than they legitimately could scale.
So naturally I showed the Tesla plot to my new manager, who looked at it and asked the above question.
Many people groan when they hear this question, but I think we should give
credit where credit is due. Many tech companies are based around the idea
that if you hire the smartest people, and remove all the obstacles you can,
and get rid of red tape, and just let them do what they feel they need to
do, then you’ll get better results than using normal methods with normal
people. The idea has obviously paid off hugely at least for some companies.
So the idea of further boosting performance isn’t completely crazy.
And I did this talk for an Engineering Productivity team, whose job is, I
assume, literally to make the engineers produce more.
But the reason people groan is that they take this suggestion in the most basic way: maybe we can just have the engineers work weekends or something? Well, I think most of us know that’s a losing battle. First of all, even if we could get engineers to work, say, on Saturdays without any other losses (eg. burnout), that would only be a 20% improvement. 20%? Give me a break. If I said I was going to do a talk about how to improve your team’s efficiency by 20%, you wouldn’t even come. You can produce 20% more by hiring one more person for your 5-person team, and nobody even has to work overtime. Forget it.
The above is more or less what I answered when I was asked this question.
But on the other hand, Tesla is charging thousands of dollars for their
“optional” self-driving feature (almost everybody opts for it), which
they’ve been working on for less time with fewer people than many of their
competitors. They must be doing something right, efficiency-wise, right?
Yes. But the good news is I think we can fix it without working overtime. The bad news is we can’t just keep doing what we’ve been doing.
Cursed by goals
Here’s one thing we can’t keep doing. Regardless of what you think of
Psychology Today (which is where I get almost all my dubiously sourced pop
psychology claims; well, that and Wikipedia), there’s ample indication from
Real Researchers, such as my hero W. Edwards Deming, that
“setting goals” doesn’t actually work and can make things worse.
Let’s clarify that a bit. It’s good to have some kind of target, some idea of which direction you’re going. What doesn’t work is deciding when you’re going to get there. Or telling salespeople they need to sell 10% more next quarter. Or telling school teachers they need to improve their standardized test scores. Or telling engineers they need to launch their project at Whatever Conference (and not a day sooner or a day later).
What all these bad goals have in common is that they’re arbitrary. Some supposedly-in-charge executive or product manager just throws out a number and everyone is supposed to scramble to make it come true, and we’re supposed to call that leadership. But anybody can throw out a number. Real leaders have some kind of complicated thought process that leads to a number. And the thought process, once in place, actually gives them the ability to really lead, and really improve productivity, and predict when things really will be done (and make the necessary changes to help them get done sooner). Once they have all that, they don’t really need to tell you an arbitrary deadline, because they already know when you’ll finish, even if you still don’t.
When it works, it all feels like magic. Do you suspect that your competitors are doing magic when they outdo you with fewer people and less money? It’s not magic.
I just talked about why goals don’t really help. This is a quote about how
they can actually hurt, because of (among other things) a famous phenomenon
called the “Student Syndrome.” (By the way, the book this is taken from,
Critical Chain, is a great book on project management.)
We all know how it works. The professor says you have an assignment due in a week. Everyone complains and says that’s impossible, so the prof gives them an extension of an extra week. Then everyone starts work the night before anyway, so all that extra time is wasted.
That’s what happens when you give engineers deadlines, like launching at
Whatever Conference. They’re sure they
can get done in that amount of time, so they take it easy for the first
half. Then they get progressively more panicked as the deadline approaches,
and they’re late anyway, and/or you ship on time and you get what Whatever
Conference always gets, which is shoddy product launches. This is true
regardless of how much extra time you give people. The problem isn’t how
you chose the deadline, it’s that you had a deadline at all.
Stop and think about that a bit. This is really profound stuff. (By the way, Deming wrote things like this about car manufacturers in the 1950s, and it was profound then too.) We’re not just doing deadlines wrong, we’re doing it meta-wrong. The whole system is the wrong system. And there are people who have known that for 50+ years, and we’ve ignored them.
I wanted to throw this in too. It’s not as negative as it sounds at first.
SREs (Site Reliability Engineers) love SLOs (Service Level Objectives). We’re supposed to
love SLOs, and hate SLAs (Service Level Agreements). (Yes, they’re
different.) Why is that?
Well, SREs are onto something. I think someone over there read some of Deming’s work. An SLA requires you to absolutely commit to hitting a particular specification for how good something should be and when, and you get punished if you miss it. That’s a goal, a deadline. It doesn’t work.
An SLO is basically just a documented measurement of how well the system typically works. There are no penalties for missing it, and everyone knows there are no penalties, so they can take it a bit easier. That’s essential. The job of SRE is, essentially, to reduce the standard deviation and improve the average of whatever service metric they’re measuring. When they do, they can update the SLO accordingly.
And, to the point of my quote above, when things suddenly get worse – when the standard deviation increases or the average goes in the wrong direction – the SLO is there so that they can monitor it and know something has gone wrong.
(An SLI – Service Level Indicator – is the thing you measure. But it’s not useful by itself. You have to monitor for changes in the average and standard deviation, and that’s essentially the SLO.)
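To make the SLI/SLO distinction concrete, here’s a minimal sketch of that idea: treat the SLO as “the measured average and standard deviation of an SLI, monitored for change.” (This is my illustration, not any real SRE tooling; the function name, numbers, and 2-sigma threshold are made up.)

```python
import statistics

def within_slo(history, recent, stdevs=2.0):
    """Compare a recent window of SLI samples against measured history.

    Returns True if the recent average stays within `stdevs` standard
    deviations of the historical average -- i.e. the service is behaving
    the way the SLO (documented typical behaviour) says it typically does.
    """
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(statistics.mean(recent) - mean) <= stdevs * sd

# Example SLI: request latencies in milliseconds.
history = [102, 98, 105, 99, 101, 97, 103, 100, 104, 96]
print(within_slo(history, [101, 99, 102]))   # normal week -> True
print(within_slo(history, [140, 150, 145]))  # something changed -> False
```

Note there’s no penalty clause anywhere in there: it’s a detector for “the average or standard deviation moved,” not a contract.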
I think it’s pretty clear that SRE is a thing that works. SLOs give us some clues about why:
-
SLOs are loose predictions, not strict deadlines.
-
SLOs are based on measured history, not arbitrary targets thrown out by some executive.
We need SLOs for software development processes, essentially.
Schedule prediction is a psychological game
That was a bunch of philosophy. Let’s switch to psychology. My favourite
software psychologist is Joel Spolsky, of the famous Joel on Software blog.
Way back in the early 2000s, he wrote a very important post called “Painless
Software Schedules,” which is the one I’ve linked to above. If you follow
the link, you’ll see a note at the top that he no longer thinks you should
take his advice and you should use his newfangled thing instead. I’m sure
his newfangled thing is very nice, but it’s not as simple and elegant as his
old thing. Read his old thing first, and worry about the new thing later,
if you have time.
One of my favourite quotes from that article is linked above. More than a decade ago, at my first company, I was leading our eng team and I infamously (for us) provided the second above quote in response.
What Joel meant by “psychological games” was managers trying to negotiate with engineers about how long a project would take. These conversations always go the same way. The manager has all the power (the power to fire you or cancel your bonus, for example), and the engineer has none. The manager is a professional negotiator (that’s basically what management is) and the engineer is not. The manager also knows what the manager wants (which is the software to be done sooner). The engineer has made an estimate off the top of their head, and is probably in the right ballpark, but not at all sure they have the right number. So the manager says, “Scotty, we need more power!” and the engineer says, “She canna take much more o’ this, captain!” but somehow pulls it off anyway. Ha ha, just kidding! That was a TV show. In real life, Scotty agrees to try, but then can’t pull it off after all and the ship blows up. That manager’s negotiation skills have very much not paid off here. But they sure psyched out the engineer!
In my silly quote, the psychological games I’m talking about are the ones you want to replace that with. Motivation is psychological. Accurate project estimation is psychological. Feature prioritization is psychological. We will all fail at these things unless we know how our minds work and how to make them do what we want. So yes, we definitely need to play some psychological games. Which brings us to Agile.
I know you probably think Agile is mostly cheesy. I do too. But the problem is, the techniques really work. They are almost all psychological games. But the games aren’t there to make you work harder, they’re there, to use an unfortunate cliche, to make you work smarter. What are we doing that needs smartening? If we can answer that, do we need all the cheesy parts? I say, no. Engineers are pretty smart. I think we can tell engineers how the psychological games work, and they can find a way to take the good parts.
But let’s go through all the parts just to be clear.
-
Physical index cards. There are reasons these are introduced into Agile: they help people feel like a feature is a tangible thing. They are especially good for communicating project processes with technophobes. (You can get your technophobic customers to write down things they want on index cards more easily than you can get them to use a bug tracking system.) Nowadays, most tech companies don’t have too many technophobic employees. They also often have many employees in remote offices, and physical cards are a pain to share between offices. The FedEx bills would be crazy. Some people try to use a tool to turn the physical cards into pictures of virtual physical cards, which I guess is okay for what it is, but I don’t think it’s necessary. We just need a text string that represents the feature.
-
Stories & story points. These turn out to be way more useful than you think. They are an incredibly innovative and effective psychological game, even if you don’t technically write them as “stories.” More on that in a bit.
-
Pair programming. Sometimes useful, especially when your programmers are kinda unreliable. It’s like a real-time code review, and thus reduces latency. But mostly not actually something most people do, and that’s fine.
-
Daily standup meetings. Those are just overhead. Agile, surprisingly enough, is not good because it makes you do stupid management fluff every day. It does, but that’s not why it’s good. We can actually leave out 95% of the stupid management fluff. The amazing thing about Agile is actually that you get such huge gains despite the extra overhead gunk it adds.
-
Rugby analogies (ie. SCRUM). Not needed. I don’t know who decided sports analogies were a reliable way to explain things to computer geeks.
-
Strict prioritization. This is a huge one that we’ll get to next – and so is flexible prioritization. Since everyone always knows what your priorities are (and in Agile, you physically post index cards on the wall to make sure they all know, but there are other ways), then people are more likely to work on tasks in the “right” order, which gets features done sooner than you can change your mind. Which means when you do change your mind, it’ll be less expensive. That’s one of the main effects of Agile. Basically, if you can manage to get everyone to prioritize effectively, you don’t need Agile at all. It just turns out to be really hard.
-
Tedious progress tracking: also not needed. See daily standups, above.
Agile accrues a lot of cruft because people take a course in it and there’s
inevitably a TPM (Technical Programme Manager, aka Project
Manager) or something who wants to be useful by following all the steps and
keeping progress spreadsheets. Those aren’t the good parts. If you do this
right, you don’t even need a TPM, because the progress reports write
themselves. Sorry, TPMs.
-
Burndown charts. Speaking of progress reporting, I don’t think I’d heard of burndown charts before Agile. They’re a simple idea: show the number of bugs (or stories, or whatever) still needing to be done before you hit your milestone. As
stories/bugs/tasks/whatever get done, the line goes down, a process which we call “burning” because it makes us feel like awesome Viking warlords instead of computer geeks. A simple concept, right? But when you do it, the results are surprising, which we’ll get to in a moment. Burndown charts are our fundamental unit of progress measurement, and the great thing is they draw themselves so you don’t need the tedious spreadsheet.
-
A series of sprints, adding up to a marathon. This is phrased maybe too cynically (just try to find a marathon runner that treats a marathon as a series of sprints), but that’s actually what Agile proposes. Sprints are very anti-Deming. They encourage Student Syndrome: putting off stuff until each sprint is almost over, then rushing at the end, then missing the deadline and working overtime, then you’re burnt out so you rest for the start of the next sprint, and repeat. Sprints are also like salespeople with quarterly goals that they have to hit for their bonus. Oh, did I just say goals? Right, sprints are goals, and goals don’t work. Let’s instead write software like marathon runners run marathons: at a nice consistent pace that we can continue for a very long time. Or as Deming would say, minimize the standard deviation. Sprints maximize the standard deviation.
Phew! Okay, that was wordy. Let’s draw a picture.
I made a SWE simulator that sends a bunch of
autonomous SWE (software engineer) drones to work on a batch of simulated bugs
(or subtasks). What we see are five simulated burndown charts. In each of the five cases,
the same group of SWEs works on the exact same bug/task database, which starts
with a particular set of bug/tasks and then adds new ones slowly over time. Each
task has a particular random chance of being part of a given feature. On
average, a task is included in ~3 different features. Then the job of the
product managers (PMs) is to decide what features are targeted for our first
milestone. For the simulation, we assume the first milestone will always
target exactly two features (and we always start with the same two).
The release will be ready when the number of open tasks for the two selected features hits zero.
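The mechanism is simple enough that a toy version fits in a page. This is not my actual simulator, just a sketch with made-up constants (10 features, exactly 3 features per task instead of “~3 on average,” 5 tasks closed per day, 1 new task per day):

```python
import random

random.seed(1)  # deterministic, for illustration

NUM_FEATURES = 10         # features in the whole bug/task database
FEATURES_PER_TASK = 3     # exactly 3 here, for simplicity
TASKS_CLOSED_PER_DAY = 5  # combined throughput of the SWE drones
NEW_TASKS_PER_DAY = 1     # new tasks trickle in slowly over time

def new_task():
    # A task is tagged with the random subset of features it belongs to.
    return frozenset(random.sample(range(NUM_FEATURES), FEATURES_PER_TASK))

def simulate(initial_tasks=200, milestone=frozenset({0, 1}), max_days=1000):
    """Return a burndown: open milestone-task count per simulated day.

    The release is ready when no open task touches any milestone feature.
    """
    tasks = [new_task() for _ in range(initial_tasks)]
    burndown = []
    for _day in range(max_days):
        open_tasks = [t for t in tasks if t & milestone]
        burndown.append(len(open_tasks))
        if not open_tasks:
            break  # release is ready
        # Strict prioritization: the drones only close milestone tasks.
        for t in open_tasks[:TASKS_CLOSED_PER_DAY]:
            tasks.remove(t)
        tasks.extend(new_task() for _ in range(NEW_TASKS_PER_DAY))
    return burndown

burn = simulate()
print(f"release ready after {len(burn) - 1} simulated days")
```

Swapping a feature in and out of `milestone` partway through a run reproduces the upward jumps in the charts: the newly added feature drags a fresh batch of tasks back into the open set.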
Interestingly, look at how straight those downward-sloping lines are. Those are really easy to extrapolate. You don’t even need a real simulator; the simulator is just a way of proving that the simulator is not needed. If the simulator were an AI, it would be having an existential crisis right now. Anyway, those easy-to-extrapolate lines are going to turn out to be kind of important later.
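That straightness is what makes prediction possible: fit a line through the burndown history and see where it crosses zero. A least-squares sketch in plain Python (the function name is mine, not from any real tool):

```python
def predict_finish_day(burndown):
    """Fit a least-squares line through (day, open_tasks) points and return
    the day it crosses zero -- a loose prediction, not a deadline."""
    n = len(burndown)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(burndown) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, burndown))
             / sum((x - x_mean) ** 2 for x in xs))
    if slope >= 0:
        return None  # not burning down; no prediction possible
    intercept = y_mean - slope * x_mean
    return -intercept / slope

# A queue burning down steadily from 100 tasks at ~4/day net:
print(predict_finish_day([100 - 4 * d for d in range(10)]))  # → 25.0
```

On real (noisy) burndown data the fit won’t be exact, but as long as the line slopes downward, the crossing point is a defensible estimate that updates itself every day.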
For now, note the effect on release date of the various patterns of PMs changing their minds. Each jump upward on the chart is caused by a PM dropping one feature from the release, and swapping in a new one instead. It’s always drop one, add one. The ideal case is the top one: make up your mind at the start and let people finish. That finishes the soonest. The second one is unfortunate but not too bad: they change their mind once near the start. The third chart shows how dramatically more expensive it is to change your mind late in the game: we drop the same feature and add the same other feature, but now the cost is much higher. In fact, the cost is much higher than changing their minds multiple times near the beginning, shown in the fourth chart, where we make little progress at all during the bickering, but then can go pretty fast. And of course, all too common, we have the fifth chart, where the PMs change their minds every couple of months (again, dropping and adding one feature each time). When that happens, we literally never finish. Or worse, we get frustrated and lower the quality, or we just ship whatever we have when Whatever Conference comes.
The purported quote near the top, “When the facts change, I change my mind,” I
included because of its implied inverse: “When the facts don’t change, I
don’t change my mind.” One of the great product management diseases is
that we change our minds when the facts don’t. There are so many opinions,
and so much debate, and everyone feels that they can question everyone else
(note: which is usually healthy), that we sometimes don’t stick to decisions
even when there is no new information. And this is absolutely paralyzing.
If you haven’t even launched – in the picture above, you haven’t even hit a
new release milestone – then how much could conditions possibly have
changed? Sometimes a competitor will release something unexpected, or the
political environment will change, or something. But usually that’s not the
case. Usually, as they don’t say in warfare, your strategy does survive
until first contact with the enemy (er, the customer).
It’s when customers get to try it that everything falls out from under you.
But when customers get to try it, we’ve already shipped! And if we’ve been careful only to work on shipping features until then, when we inevitably must change our minds based on customer feedback, we won’t have wasted much time building the wrong things.
If you want to know what Tesla does right and most of us do wrong, it’s this: they ship something small, as fast as they can. Then they listen. Then they make a decision. Then they stick to it. And repeat.
They don’t make decisions any better than we do. That’s key. It’s not the quality of the decisions that matters. Well, I mean, all else being equal, higher quality decisions are better. But even if your decisions aren’t optimal, sticking to them, unless you’re completely wrong, usually works better than changing them.
If you only take one thing away from reading this talk, it should be that. Make decisions and stick to them.
Now, about decision quality. It’s good to make high quality decisions. Sometimes that means you need to get a lot of people involved, and they will argue, and that will take some time. Okay, do that. But do it like chart #4, where the PMs bicker for literally 50 days straight and we run in circles, but the launch delay was maybe 30 days, less than the 50 you spent. Get all the data up front, involve all the stakeholders, make the best decision you can, and stick to it. That’s a lot better than chart #3 (just a single change in direction!) or #5 (blowing in the wind).
Sorry for the rant in that last slide. Moving on, we have more ideas to steal, this time from Kanban.
For people who don’t know, Kanban was another thing computer people copied from Japanese car manufacturers of the 1950s. They, too, used actual paper index cards, although they used them in a really interes