[Simon Tatham, 2023-09-01]
- Introduction
- Why I’m so enthusiastic about coroutines
- Techniques for getting the most out of coroutines
- Coroutine paradigms
- Conclusion
- Footnotes
Introduction
I’ve been a huge fan of coroutines since the mid-1990s.
I encountered the idea as a student, when I first read The
Art of Computer Programming. I’d been programming already
for most of my childhood, and I was completely blown away by the
idea, which was entirely new to me. In fact, it’s probably not
an exaggeration to say that in the whole of TAOCP that
was the single thing that changed my life the most.
Within a few months of encountering the idea for the first
time, I found myself wanting to use it in real code, during a
vacation job at a (now defunct) tech company. Sadly I was coding
in C at the time, which didn’t support coroutines. But I didn’t
let that stop me: I came up with a C preprocessor trick that
worked well enough to fake them, and did it anyway.
The trick I came up with has limitations, but it’s reliable
enough to use in serious code. I wrote it up a few years later
in the
article ‘Coroutines
in C’, and I’ve been using it ever since, in both C and C++,
whenever the code I’m trying to write looks as if it would be
most naturally expressed in that style.
Or rather, I do that in code bases where I can get away
with it, because it’s unquestionably a weird and unorthodox
style of C by normal standards, and not everyone is prepared to
take it in their stride. Mostly I limit my use of C-preprocessor
coroutines to free software projects where I’m the lead or only
developer (in particular, PuTTY and spigot),
because there, I get to choose what’s an acceptable coding style
and what’s not. The only time I’ve ever used the same technique
in code I wrote for an employer was in that original vacation
job where I invented the trick – and I’m sure I only got away
with it that time because my team didn’t practise code
review.
Of course, I’m also happy to use coroutine-style idioms in any
language where you don’t have to resort to trickery,
such as using generators in Python whenever it seems like a good
idea.
Whenever I can’t use coroutines, I feel limited,
because it’s become second nature to me to spot parts of a
program that want to be written in coroutine style, and
would be more awkward to express in other ways. So I’m very much
in favour of coroutines becoming more mainstream – and I’ve been
pleased to see support appearing in more and more languages over
the decades. At the time of writing this, Wikipedia has
an impressive
list of languages that now include coroutine support; from
my own perspective, the big ones are Python generators, and the
addition of a coroutine system to C++20.
I recently got round
to learning
about the C++20 system in detail. In the course of that, I
had conversations about coroutines with several friends and
colleagues, which made me think more seriously about the fact
that I’m so fond of them, and wonder if I had anything more
specific to say about that than ‘they’re really useful’.
It turned out that I did. So this article presents my
‘philosophy of coroutines’: why I find them so useful,
what kinds of thing they’re useful for, tricks I’ve learned for
getting the most out of them, and even a few different ways of
thinking about them.
But it discusses coroutines as a general concept, not specific
to their implementation in any particular language. So where I
present code snippets, they’ll be in whatever language is
convenient to the example, or even in pure pseudocode.
Why I’m so enthusiastic about coroutines
Go on then, why do I like coroutines so much?
Well, from first principles, there are two possibilities.
Either it’s because they really are awesome, or it’s because
they suit my personal taste. (Or a combination of the two, of
course.)
I can think of arguments for both of those positions, but I
don’t think I have the self-awareness to decide which is more
significant. So I’ll just present both sets of arguments, and
let you make up your own mind about whether you end up also
thinking they’re awesome.
The objective view: what makes them useful?
To answer ‘what makes coroutines useful?’, it helps to start
by deciding what we’re comparing them to. I can’t answer
‘why is a coroutine more useful than X?’ without knowing
what X we’re talking about. And there’s more than one
choice.
Versus explicit state machines
Two decades ago, when I wrote
‘Coroutines
in C’, I included an explanation of why coroutines are
useful, based on the idea that the alternative was to write an
ordinary function containing an explicit state machine.
Suppose you have a function that consumes a stream of values,
and due to the structure of the rest of your program, it has to
receive each individual value via a separate function call:
function process_next_value(v) {
    // ...
}
And suppose the nature of the stream is such that the way you
handle each input value depends on what the previous values
were. I’ll show an example similar to the previous article,
which is a simple run-length decompressor processing a stream of
input bytes: the idea is that most input bytes are emitted
literally to the output, but one byte value is special, and is
followed by a (length, output byte) pair indicating that the
output byte should be repeated a number of times.
To write that as a conventional function accepting a single
byte on each call, you have to keep a state variable of some
kind to remind you of what you were going to use the next byte
for:
function process_next_byte(byte) {
    if (state == TOP_LEVEL) {
        if (byte != ESCAPE_BYTE) {
            // emit the byte literally and stay in the same state
            output(byte);
        } else {
            // go into a state where we expect to see a run length next
            state = EXPECT_RUN_LENGTH;
        }
    } else if (state == EXPECT_RUN_LENGTH) {
        // store the length
        run_length = byte;
        // go into a state where we expect the byte to be repeatedly output
        state = EXPECT_OUTPUT;
    } else if (state == EXPECT_OUTPUT) {
        // output this byte the right number of times
        for i in (1,...,run_length)
            output(byte);
        // and go back to the top-level state for the next input byte
        state = TOP_LEVEL;
    }
}
(This is pseudocode, so I’m not showing the details of how you
arrange for state and run_length to be preserved between calls
to the function. Whatever’s convenient in your language. These
days they’re probably member variables of a class, in most
languages.)
This is pretty unwieldy code to write, because structurally, it
looks a lot like a collection of blocks of code connected by
‘goto’ statements. They’re not literally goto statements – each
one works by setting state to some value and returning from the
function – but they have the same semantic effect, of stating by
name which part of the function is going to be run next.
The more complicated the control structure is, the more
cumbersome it becomes to write code in this style, or to read
it. Suppose I wanted to enhance my run-length compression format
so that it could efficiently represent repetitions of
a sequence of bytes, not just AAAAAAA but ABABABAB or
ABCDEABCDE? Then I
wouldn’t just need an output loop to emit n
copies of something; I’d also need a loop on the input side,
because the input format would probably contain an ‘input
sequence length’ byte followed by that many bytes of literal
input. But I can’t write that one as a for
statement, because it has to return to the caller to get each
input byte; instead, I’d have to build a ‘for loop’ structure
out of these small building blocks.
The more code you write in this style, the more you probably
wish that the caller and callee were the other way round. If you
were writing the same run-length decompression code in a context
where you called a function to read the next byte, it
would look a lot more natural (not to mention shorter), because
you could call the get-byte function in multiple places:
function run_length_decompress(stream) {
    while (byte = stream.getbyte()) {
        if (byte != ESCAPE_BYTE) {
            output(byte);
        } else {
            run_length = stream.getbyte();
            byte_to_repeat = stream.getbyte();
            for i in (1,...,run_length)
                output(byte_to_repeat);
        }
    }
}
This function actually contains the same state machine as the
previous version – but it’s entirely implicit. If you want to
know what each piece of code will do when its next input byte
arrives (say, during debugging), then in the state-machine
version you answer it by asking for the value of state – but in
this version, you’d answer it by looking at a stack backtrace to
find out which of the calls to stream.getbyte() the byte will be
returned from, because that’s what controls where execution will
resume from.
In other words, each of the three state values TOP_LEVEL,
EXPECT_RUN_LENGTH and EXPECT_OUTPUT in the state-machine code
corresponds precisely to one of the three calls to
stream.getbyte() in this version. So there’s still a piece of
data in the program tracking which of those ‘states’ we’re in –
but it’s the program counter, and it doesn’t need to be a named
variable with a set of explicit named or numbered state values.
And if you needed to extend this version of the code
so that it could read a list of input bytes to repeat, it would
be no trouble at all, because there’s nothing to stop you
putting a call to stream.getbyte()
in a loop:
num_output_repetitions = stream.getbyte();
num_input_bytes = stream.getbyte();
for i in (1,...,num_input_bytes)
    input_bytes.append(stream.getbyte());
for i in (1,...,num_output_repetitions)
    for byte in input_bytes
        output(byte);
and you wouldn’t have to mess around with inventing extra
values in your state enumeration, coming up with names for them,
checking the names weren’t already used, renaming other states
if their names had become ambiguous, carefully writing
transitions to and from existing states, etc. You can just
casually write a loop, and the programming language takes care
of the rest.
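For the record, here’s how that extension might look in runnable form – a hedged Python sketch in which getbyte stands in for stream.getbyte(), and the byte values are invented for the example:

```python
def decode_repeat(getbyte, output):
    # Handle the extended escape sequence: a repetition count, then a
    # count of literal input bytes, then that many bytes to repeat.
    num_output_repetitions = getbyte()
    num_input_bytes = getbyte()
    input_bytes = [getbyte() for _ in range(num_input_bytes)]
    for _ in range(num_output_repetitions):
        output.extend(input_bytes)

data = iter([3, 2, 65, 66])  # repeat the 2-byte sequence 'AB' 3 times
out = []
decode_repeat(lambda: next(data), out)
```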
(I tend to think of the state-machine style as ‘turning the
function inside out’, in a more or less literal sense: the calls
to a ‘get byte’ function ought to be nested
deeply inside the code’s control flow, but instead,
we’re forced to make them the outermost layer, so that
‘now wait for another byte’ consists of control falling off the
end of the function, and ‘ok, we’ve got one now, carry on’ is at
the top.)
Using coroutines, you can write the code in this latter style,
with the explicit state machine replaced by the implicit program
counter, even if the rest of the program needs to work
by passing each output byte to a function call.
In other words, you can write this part of the program ‘the way
it wants to be written’ – the way that is most convenient for
the nature of what the code has to do. And you didn’t have to
force the rest of the program to change its structure
to compensate. That might have required equally awkward
contortions elsewhere, in which case you’d only have moved the
awkwardness around, and not removed it completely.
Coroutines give you greater freedom to choose the structure of
each part of your program, independently of the other parts. So
each part can be as clear as possible.
Coroutines mean never having to turn your code inside
out.
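In Python, for instance, a generator can play exactly this role: used in ‘push’ mode via send(), the rest of the program still delivers one byte per function call, but the decompressor keeps its natural shape. (ESCAPE_BYTE and the list collecting output are inventions for the sketch.)

```python
ESCAPE_BYTE = 0xFF  # assumed escape value, just for the sketch

def run_length_decompressor(output):
    # Each 'yield' marks a point where we suspend and wait for the
    # caller to send() us the next input byte: the program counter
    # plays the role of the explicit state variable.
    while True:
        byte = yield
        if byte != ESCAPE_BYTE:
            output.append(byte)
        else:
            run_length = yield
            byte_to_repeat = yield
            output.extend([byte_to_repeat] * run_length)

out = []
decomp = run_length_decompressor(out)
next(decomp)  # prime the generator: run it to the first 'yield'
for b in [65, ESCAPE_BYTE, 3, 66]:
    decomp.send(b)
```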
Versus conventional threads
In the 1990s and early 2000s, that was the only argument I felt
I needed in favour of coroutines. But now it’s 2023, and
coroutines have another competitor: multithreading.
Another way to solve the code-structure dilemma in the previous
section would be to put the decompression code into its own
thread. You’d have to set up a thread-safe method of giving it
its input values – say, some kind of lock-protected queue or ring
buffer, which would cause the decompressor thread to block if
the queue was empty, and the thread providing the data to block
if it was full. But then there would be no reason
to pretend either side was making a function call when
it really wasn’t. Each side really would be making a
function call to provide an output byte or read an input byte,
and there would be no contradiction, because the call stacks of
the two threads are completely independent.
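As a concrete (and hedged) illustration, here’s roughly what that looks like in Python: the decompressor is ordinary blocking code in its own thread, fed through a bounded queue.Queue, with a sentinel value marking end of input. All the names here are inventions for the sketch.

```python
import queue
import threading

ESCAPE_BYTE = 0xFF
SENTINEL = None  # marks end of input

def decompressor_thread(inq, output):
    def getbyte():
        return inq.get()  # blocks while the queue is empty
    while (byte := getbyte()) is not SENTINEL:
        if byte != ESCAPE_BYTE:
            output.append(byte)
        else:
            run_length = getbyte()
            byte_to_repeat = getbyte()
            output.extend([byte_to_repeat] * run_length)

inq = queue.Queue(maxsize=16)  # producer blocks when this fills up
out = []
t = threading.Thread(target=decompressor_thread, args=(inq, out))
t.start()
for b in [65, ESCAPE_BYTE, 3, 66, SENTINEL]:
    inq.put(b)
t.join()
```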
One of my software projects, spigot,
is a complicated C++ program completely full of coroutines that
generate streams of data. I was chatting to a friend recently
about the possibility – in principle – of rewriting it in Rust.
But I don’t actually know a lot about Rust yet (if I did try
this, it would be a learning experience and a half!), and so of
course I asked: ‘What about all the coroutines? Would I
implement those using async Rust in some way, or what?’
My Rust-programmer friend replied: no, don’t even bother. Just
use threads. Turn each of spigot’s coroutines into
ordinary non-async Rust code, and run it in its own thread,
passing its output values to some other thread which will block
until a value is available. It’ll be fine. Threads are
cheap.
I have to start by acknowledging my personal biases:
my first reaction to that attitude is purely emotional.
It just gives me the heebie-jeebies.
My own programming style has always been single-threaded
wherever possible. I mostly only use extra threads when I really
can’t avoid it, usually because some operating system or library
API makes it impossible, or astonishingly awkward, to
do a thing any other way.
I’ve occasionally gone as far as parallelising a large number of
completely independent computations using some convenient
ready-made system like Python multiprocessing, but I’m at least
as likely to do that kind of thing by putting each computation
into a completely separate process, and parallelising them at
the process level, using a make-type tool or GNU parallel.
Partly, that’s because I still think of threads as heavyweight.
On Linux, they consume a process id, and those have a limited
supply. Switching between threads requires a lot more saving and
restoring of registers than an ordinary function call,
and also needs a transfer from user mode to kernel mode
and back, with an extra set of saves and restores for that. The
data structures you use to transfer data values between the
threads must be protected by a locking mechanism, which costs
extra time and faff. Perhaps all of these considerations are
minor in 2023, whereas they were significant in 2000? I still
feel as if I wouldn’t want to commit to
casually making 250 threads in the course of a single spigot
computation (which is about the number of coroutines it
constructs in the course of computing a few hundred digits of
e^π − π), or to doing a kernel-level thread switch every single
time one of those threads passes a tuple of four integers to
another. spigot is slow enough as it is – and one
of my future ambitions for it is to be able to parallelise 1000
of those computations! But I have to admit that I
haven’t actually tried this strategy and benchmarked it.
So maybe I have an outdated idea of what it would
cost.
Also, I think of threads as dangerous. You have to do
all that locking and synchronisation to prevent race conditions,
and it’s often complicated, and easy to get wrong. My instinct
is to view any non-trivial use of threads as 100 data-race bugs
waiting to happen. (And if you manage to avoid all of those,
probably half a dozen deadlocks hiding behind them!) Now in
this particular conversation we were talking about
Rust, and that’s a special case on this count, because Rust’s
unique selling point is to guarantee that your program is free
of data races by the time you’ve managed to get it to compile at
all. But in more or less any other language I think I’m still
right to be scared!
So I have strong prejudices against ‘just use real threads,
it’ll be fine’. But I accept that I might be wrong
about those. Supposing I am, are there any other arguments for
using coroutines instead?
One nice feature of everything being in the same system thread
is debuggability. If you’re using an interactive debugger, you
don’t have to worry about which thread you’re debugging, or
which thread a breakpoint applies to: if the whole overall
computation is happening in the same thread then you can just
place a breakpoint in the normal way and it will be hit.
And if you prefer debugging via ‘recompile with print
statements’, that’s also more convenient in a
single-threaded setup where the transfers of control are
explicit, because you don’t have to worry about adding extra
locks to ensure the debug messages themselves are atomic when
multiple threads are producing them. You might not mind having
to do fiddly lock-administration when you’re setting up
the real data flow of your program, but it definitely
slows down investigation if you have to do it for every
throwaway print statement!
But another nice thing about coroutines is that they’re easy
to abandon. If a thread is in the middle of blocking on
a lock, and you suddenly discover you no longer need that thread
(and, perhaps, not the one it was blocking on either, or 25
surrounding threads), then depending on your language’s
threading system, it can get very painful to arrange to
interrupt all those threads and terminate them. Even if you
manage it at all, will all the resources they were holding get
cleaned up? (File descriptors closed, memory freed, etc.)
I don’t think thread systems are generally great at this. But
coroutines are: a suspended coroutine normally has an
identity as some kind of actual object in your programming
language, and if you destroy or delete or free it, then you can
normally arrange for it to be cleaned up in a precise manner,
freeing any resources.
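Python generators make a neat demonstration of this: calling close() on a suspended generator raises GeneratorExit at the yield where it’s paused, so finally blocks (or with blocks) get their chance to release resources. The names in this sketch are invented for the example:

```python
cleaned_up = []  # stands in for files closed, memory freed, etc.

def infinite_stream(name):
    try:
        n = 0
        while True:  # happy to yield values forever
            yield n
            n += 1
    finally:
        cleaned_up.append(name)  # runs even if we're abandoned early

g = infinite_stream("digits")
first, second = next(g), next(g)  # consume a little of the stream
g.close()  # abandon it: GeneratorExit is raised, 'finally' runs now
```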
Using spigot
as an example again: most of its
coroutines run indefinitely, and are prepared to yield an
infinite stream of data if you want them to. At some point the
ultimate consumer of the whole computation decides it’s got
enough output, and destroys the entire web of connected
coroutines. Using C++ with preprocessor-based coroutines, this
is reliably achieved with zero memory leaks. I’d hate to have to
figure out how to terminate 250 actual system threads as
cleanly.
Also, the language object identifying the coroutine instance can
be used for other stunt purposes. I’ll talk more about that in a
later section.
So I still prefer coroutines to threads. But I have to
acknowledge that my reasons are a little bit more borderline
than the reasons in the previous section.
(However, this discussion is all about pre-emptive
threads. Cooperative threads which yield to each other
explicitly are a different matter, and I’ll talk about
those later as well.)
The subjective view: why do I like
them so much?
This is already the kind of essay where I acknowledge my own
bias. So I should also admit the possibility that it’s not so
much that coroutines are objectively amazing, but rather, that
they appealed especially strongly to me personally for
some reason that doesn’t apply to everybody.
I have a couple of theories about what that reason might be, if
there is one.
“Teach the student when the student is ready”
One possibility is that the concept was presented to me at
exactly the right stage in my education.
I’d already been programming for a number of years when I
read TAOCP, so I’d already felt for myself the
frustration of having to write a piece of code ‘inside out’ –
that is, in the form of an explicit state machine that I’d
rather have avoided – and I hadn’t thought of any way to make
that less annoying. So when Donald Knuth helpfully presented me
with one, I pounced on it with gratitude.
But if I’d encountered the idea of coroutines
earlier, before I’d felt that frustration, then I might
not have appreciated it as much. It would have been one among
many concepts that went through my brain as I read the books,
and I’d have thought, ‘ok, fine, I expect that’s useful
for something’, and then I’d have forgotten it (along
with half of the rest of TAOCP, because there’s
a lot of stuff in there!). And perhaps I might not have
managed to recall it later, when the problem did appear in my
life.
Conversely, the idea of coroutines might also have had
less impact on me if I’d learned about it later. I have
a strong suspicion that if I’d spent another (say) five years
writing pieces of code inside out because I had no idea there
was another option, I’d have got much better at writing
code inside out, and much more used to it, and then the
opportunity to avoid having to do it wouldn’t have seemed like
such a huge boon any more.
Also, by that time, I’d probably have thought of some actual
advantages to the inside-out state-machine style of coding, and
started making use of them. There are often silver linings to be
found in situations like this. In this case I can’t say for sure
what they might be (since I never did become that
practised at the skill); but one possibility that springs to
mind is that an explicit list of states might make exhaustive
testing more convenient (you can ensure you’ve tested every
state and every transition). But whatever the compensating
advantage might be, once I’d noticed it, I’d have been reluctant
to lose it for the sake of (as I would have seen it)
not that much extra convenience.
(This is probably related to Paul Graham’s idea of
the Blub
Paradox, in which you look at a less powerful language than
your favourite one and you know it’s less powerful,
because it lacks some feature you’re used to, and you can’t
imagine how you’d get anything done without that. But you look
at a more powerful language that has features yours doesn’t, and
whatever they are, you seem to have been getting along fine
without them – so you conclude that they can’t be that
important, and that the ‘more powerful’ language isn’t that much
better really.)
So perhaps it’s just that coroutines hit me at precisely my
moment of peak frustration with exactly the problem they solve,
and that’s why I adopted them with so much enthusiasm.
They suit my particular idea of code clarity
When you look at an existing piece of code, there are two
different ways you might want to understand it.
One way is: on its own terms, what does this code do, and
how do I modify it to do it differently? For example, in
the decompressor example a few
sections ago, suppose you wanted to look at the code and figure
out what the compressed data format was – either to write a
specification, or to check it against an existing specification.
Or suppose someone had just changed the specification
for the compressed data format, and you had to adjust the code
for the new spec.
For that purpose, it doesn’t really matter how the code fits
into the rest of the program. In fact you don’t even need
to see the rest of the program. You could figure out
the format from either of the code snippets I showed
(the one with a state machine, or the one that
calls stream.getbyte()), and it wouldn’t matter at
all that the rest of the program hasn’t even been written. In
both cases, you can see that the code is getting a stream of
bytes from somewhere, be it a caller or a subroutine;
you can follow the logic through and see what bytes cause what
effects; if you’re updating the code, you just have to follow
the existing idioms for input and output.
But although you can do that level of work on either
version of the code, it’s much easier to do it on the version
that calls a ‘get byte’ function. At least, I think so; if
anyone actually finds the state machine version easier
to work with in that kind of way, I’d be interested to know
why!
But the other way you might want to understand a piece of code
is: how does this fit into the rest of the program, and how
can I change that? If you had no interest at all in the
compressed data format and no reason to think this decompressor
was delivering incorrect output, but you were concerned with
what was being done next with the data being
decompressed, then you’d want to find out not ‘what bytes are
output, and how can I change that?’, but ‘when a byte is output,
where does it go next, and is it the right place, and how do I
reroute it to somewhere else?’.
For this purpose, coroutines often make things less
clear, rather than more. That’s especially true if they’re based
on my C preprocessor system, because a person coming fresh to a
function in that style is going to see some
weirdo crReturn
macro, look up the definition, be
further baffled (it’s fairly confusing if you don’t already
recognise the trick), and probably conclude that I’m throwing
obstacles in their path on purpose! But other less ad-hoc
coroutine systems have the same property; for example, in C++20
coroutines, the mechanism for dealing with yielded data values
lives in a ‘promise class’ type, which is inferred from the type
signature of the coroutine itself by a system of template
specialisations that could be almost anywhere in the source file
or any of its header files. If there isn’t clear documentation
somewhere in your code base, it might be legitimately quite hard
to find out what happens to data after it leaves a C++20
coroutine; you might very well find that the simplest way to
answer the question was to step through the yield in a debugger
and see what happened next.
Which of these kinds of clarity is more important? Of course,
both of them are important, because both are things you
might need to understand, or fix, or update in your code. But if
you spend 90% of your time working on the code in one of those ways,
and 10% of your time in the other way, you’re going to prefer
the kind of clarity that helps you 90% of the time over the kind
that helps you 10% of the time.
And it so happens that my own preference skews towards the
first kind. Perhaps that has to do with the types of program I
work on. All my life I’ve gravitated to programs that solve
complicated problems, because that’s the kind of thing I enjoy
doing. In those programs, the individual subproblems are
complicated and so are the pieces of code that deal with them;
the data flow between them is a small part of the overall
system. So I don’t want to make debugging each sub-piece of code
harder, just for the sake of making the data flow more explicit.
I’d rather each piece made sense on its own terms at the expense
of the data flow being a little more opaque.
This is also one example of a more general question. Is it
better for each program to effectively define its own local
customised variant of the language it’s written in (via very
elaborate support libraries, metaprogramming tricks like macros,
and the like), so that the learning curve takes a long time to
climb, but you can work really effectively once you’ve climbed
it? Or is it better to avoid diverging too much from the
standard ways to use a language, so that a new programmer can
get up to speed quickly?
Again, my preference skews toward the former. I value the
ability to make my own life easier if I’m spending a long time
in a code base. And I don’t much mind climbing the learning
curves for other people’s differently customised coding
environments, because I often learn something in the process
(and maybe steal any good ideas I find). But I know that there
are other people who are much more in favour of sticking to the
orthodoxy.
(However, that’s only an argument against coroutines as long as
they aren’t the orthodoxy. With more mainstream
adoption, perhaps this argument is weakening. Generators in
Python are definitely mainstream now: I’ve occasionally
had a code reviewer tell me to use them more!)
Techniques for getting the most out of coroutines
Of course, another reason why one person might find a language
feature more useful than someone else is that they’re better at
using it: better at spotting cases where it’s useful, better at
getting the most out of it when they do use it. And let’s not
forget the skill of avoiding using it when
it isn’t useful, so that you don’t get annoyed by
situations where it’s not really a good fit.
Here’s a collection of my thoughts on when and how to use
coroutines for best effect.