I want to address a controversy that has gripped the Rust community for the past year or so: the
choice by the prominent async “runtimes” to default to multi-threaded executors that perform
work-stealing to balance work dynamically among their many tasks. Some Rust users are
unhappy with this decision, so unhappy that they use language I would characterize as
melodramatic:
The Original Sin of Rust async programming is making it multi-threaded by default. If premature optimization is the root of all evil, this is the mother of all premature optimizations, and it curses all your code with the unholy Send + 'static, or worse yet Send + Sync + 'static, which just kills all the joy of actually writing Rust.
It’s always off-putting to me that claims written this way can be taken seriously as a technical
criticism, but our industry is rather unserious.
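For readers who haven't hit this bound themselves: a work-stealing executor may move a task to a different worker thread at any await point, so its spawn function requires the future (and its output) to be Send + 'static. A minimal sketch of the effect, assuming Tokio with the rt-multi-thread feature:

```rust
use std::rc::Rc;
use std::sync::Arc;

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        // tokio::spawn is declared roughly as:
        //   pub fn spawn<F>(future: F) -> JoinHandle<F::Output>
        //   where F: Future + Send + 'static, F::Output: Send + 'static
        // because a work-stealing scheduler may move the task between threads.

        let local = Rc::new(41);
        // Does not compile: Rc is not Send, so the future is not Send.
        // tokio::spawn(async move { *local + 1 });
        drop(local);

        // Arc is Send + Sync, so this future satisfies the bound.
        let shared = Arc::new(41);
        let handle = tokio::spawn(async move { *shared + 1 });
        assert_eq!(handle.await.unwrap(), 42);
    });
}
```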
What these people advocate instead is an alternative architecture that they call “thread-per-core.”
They promise that this architecture will be simultaneously more performant and easier to implement.
In my view, the truth is that it may be one or the other, but not both.
(Side note: Some people prefer instead just running single threaded servers, claiming that they are
“IO bound” anyway. What they mean by IO bound is actually that their system doesn’t do enough work
to saturate a single core when written in Rust: if that’s the case, of course write a single
threaded system. We are assuming here that you want to write a system that uses more than one core
of CPU time.)
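As a minimal sketch of that single-threaded option, assuming Tokio's current-thread runtime: because tasks never migrate between threads, spawn_local accepts futures that are not Send, which is exactly the ergonomic relief the quoted complaint is after.

```rust
use std::rc::Rc;
use tokio::runtime::Builder;
use tokio::task::LocalSet;

fn main() {
    // One thread, no work-stealing: tasks stay where they were spawned.
    let rt = Builder::new_current_thread().build().unwrap();
    let local = LocalSet::new();

    local.block_on(&rt, async {
        let not_send = Rc::new(1);
        // spawn_local does not require Send, since the task cannot move threads.
        let handle = tokio::task::spawn_local(async move { *not_send + 1 });
        assert_eq!(handle.await.unwrap(), 2);
    });
}
```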
Thread-per-core
One of the biggest problems with “thread-per-core” is the name itself. All of the multi-threaded
executors that users are railing against are also thread-per-core, in the sense that they create an
OS thread per core and then schedule a variable number of tasks (expected to be far greater than the
number of cores) over those threads; a sketch of that setup follows Enberg’s quote below. As Pekka
Enberg tweeted in response to a comment I made about thread per core:
Thread per core combines three big ideas: (1) concurrency should be handled in userspace instead
of using expensive kernel threads, (2) I/O should be asynchronous to avoid blocking per-core
threads, and (3) data is partitioned between CPU cores to eliminate synchronization cost and data
movement between CPU caches. It’s hard to build high throughput systems without (1) and (2), but
(3) is probably only needed on really large multicore machines.
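To make the earlier point concrete, here is a sketch of how one of the criticized runtimes is typically set up, assuming Tokio with the rt-multi-thread feature: it pins a fixed pool of OS worker threads, one per core (which is also its default), and a work-stealing scheduler balances many more tasks than threads across that pool.

```rust
use tokio::runtime::Builder;

fn main() {
    // One OS worker thread per core; this is also Tokio's default for the
    // multi-threaded runtime, spelled out here for clarity.
    let cores = std::thread::available_parallelism()
        .map(usize::from)
        .unwrap_or(1);
    let rt = Builder::new_multi_thread()
        .worker_threads(cores)
        .build()
        .unwrap();

    rt.block_on(async {
        // Far more tasks than cores; the work-stealing scheduler balances
        // them across the fixed set of worker threads.
        let handles: Vec<_> = (0..10_000u64)
            .map(|i| tokio::spawn(async move { i * 2 }))
            .collect();
        let mut sum: u64 = 0;
        for h in handles {
            sum += h.await.unwrap();
        }
        assert_eq!(sum, 2 * (0..10_000u64).sum::<u64>());
    });
}
```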
Enberg’s paper on performance, which is called “The Impact of Thread-Per-Core
Architecture on Application Tail Latency” (and which I will return to in a moment), is the origin of
the use of the term “thread-per-core.”