About a month ago, the CPython project merged a new implementation strategy for their bytecode interpreter. The initial headline results were very impressive, showing a 10-15% performance improvement on average across a wide range of benchmarks on a variety of platforms.
Unfortunately, as I will document in this post, these impressive performance gains turned out to be primarily due to inadvertently working around a regression in LLVM 19. When benchmarked against a better baseline (such as GCC, clang-18, or LLVM 19 with certain tuning flags), the performance gain drops to around 1-5%, depending on the exact setup.
When the tail-call interpreter was announced, I was surprised and impressed by the performance improvements, but also confused: I’m not an expert, but I’m passingly-familiar with modern CPU hardware, compilers, and interpreter design, and I couldn’t explain why this change would be so effective. I became curious – and perhaps slightly obsessed – and the reports in this post are the result of a few weeks of off-and-on compiling and benchmarking and disassembly of dozens of different Python binaries, in an attempt to understand what I was seeing.
At the end, I will reflect on this situation as a case study in some of the challenges of benchmarking, performance engineering, and software engineering in general.
I also want to be clear that I still think the tail-calling interpreter is a great piece of work, as well as a genuine speedup (albeit more modest than initially hoped). I am also optimistic it’s a more robust approach than the older interpreter, in ways I’ll explain in this post. I also really don’t want to blame anyone on the Python team for this error. This sort of confusion turns out to be very common – I’ve certainly misunderstood many a benchmark myself – and I’ll have some reflections on that topic at the end.
In addition, the impact of the LLVM regression doesn’t seem to have been known prior to this work (and the bug still isn’t fixed, as of this writing); thus, in that sense, the alternative (without this work) probably really was 10-15% slower for builds using clang-19 or newer. For instance, Simon Willison reproduced the 10% speedup “in the wild,” as compared to Python 3.13, using builds from python-build-standalone.
Here are my headline results. I benchmarked several builds of the CPython interpreter, using multiple different compilers and different configuration options, on two machines: an Intel server (a Raptor Lake i5-13500 I maintain in Hetzner), and my Apple M1 Macbook Air. You can reproduce these builds using my nix configuration, which I found essential for managing so many different moving pieces at once.
All builds use LTO and PGO. These configurations are:
- clang18: Built using Clang 18.1.8, using computed gotos.
- gcc (Intel only): Built with GCC 14.2.1, using computed gotos.
- clang19: Built using Clang 19.1.7, using computed gotos.
- clang19.tc: Built using Clang 19.1.7, using the new tail-call interpreter.
- clang19.taildup: Built using Clang 19.1.7, using computed gotos plus some -mllvm tuning flags which work around the regression (see the sketch after this list).
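To make these configurations concrete, here is a rough sketch of the kinds of build invocations involved. This is not my actual nix configuration: the configure options are, to the best of my knowledge, the relevant CPython flags, but the -mllvm workaround flag and its value in particular are assumptions shown for illustration only.

# clang18 / clang19 / gcc: classic computed-goto interpreter, with LTO + PGO
$ CC=clang-18 ./configure --enable-optimizations --with-lto
$ make

# clang19.tc: the new tail-call interpreter
$ CC=clang-19 ./configure --enable-optimizations --with-lto --with-tail-call-interp
$ make

# clang19.taildup: computed gotos, plus raised tail-duplication limits
# (flag name and value are illustrative assumptions)
$ CC=clang-19 CFLAGS="-mllvm -tail-dup-pred-size=5000" \
    ./configure --enable-optimizations --with-lto
$ make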
I’ve used clang18 as the baseline, and report the bottom-line “average” produced by pyperformance/pyperf compare_to. You can find the complete output files and reports on GitHub.
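For reference, those averages come from running the pyperformance suite against each build and comparing the resulting JSON files with pyperf. A minimal invocation looks roughly like the following; the file names are placeholders and the exact options are a sketch rather than my precise commands.

$ pyperformance run --python=${clang18}/bin/python3.14 -o clang18.json
$ pyperformance run --python=${clang19}/bin/python3.14 -o clang19.json
$ python -m pyperf compare_to clang18.json clang19.json --table

The table below summarizes the comparisons.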
| Platform | clang18 | clang19 | clang19.taildup | clang19.tc | gcc |
|---|---|---|---|---|---|
| Raptor Lake i5-13500 | (ref) | 1.09x slower | 1.01x faster | 1.03x faster | 1.02x faster |
| Apple M1 Macbook Air | (ref) | 1.12x slower | 1.02x slower | 1.00x slower | N/A |
Observe that the tail-call interpreter still exhibits a speedup as compared to clang-18, but that it’s far less dramatic than the slowdown from moving to clang-19. The Python team has also observed larger speedups than I have (after accounting for the bug) on some other platforms.
A brief background 🔗︎
A classic bytecode interpreter consists of a switch statement inside of a while loop, looking something like so:
while (true) {
    opcode_t this_op = bytecode[pc++];
    switch (this_op) {
    case OP_IMM: {
        // push an immediate onto the stack
        break;
    }
    case OP_ADD: {
        // handle the add
        break;
    }
    // etc
    }
}
Most compilers will compile the switch into a jump table – they will emit a table containing the address of each case OP_xxx block, index into it with the opcode, and perform an indirect jump.
It’s long been known that you can speed up a bytecode interpreter of this style by replicating the jump-table dispatch into the body of each opcode. That is, instead of ending each opcode with a jmp loop_top, each opcode contains a separate instance of the “decode next instruction and index through the jump table” logic.
Modern C compilers support taking the address of labels, and then using those labels in a “computed goto,” in order to implement this pattern. Thus, many modern bytecode interpreters, including CPython (before the tail-call work), employ an interpreter loop that looks something like:
static void *opcode_table[256] = {
    [OP_IMM] = &&TARGET_IMM,
    [OP_ADD] = &&TARGET_ADD,
    // etc
};

#define DISPATCH() goto *opcode_table[bytecode[pc++]]

    DISPATCH();

TARGET_IMM: {
    // push an immediate onto the stack
    DISPATCH();
}

TARGET_ADD: {
    // handle the add
    DISPATCH();
}
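For contrast, here is a minimal sketch of the tail-call style of dispatch that the new interpreter is built on. This is not CPython’s actual code (the real interpreter is generated, carries much more state, and uses additional attributes); it only illustrates the shape of the technique, assuming a compiler that supports guaranteed tail calls via Clang’s musttail attribute:

#include <stddef.h>
#include <stdint.h>

enum { OP_IMM, OP_ADD /* etc */ };

typedef struct vm_state vm_state;
typedef void (*op_handler)(vm_state *vm);

struct vm_state {
    const uint8_t *bytecode;
    size_t pc;
    // value stack, frame state, etc.
};

static void op_imm(vm_state *vm);
static void op_add(vm_state *vm);

static const op_handler opcode_table[256] = {
    [OP_IMM] = op_imm,
    [OP_ADD] = op_add,
    // etc
};

// Each handler ends with a guaranteed tail call into the next handler,
// so there is no central dispatch loop to return to.
#define DISPATCH(vm) \
    __attribute__((musttail)) return opcode_table[(vm)->bytecode[(vm)->pc++]](vm)

static void op_imm(vm_state *vm) {
    // push an immediate onto the stack
    DISPATCH(vm);
}

static void op_add(vm_state *vm) {
    // handle the add
    DISPATCH(vm);
}

Because every handler ends in an explicit, guaranteed tail call, the dispatch jump exists once per opcode by construction, rather than being something the optimizer may or may not choose to duplicate; that is one reason to expect this approach to be more robust than the computed-goto pattern.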
Computed goto in LLVM 🔗︎
For performance reasons (performance of the compiler, not of the generated code), it turns out that Clang/LLVM internally merges all of the gotos in the latter code into a single indirectbr LLVM instruction, which each opcode jumps to. That is, the compiler takes our hard work and deliberately rewrites it into a control-flow graph that looks essentially the same as the switch-based interpreter!
Then, during code generation, LLVM performs “tail duplication,” and copies the branch back into each location, restoring the original intent. This dance is documented, at a high level, in an old LLVM blog post introducing the new implementation.
The LLVM 19 regression 🔗︎
The whole reason for the deduplicate-then-copy dance is that, for technical reasons, creating and manipulating a control-flow graph containing many indirectbr instructions can be quite expensive.
In order to avoid catastrophic slowdowns (or memory usage) in certain cases, LLVM 19 imposed limits on the tail-duplication pass, causing it to bail out if duplication would blow up the size of the IR past certain thresholds.
Unfortunately, on CPython those limits resulted in Clang leaving all of the dispatch jumps merged, entirely defeating the purpose of the computed-goto implementation! This bug was first identified by another language implementation with a similar interpreter loop, but had not (as far as I can find) been known to affect CPython.
In addition to the performance impact, we can observe the bug directly by disassembling the resulting object code and counting the number of distinct indirect jumps:
$ objdump -S --disassemble=_PyEval_EvalFrameDefault ${clang18}/bin/python3.14 |
    egrep -c 'jmp\s+\*'
332
$ objdump -S --disassemble=_PyEval_EvalFrameDefault ${clang19}/bin/python3.14 |
    egrep -c 'jmp\s+\*'
3
Further weirdness 🔗︎
I am confident that the change to the tail-call duplication logic caused the regression: if you fix it, performance matches clang-18. However, I can’t fully explain the magnitude of the regression.
11 Comments
IshKebab
Very interesting! Clearly something else is going on though if the 2% vs 9% thing is true.
jeeybee
Kudos to the author for diving in and uncovering the real story here. The Python 3.14 tail-call interpreter is still a nice improvement (any few-percent gain in a language runtime is hard-won), just not a magic 15% free lunch. More importantly, this incident gave us valuable lessons about benchmarking rigor and the importance of testing across environments. It even helped surface a compiler bug that can now be fixed for everyone’s benefit. It’s the kind of deep-dive that makes you double-check the next big performance claim. Perhaps the most thought-provoking question is: how many other “X% faster” results out there are actually due to benchmarking artifacts or unknown regressions? And how can we better guard against these pitfalls in the future?
asicsp
Related discussions:
https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-… –> https://news.ycombinator.com/item?id=42999672 (66 points | 25 days ago | 22 comments)
https://blog.reverberate.org/2025/02/10/tail-call-updates.ht… –> https://news.ycombinator.com/item?id=43076088 (124 points | 18 days ago | 92 comments)
tempay
Trying to assess the performance of a Python build is extremely difficult, as there are a lot of build tricks you can do to improve it. Recently the Astral folks ran into this, showing how the conda-forge build is notably faster than most others:
https://github.com/astral-sh/python-build-standalone/pull/54…
I'd be interested to know how the tail-call interpreter performs with other build optimisations that exist.
vkazanov
So, the compiler is tinkering with the way the loop is organised so the whole tail-call interpreter thing is not as effective as announced… Not surprised.
1. CPU arch (and arch version) matters a lot. The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally. C was never meant to support this.
2. The C abstract machine is also not low-level enough to express the intent properly. Any implementation becomes supersensitive to a particular compiler's (and compiler version's) quirks.
Certain paranoid interpreter implementations go back to writing assembly directly. LuaJIT is famous for implementing a macro system to make its super-efficient assembly loop portable across architectures. This is also why I find it fun to tinker with these!
Anyway, a few years ago I put together an article and a test of popular interpreter loop implementation approaches:
https://github.com/vkazanov/bytecode-interpreters-post
kryptiskt
This is a very good example of how C is not "close to the machine" or "portable assembly": modern optimizers will make drastic changes to the logic as long as it has no observable effect.
As stated in the post: "Thus, we end up in this odd world where clang-19 compiles the computed-goto interpreter “correctly” – in the sense that the resulting binary produces all the same value we expect – but at the same time it produces an output completely at odds with the intention of the optimization. Moreover, we also see other versions of the compiler applying optimizations to the “naive” switch()-based interpreter, which implement the exact same optimization we “intended” to perform by rewriting the source code."
MattPalmer1086
Benchmarking is just insanely hard to do well. There are so many things which can mislead you.
I recently discovered a way to make an algorithm about 15% faster. At least, that's what all the benchmarks said. At some point I duplicated the faster function in my test harness, but did not call the faster version, just the original slower one… And it was still 15% faster. So code that never executed sped up the original code…!!! Obviously, this was due to code and memory layout issues, moving something so it aligned with some CPU cache better.
It's actually really really hard to know if speedups you get are because your code is actually "better" or just because you lucked out with some better alignment somewhere.
Casey Muratori has a really interesting series about things like this in his substack.
albertzeyer
To clarify: the situation is still not completely understood? It's not just the computed gotos, but there is some other regression in Clang 19? Basically, the difference between clang19.nocg and clang19 is not really clear?
Btw, what about some clang18.tc comparison, i.e. Clang18 with the new tail-call interpreter? I wonder how that compares to clang19.tc.
thrdbndndn
Great article! One detail caught my attention.
In one of the referenced articles, https://simonwillison.net/2025/Feb/13/python-3140a5/, the author wrote: "So 3.14.0a5 scored 1.12 times faster than 3.13 on the benchmark (on my extremely overloaded M2 MacBook Pro)."
I'm quite confused by this. Did the author run the benchmark while the computer was overloaded with other processes? Wouldn't that make the results completely unreliable? I would have thought these benchmarks are conducted in highly controlled environments to eliminate external variables.
motbus3
I recently did some benchmarking from Python 3.9 to 3.13.
Up to 3.11 it only got better, but Python 3.12 and 3.13 were about 10% slower than 3.11.
I thought my homemade benchmark wasn't great enough, so I deployed it to a core service anyway, and I saw the same changes in our collected metrics.
Does anyone else have the same problem?