
Performance of the Python 3.14 tail-call interpreter by signa11

11 Comments

  • Post Author
    IshKebab
    Posted March 10, 2025 at 7:23 am

    Very interesting! Clearly something else is going on, though, if the 2% vs. 9% thing is true.

  • Post Author
    jeeybee
    Posted March 10, 2025 at 7:31 am

    Kudos to the author for diving in and uncovering the real story here. The Python 3.14 tail-call interpreter is still a nice improvement (any few-percent gain in a language runtime is hard-won), just not a magic 15% free lunch. More importantly, this incident gave us valuable lessons about benchmarking rigor and the importance of testing across environments. It even helped surface a compiler bug that can now be fixed for everyone’s benefit. It’s the kind of deep-dive that makes you double-check the next big performance claim. Perhaps the most thought-provoking question is: how many other “X% faster” results out there are actually due to benchmarking artifacts or unknown regressions? And how can we better guard against these pitfalls in the future?

  • Post Author
    unit149
    Posted March 10, 2025 at 7:52 am

    [dead]

  • Post Author
    asicsp
    Posted March 10, 2025 at 7:58 am
  • Post Author
    tempay
    Posted March 10, 2025 at 8:02 am

    Trying to assess the performance of a Python build is extremely difficult, as there are a lot of build tricks you can use to improve it. Recently the Astral folks ran into this when they showed how the conda-forge build is notably faster than most others:

    https://github.com/astral-sh/python-build-standalone/pull/54…

    I'd be interested to know how the tail-call interpreter performs with other build optimisations that exist.
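
    For reference, the main tricks those optimized builds rely on are PGO and LTO; a sketch of the usual configure invocation (exact flags vary by platform and CPython version):

        ./configure --enable-optimizations --with-lto   # --enable-optimizations turns on PGO
        make -j"$(nproc)"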

  • Post Author
    vkazanov
    Posted March 10, 2025 at 8:08 am

    So, the compiler is tinkering with the way the loop is organised, and as a result the whole tail-call interpreter thing is not as effective as announced… Not surprised.

    1. CPU arch (and arch version) matters a lot. The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally. C was never meant to support this.

    2. The C abstract machine is also not low-level enough to express the intent properly. Any implementation becomes supersensitive to a particular compiler's (and compiler version's) quirks.

    Certain paranoid interpreter implementations go back to writing assembly directly. LuaJIT is famous for implementing a macro system to make its superefficient assembly loop implementation portable across architectures. This is also why I find it fun to tinker with these!

    Anyway, a few years ago I put together an article and a set of tests covering the popular interpreter loop implementation approaches:

    https://github.com/vkazanov/bytecode-interpreters-post
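
    For anyone unfamiliar, here is a minimal sketch contrasting the two dispatch styles discussed here (not CPython's actual code; the computed-goto version relies on the GCC/Clang "labels as values" extension):

        #include <stdio.h>

        enum { OP_INC, OP_DEC, OP_HALT };

        /* switch dispatch: a single shared dispatch point */
        static int run_switch(const unsigned char *code) {
            int acc = 0;
            for (;;) {
                switch (*code++) {
                case OP_INC:  acc++; break;
                case OP_DEC:  acc--; break;
                case OP_HALT: return acc;
                }
            }
        }

        /* computed goto: every handler ends with its own dispatch,
           giving the branch predictor one indirect branch per opcode */
        static int run_goto(const unsigned char *code) {
            static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
            int acc = 0;
            goto *dispatch[*code++];
        op_inc:  acc++; goto *dispatch[*code++];
        op_dec:  acc--; goto *dispatch[*code++];
        op_halt: return acc;
        }

        int main(void) {
            const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
            printf("%d %d\n", run_switch(prog), run_goto(prog)); /* prints: 1 1 */
            return 0;
        }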

  • Post Author
    kryptiskt
    Posted March 10, 2025 at 8:23 am

    This is a very good example of how C is not "close to the machine" or "portable assembly": modern optimizers will make drastic changes to the logic as long as they have no observable effect.

    As stated in the post: "Thus, we end up in this odd world where clang-19 compiles the computed-goto interpreter “correctly” – in the sense that the resulting binary produces all the same value we expect – but at the same time it produces an output completely at odds with the intention of the optimization. Moreover, we also see other versions of the compiler applying optimizations to the “naive” switch()-based interpreter, which implement the exact same optimization we “intended” to perform by rewriting the source code."

  • Post Author
    MattPalmer1086
    Posted March 10, 2025 at 8:28 am

    Benchmarking is just insanely hard to do well. There are so many things which can mislead you.

    I recently discovered a way to make an algorithm about 15% faster. At least, that's what all the benchmarks said. At some point I duplicated the faster function in my test harness but never called it; I still called only the original, slower version… and it was still 15% faster. So code that never executed sped up the original code! Obviously this was a code and memory layout effect: something moved so that it lined up better with some CPU cache.

    It's actually really hard to know whether a speedup happens because your code is genuinely "better" or just because you lucked out with some better alignment somewhere.

    Casey Muratori has a really interesting series about things like this on his Substack.
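
    A toy illustration of the layout effect described above (hypothetical; whether and how much the timing shifts depends entirely on compiler, flags, and CPU):

        #include <stdio.h>
        #include <time.h>

        long sum_mod(long n) {               /* the function we actually call */
            long acc = 0;
            for (long i = 0; i < n; i++)
                acc += i % 7;                /* arbitrary work */
            return acc;
        }

        /* Never called. Non-static so the compiler can't discard it; its
           mere presence can move sum_mod() to a different alignment. */
        long sum_mod_copy(long n) {
            long acc = 0;
            for (long i = 0; i < n; i++)
                acc += i % 7;
            return acc;
        }

        int main(void) {
            clock_t t0 = clock();
            long r = sum_mod(200000000L);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("result=%ld time=%.2fs\n", r, secs);
            return 0;
        }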

  • Post Author
    albertzeyer
    Posted March 10, 2025 at 8:30 am

    To clarify: the situation is still not completely understood? It's not just the computed gotos; there is some other regression in Clang 19? Basically, the difference between clang19.nocg and clang19 is not really clear?

    Btw, what about a clang18.tc comparison, i.e. Clang 18 with the new tail-call interpreter? I wonder how that compares to clang19.tc.

  • Post Author
    thrdbndndn
    Posted March 10, 2025 at 8:43 am

    Great article! One detail caught my attention.

    In one of the referenced articles, https://simonwillison.net/2025/Feb/13/python-3140a5/, the author wrote: "So 3.14.0a5 scored 1.12 times faster than 3.13 on the benchmark (on my extremely overloaded M2 MacBook Pro)."

    I'm quite confused by this. Did the author run the benchmark while the computer was overloaded with other processes? Wouldn't that make the results completely unreliable? I would have thought these benchmarks are conducted in highly controlled environments to eliminate external variables.
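
    (For context: tools like pyperf exist to mitigate exactly this kind of noise. A minimal sketch of the usual setup, assuming pyperf is installed:)

        python -m pyperf system tune               # disable turbo boost, pin CPU frequency, etc.
        python -m pyperf timeit "sum(range(1000))"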

  • Post Author
    motbus3
    Posted March 10, 2025 at 9:23 am

    I recently ran some benchmarks across Python 3.9 through 3.13.
    Up to 3.11 it only got better; Python 3.12 and 3.13 were about 10% slower than 3.11.

    I thought my homemade benchmark just wasn't good enough, so I deployed it to a core service anyway, and I saw the same regression in our collected metrics.
    Does anyone else have the same problem?
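
    One reproducible way to check would be the pyperformance suite (a sketch; the output filenames here are made up):

        python3.11 -m pyperformance run -o py311.json
        python3.13 -m pyperformance run -o py313.json
        python -m pyperf compare_to py311.json py313.json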
