I remember the first time I used the v1.0 of Visual Basic. Back then, it was a program for DOS. Before it, writing programs was extremely complex and I’d never managed to make much progress beyond the most basic toy applications. But with VB, I drew a button on the screen, typed in a single line of code that I wanted to run when that button was clicked, and I had a complete application I could now run. It was such an amazing experience that I’ll never forget that feeling.
It felt like coding would never be the same again.
Writing code in Mojo, a new programming language from Modular, is the second time in my life I’ve had that feeling. Here’s what it looks like:
Why not just use Python?
Before I explain why I’m so excited about Mojo, I first need to say a few things about Python.
Python is the language that I have used for nearly all my work over the last few years. It is a beautiful language. It has an elegant core, on which everything else is built. This approach means that Python can (and does) do absolutely anything. But it comes with a downside: performance.
A few percent here or there doesn’t matter. But Python is many thousands of times slower than languages like C++. This makes it impractical to use Python for the performance-sensitive parts of code – the inner loops where performance is critical.
However, Python has a trick up its sleeve: it can call out to code written in fast languages. So Python programmers learn to avoid using Python for the implementation of performance-critical sections, instead using Python wrappers over C, FORTRAN, Rust, etc code. Libraries like Numpy and PyTorch provide “pythonic” interfaces to high performance code, allowing Python programmers to feel right at home, even as they’re using highly optimised numeric libraries.
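To make that concrete, here’s a minimal sketch of my own (not from the original post) of what this wrapping buys you: the pure-Python loop below does every addition and multiplication in the interpreter, while the NumPy call hands the whole array to optimised compiled code in a single step.

import numpy as np

def slow_dot(a, b):
    # Every iteration runs inside the Python interpreter
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

slow = slow_dot(a, b)   # pure Python: millions of interpreter steps
fast = np.dot(a, b)     # one call into optimised, compiled code

Both lines compute the same number, but on a typical machine the second is orders of magnitude faster – which is exactly why the wrappers exist.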
Nearly all AI models today are developed in Python, thanks to the flexible and elegant programming language, fantastic tools and ecosystem, and high performance compiled libraries.
But this “two-language” approach has serious downsides. For instance, AI models often have to be converted from Python into a faster implementation, such as ONNX or torchscript. But these deployment approaches can’t support all of Python’s features, so Python programmers have to learn to use a subset of the language that matches their deployment target. It’s very hard to profile or debug the deployment version of the code, and there’s no guarantee it will even run identically to the Python version.
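As a rough illustration (my own sketch, not from the post), torchscript will happily compile a function that sticks to tensor operations, but code relying on more dynamic Python features may be rejected or behave differently; the exact supported subset depends on the version you’re using.

import torch

@torch.jit.script
def scaled_relu(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Plain tensor ops like this fall inside the scriptable subset
    return torch.relu(x) * scale

print(scaled_relu(torch.randn(3), 2.0))

# By contrast, code that leans on arbitrary Python objects, monkey-patching,
# or dynamic attribute access generally falls outside the subset, so the
# deployed version can diverge from the Python you actually wrote and tested.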
The two-language problem gets in the way of learning. Instead of being able to step into the implementation of an algorithm while your code runs, or jump to the definition of a method of interest, you find yourself deep in the weeds of C libraries and binary blobs. All coders are learners (or at least, they should be) because the field constantly develops, and no-one can understand it all. So these difficulties with learning are a problem for experienced devs just as much as for students starting out.
The same problem occurs when trying to debug code or find and resolve performance problems. The two-language problem means that the tools that Python programmers are familiar with no longer apply as soon as we find ourselves jumping into the backend implementation language.
There are also unavoidable performance problems, even when a faster compiled implementation language is used for a library. One major issue is the lack of “fusion” – that is, calling a bunch of compiled functions in a row leads to a lot of overhead, as data is converted to and from python formats, and the cost of switching from python to C and back repeatedly must be paid. So instead we have to write special “fused” versions of common combinations of functions (such as a linear layer followed by a rectified linear layer in a neural net), and call these fused versions from Python. This means there are a lot more library functions to implement and remember, and you’re out of luck if you’re doing anything even slightly non-standard, because there won’t be a fused version for you.
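Here’s a hypothetical sketch of the problem (mine, with a made-up name for the fused case): each unfused call below crosses the Python/C boundary and writes out a full intermediate tensor, whereas a fused kernel would do the whole computation in one pass over the data.

import torch
import torch.nn.functional as F

def linear_gelu_unfused(x, w, b):
    y = F.linear(x, w, b)   # one call into compiled code, materialises an intermediate
    return F.gelu(y)        # a second call, and a second pass over the data

# A library would instead ship something like fused_linear_gelu(x, w, b)
# (hypothetical name) that does both steps in a single kernel – but only
# for the combinations someone has bothered to hand-write and maintain.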
We also have to deal with the lack of effective parallel processing in Python. Nowadays we all have computers with lots of cores, but Python generally will just use one at a time. There are some clunky ways to write parallel code which uses more than one core, but they either have to work on totally separate memory (and have a lot of overhead to start up) or they have to take it in turns to access memory (the dreaded “global interpreter lock” which often makes parallel code actually slower than single-threaded code!)
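Here’s a small sketch (mine, not the author’s) of the two clunky options: threads share memory but serialise on the GIL for pure-Python work, while processes sidestep the GIL at the cost of separate memory plus start-up and pickling overhead.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # CPU-bound, pure-Python work
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    with ThreadPoolExecutor(4) as ex:            # threads: the GIL means roughly one core is used
        list(ex.map(busy, [10_000_000] * 4))
    with ProcessPoolExecutor(4) as ex:           # processes: four cores, but separate memory
        list(ex.map(busy, [10_000_000] * 4))     # and fork/pickle overhead for every task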
Libraries like PyTorch have been developing increasingly ingenious ways to deal with these performance problems, with the newly released PyTorch 2 even including a compile() function that uses a sophisticated compilation backend to create high performance implementations of Python code. However, functionality like this can’t work magic: there are fundamental limitations on what’s possible with Python, based on how the language itself is designed.
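In use it’s a one-line wrapper – a minimal sketch below, assuming PyTorch 2.x; under the hood it traces the Python, hands it to a compiler backend, and falls back to the interpreter for anything it can’t handle.

import torch

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.GELU())
compiled_model = torch.compile(model)   # PyTorch 2.x: wrap the model for the compile backend

x = torch.randn(32, 128)
y = compiled_model(x)                   # first call triggers compilation, later calls reuse it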
You might imagine that in practice there’s just a small number of building blocks for AI models, and so it doesn’t really matter if we have to implement each of these in C. Besides, they’re pretty basic algorithms on the whole anyway, right? For instance, transformer models are nearly entirely built from multiple layers of two components, multilayer perceptrons (MLP) and attention, which can be implemented with just a few lines of Python with PyTorch. Here’s the implementation of an MLP:
nn.Sequential(nn.Linear(ni,nh), nn.GELU(), nn.LayerNorm(nh), nn.Linear(nh,ni))
…and here’s a self-attention layer:
def forward(self, x):
    x = self.qkv(self.norm(x))
    x = rearrange(x, 'n s (h d) -> (n h) s d', h=self.nheads)
    q,k,v = torch.chunk(x, 3, dim=-1)
    s = (q@k.transpose(1,2))/self.scale
    x = s.softmax(dim=-1)@v
    x = rearrange(x, '(n h) s d -> n s (h d)', h=self.nheads)
    return self.proj(x)
But this hides the fact that real-world implementations of these operations are far more complex. For instance, check out this memory optimised “flash attention” implementation in CUDA C. It also hides the fact that huge amounts of performance are being left on the table by these generic approaches to building models. For instance, “block sparse” approaches can dramatically improve speed and memory use. Researchers are working on tweaks to nearly every part of common architectures, and coming up with new architectures (and SGD optimisers, and data augment