A few months ago, I set myself the challenge of writing a C compiler in 500 lines of Python, after writing my SDF donut post.
How hard could it be?
The answer was, pretty hard, even when dropping quite a few features.
But it was also pretty interesting, and the result is surprisingly functional and not too hard to understand!
There’s too much code for me to comprehensively cover in a single blog post, so I’ll just give an overview of the decisions I made, things I had to cut, and the general architecture of the compiler, touching on a representative piece of each part.
Hopefully after reading this post, the code is more approachable!

Decisions, decisions
The first, and most critical decision, was that this would be a single-pass compiler.
500 lines is too spare to be defining and transforming an abstract syntax tree!
What does that mean?

Most compilers: faffing around with syntax trees
Well, most compilers’ internals look something like this:

The tokens get lexed, then a parser runs over them and builds pretty little syntax trees:
```python
# hypothetical code, not from anywhere
def parse_statement(lexer) -> PrettyLittleSyntaxTree:
    ...
    if type := lexer.try_next(TYPE_NAME):
        variable_name = lexer.next(IDENTIFIER)

        if lexer.try_next("="):
            initializer = parse_initializer(lexer)
        else:
            initializer = None

        lexer.next(SEMICOLON)
        return VariableDeclarationNode(
            type=type,
            name=variable_name,
            initializer=initializer,
        )
    ...

# much later...
def emit_code_for(node: PrettyLittleSyntaxTree) -> DisgustingMachineCode:
    ...
    if isinstance(node, VariableDeclarationNode):
        slot = reserve_stack_space(node.type.sizeof())
        add_to_environment(node.name, slot)

        if node.initializer is not None:
            register = emit_code_for(node.initializer)
            emit(f"mov {register}, [{slot}]")
    ...
```
The important thing here is that there are two passes: first the parsing builds up a syntax tree, then a second pass chews that tree up and turns it into machine code.
That’s really useful for most compilers!
It keeps the parsing and codegen separate, so each can evolve independently.
It also means that you can transform the syntax tree before using it to generate code—for example, by applying optimizations to it.
In fact, most compilers have multiple levels of “intermediate representations” between the syntax tree and codegen!
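To make that concrete, here’s what a tiny optimization pass over the tree might look like. This is purely hypothetical, in the same imagined style as the code above (`ConstantNode` and `AddNode` are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class ConstantNode:
    value: int

@dataclass
class AddNode:
    lhs: object
    rhs: object

def fold_constants(node):
    # collapse constant additions: AddNode(1, 2) becomes ConstantNode(3)
    if (isinstance(node, AddNode)
            and isinstance(node.lhs, ConstantNode)
            and isinstance(node.rhs, ConstantNode)):
        return ConstantNode(node.lhs.value + node.rhs.value)
    return node
```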
This is really great, good engineering, best practices, recommended by experts, etc.
But… it takes too much code, so we can’t do it.
Instead, we’ll be single-pass: code generation happens during parsing.
We parse a bit, emit some code, parse a bit more, emit a bit more code.
So for example, here’s some real code from the `c500` compiler for parsing the prefix `~` op:
```python
# lexer.try_next() checks if the next token is ~, and if so, consumes
# and returns it (truthy)
elif lexer.try_next("~"):
    # prefix() parses and generates code for the expression after the ~,
    # and load_result emits code to load it, if needed
    meta = load_result(prefix())
    # immediately start yeeting out the negation code!
    emit("i32.const 0xffffffff")
    emit("i32.xor")
    # webassembly only supports 32bit types, so if this is a smaller type,
    # mask it down
    mask_to_sizeof(meta.type)
    # return type information
    return meta
```
Notice there are no syntax trees, no `PrefixNegateOp` nodes.
We see some tokens and immediately spit out the corresponding instructions.
You may have noticed those instructions are WebAssembly, which leads us into the next section…

Using WebAssembly, for some reason?
So I decided to make the compiler target WebAssembly.
I honestly don’t know why I did this; it really didn’t make things easier—I guess I was just curious?
WebAssembly is a really weird target, especially for C.
Besides the somewhat-external issues, like spending a lot of time confused before I realized WebAssembly v2 is pretty different from WebAssembly v1, the instruction set itself is weird.
For one, there’s no goto.
Instead, you have blocks—structured assembly, imagine that!—and “break” instructions that jump to either the beginning or end of a specific nesting-level of block.
This was basically inconsequential for `if` and `while`, but made implementing `for` extremely cursed, which we’ll go over later.
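To give a flavor of what that structured branching looks like, here’s a hand-written sketch (not c500 output) of how a `while` loop can lower to blocks and branches:

```wasm
block      ;; branching to this block = exiting the loop
  loop     ;; branching to a loop = jumping back to its start
    ;; ...code for the condition, leaving an i32 on the stack...
    i32.eqz
    br_if 1  ;; condition false: branch out of the enclosing block
    ;; ...code for the loop body...
    br 0     ;; jump back to the top of the loop
  end
end
```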
Additionally, WebAssembly doesn’t have registers; it has a stack, and is a stack machine.
At first you might think that’s awesome, right?
C needs a stack!
We can just use the WebAssembly stack as our C stack!
Nope, because you can’t take references to the WebAssembly stack.
So instead, we need to maintain our own in-memory stack anyways, and then shuffle it on and off of the WASM parameter stack.
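Concretely, even writing to a C local means computing an address into that in-memory stack. A sketch of the kind of code this produces (the offset and value here are made up):

```wasm
;; store 42 into a C local living at offset 8 in the current frame
global.get $__stack_pointer  ;; base of the in-memory C stack frame
i32.const 8
i32.add                      ;; address of the local
i32.const 42                 ;; the value to store
i32.store                    ;; pops value and address, writes to memory
```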
So in the end, I think I ended up with slightly more code than I would have needed to target a more normal ISA like x86 or ARM.
But it was interesting!
And theoretically, you could run code compiled with `c500` in a browser, although I haven’t tried (I just use the `wasmer` CLI).

Error handling
It basically doesn’t.
There’s a function `die`, which is called when anything weird happens and dumps a compiler stack trace—if you’re lucky, you get a line number and a somewhat-vague error message.
```
------------------------------
  File "...compiler.py", line 835, in <module>
    compile("".join(fi)) # todo: make this line-at-a-time?
  File "...compiler.py", line 823, in compile
    global_declaration(global_frame, lexer)
  File "...compiler.py", line 417, in value
    var, offset = frame.get_var_and_offset(varname)
  File "...compiler.py", line 334, in get_var_and_offset
    return self.parent.get_var_and_offset(name)
  File "...compiler.py", line 336, in get_var_and_offset
    die(f"unknown variable {n}", None if isinstance(name, str) else name.line)
  File "...compiler.py", line 14, in die
    traceback.print_stack()
------------------------------
error on line 9: unknown variable c
```
The Rust compiler, this is not :-)
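For completeness, `die` itself is only a few lines. The exact body is an assumption, but judging by the trace and output above, it’s something like:

```python
import sys
import traceback

def die(message: str, line: int | None = None):
    traceback.print_stack()  # the compiler stack trace shown above
    where = f" on line {line}" if line is not None else ""
    print(f"error{where}: {message}", file=sys.stderr)
    sys.exit(1)
```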

What to drop
Finally, I had to decide what not to support, since it just wasn’t feasible to get all of C into 500 lines. (sorry!)
I decided I wanted a really decent sampling of features that tested what the general implementation approach was capable of—for example, if I had skipped pointers, I could have just gotten away with the WASM parameter stack and shed a lot of complexity, but that would have felt like cheating.
I ended up implementing the following features:
- arithmetic operations and binary operators, with proper precedence
- `int`, `short`, and `char` types
- string constants (with escapes)
- pointers (of however many levels), including correct pointer arithmetic (incrementing an `int*` adds 4)
- arrays (only single-level, not `int[][]`)
- functions
- typedefs (and the lexer hack!)
Notably, it doesn’t support:
- structs :-( would be possible with more code, the fundamentals were there, I just couldn’t squeeze it in
- enums / unions
- preprocessor directives (this would probably be 500 lines by itself…)
- floating point. would also be possible, the `wasm_type` stuff is in, again just couldn’t squeeze it in
- 8 byte types (`long` / `long long` or `double`)
- some other small things like pre/post increments, in-place initialization, etc., which just didn’t quite fit
- any sort of standard library or I/O that isn’t returning an integer from `main()`
- casting expressions
The compiler passes 34/220 test cases in the c-testsuite.
More importantly to me, it can compile and run the following program successfully:
```c
int swap(int* a, int* b) {
    int t;
    t = *a; *a = *b; *b = t;
    return t;
}

int fib(int n) {
    int a, b;
    for (a = b = 1; n > 2; n = n - 1) {
        swap(&a, &b);
        b = b + a;
    }
    return b;
}

int main() {
    return fib(10); // 55
}
```
OK, enough about deciding things, let’s get into the code!

Helper types
There’s a small collection of helper types and classes that the compiler uses.
None of them are particularly strange, so I’ll pass over them fairly quickly.

Emitter (compiler.py:21)
This is a singleton helper to emit nicely-formatted WebAssembly code.
WebAssembly, at least the textual format, is formatted as s-expressions, but individual instructions don’t need to be parenthesized:
```wasm
(module
  ;; ...
  (func $swap
    (param $a i32)
    (param $b i32)
    (result i32)
    global.get $__stack_pointer ;; prelude -- adjust stack pointer
    i32.const 12
    i32.sub
    ;; ...
  )
)
```
`Emitter` just helps with emitting code with nice indentation so it’s easier to read. It also has a `no_emit` method, which will be used for an ugly hack later—stay tuned!
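As a rough sketch of the idea (the field and method names here are assumed, not lifted from compiler.py):

```python
# a minimal indentation-tracking emitter, not the real one
class Emitter:
    def __init__(self):
        self.lines: list[str] = []
        self.indent = 0
        self.muted = False  # the kind of switch a no_emit method could flip

    def emit(self, line: str):
        if not self.muted:
            self.lines.append("  " * self.indent + line)
```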

StringPool (compiler.py:53)
`StringPool` holds all the string constants so they can be arranged in a contiguous region of memory, and hands out addresses into that region for the codegen to use. When you write `char *s = "abc"` in `c500`, what really happens is:
- `StringPool` appends a null terminator
- `StringPool` checks if it’s already stored `"abc"`, and if so, just hands that address back
- otherwise, `StringPool` adds it to a dictionary along with the base address + the total byte length stored so far—the address of this new string in the pool
- `StringPool` hands that address back
- when all the code is finished compiling, we create an `rodata` section with the giant concatenated string produced by `StringPool`, stored at the string pool base address (retroactively making all the addresses `StringPool` handed out valid)
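In code, the core of that bookkeeping might look something like this sketch (the names and base address are assumptions, not the real compiler.py):

```python
class StringPool:
    BASE = 0x1000  # assumed base address for the pooled strings

    def __init__(self):
        self.addrs: dict[bytes, int] = {}  # string -> address in the pool
        self.size = 0  # total bytes stored so far

    def add(self, s: str) -> int:
        data = s.encode() + b"\0"   # append the null terminator
        if data in self.addrs:      # already stored? hand the address back
            return self.addrs[data]
        addr = StringPool.BASE + self.size
        self.addrs[data] = addr
        self.size += len(data)
        return addr

    def rodata(self) -> bytes:
        # the giant concatenated blob, placed at BASE once compilation is done
        return b"".join(self.addrs)  # dicts preserve insertion order
```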

Lexer (compiler.py:98)
The `Lexer` class is complex, because lexing C is complex (`(\\([\\abfnrtv'"?]|[0-7]{1,3}|x[A-Fa-f0-9]{1,2}))` is a real regex in that code, for character escapes), but conceptually simple: the lexer marches along identifying what the token at the current position is. The caller can peek that token, or it can use `next` to tell the lexer to advance, “consuming” that token. It can also use `try_next` to conditionally advance only if the next token is a certain kind—basically, `try_next` is a shortcut for `if self.peek().kind == token: return self.next()`.
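Spelled out as a method, that shortcut is roughly (a sketch based on the description above, not the exact source):

```python
def try_next(self, kind):
    # consume and return the next token only if it matches the expected kind
    if self.peek().kind == kind:
        return self.next()
    return None
```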
There’s some additional complexity because of something called the “lexer hack”.
Essentially, when parsing C you want to know if something is a type name or a variable name (because that context matters for compiling certain expressions), but there’s no syntactic distinction between them: `int int_t = 0;` is perfectly valid C, as is `typedef int int_t; int_t x = 0;`.
To know whether an arbitrary token `int_t` is a type name or a variable name, we need to feed type information from the parsing/codegen stage back into the lexer.
This is a giant pain for regular compilers that want to keep their lexer, parser, and codegen modules pure and platonically separate, but it’s actually not very hard for us!
I’ll explain it more when we get to the `typedef` section, but basically we just keep `types: set[str]` in `Lexer`, and when lexing, check if a token is in that set before giving it a token kind:

```python
if m := re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*", self.src[self.loc :]):
    tok = m.group(0)
    ...
    # lexer hack
    return Token(TOK_TYPE if tok in self.types else TOK_NAME, tok, self.line)
```

CType (compiler.py:201)
This is just a dataclass for representing information about a C type, like you’d write in `int **t` or `short t[5]` or `char **t[17]`, minus the `t`.
It contains:
- the type’s name (with any typedefs resolved), such as `int` or `short`
- what level of pointer it is (`0` = not a pointer, `1` = `int *t`, `2` = `int **t`, and so on)
- what the array size is (`None` = not an array, `0` = `int t[0]`, `1` = `int t[1]`, and so on)
Notably, as mentioned before, this type only supports single-level arrays, and not nested arrays like `int t[5][6]`.
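Put together, a sketch of that shape (the field names are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CType:
    typename: str              # e.g. "int", with typedefs already resolved
    pointer_level: int = 0     # 0 = not a pointer, 1 = int *t, 2 = int **t
    array_size: Optional[int] = None  # None = not an array, else the length

# `char **t[17]`, minus the `t`:
c = CType("char", pointer_level=2, array_size=17)
```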

FrameVar and StackFrame (compiler.py:314)
These classes handle our C stack frames.
As I mentioned before, because you can’t take references to the WASM stack, we have to manually handle the C stack; we can’t use the WASM one.
To set up the C stack, the prelude emitted in `__main__` sets up a global `__stack_pointer` variable, and then every function call decrements that by however much space the function needs for its parameters and local variables—calculated by that function’s `StackFrame` instance.
I’ll go over how that calculation works in more detail when we get to parsing functions, but essentially, each parameter and local variable gets a slot in that stack space, and increases `StackFrame.frame_size` (and thus the offset of the next variable) depending on its size. The offset, type information, and other data for each parameter and local variable are stored in a `FrameVar` instance, in `StackFrame.variables`, in order of declaration.
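The slot-assignment bookkeeping boils down to something like this sketch (everything beyond `frame_size` and `variables` is an assumed name, and `sizeof()` is assumed to live on the type, as in the hypothetical code earlier):

```python
from dataclasses import dataclass

@dataclass
class FrameVar:
    type: object  # the variable's CType
    offset: int   # where this variable's slot starts in the frame

class StackFrame:
    def __init__(self, parent=None):
        self.parent = parent
        self.variables: dict[str, FrameVar] = {}  # in declaration order
        self.frame_size = 0

    def add_var(self, name: str, type) -> None:
        # each declaration gets the next slot, and grows the frame
        self.variables[name] = FrameVar(type, offset=self.frame_size)
        self.frame_size += type.sizeof()
```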

ExprMeta (compiler.py:344)
This final dataclass is used to track whether the result of an expression is a value or a place.
We need to keep track of this distinction in order to handle certain expressions differently based on how they’re used.
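Based on how it’s used below, the dataclass itself is tiny; roughly:

```python
from dataclasses import dataclass

@dataclass
class ExprMeta:
    is_place: bool  # True: an address was left on the stack; False: a value
    type: object    # the expression's CType
```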
For example, if you have a variable `x` of type `int`, it can be used in two ways:
- `x + 1` wants the value of `x`, say `1`, to operate on
- `&x` wants the address of `x`, say `0xcafedead`
When we parse the `x` expression, we can easily fetch the address from the stack frame:
```python
# look the variable up in the `StackFrame`
var, offset = frame.get_var_and_offset(varname)
# put the base address of the C stack on top of the WASM stack
emit(f"global.get $__stack_pointer")
# add the offset (in the C stack)
emit(f"i32.const {offset}")
emit("i32.add")
# the address of the variable is now on top of the WASM stack
```
But now what? If we `i32.load` this address to get the value, then `&x` will have no way to get the address. But if we don’t load it, then `x + 1` will try to add one to the address, resulting in `0xcafedeae` instead of `2`!
That’s where `ExprMeta` comes in: we leave the address on the stack, and return an `ExprMeta` indicating this is a place:

```python
return ExprMeta(True, var.type)
```
Then, for operations like `+` that always want to operate on values instead of places, there’s a function `load_result` that turns any places into values:

```python
def load_result(em: ExprMeta) -> ExprMeta:
    """Load a place `ExprMeta`, turning it into a value
    `ExprMeta` of the same type"""
    if em.is_place:
        # emit i32.load, i32.load16_s, etc., based on the type
        emit(em.type.load_ins())
    return ExprMeta(False, em.type)

...
# in the code for parsing `+`
lhs_meta = load_result(parse_lhs())
...
```
Meanwhile, an operation like `&` just doesn’t load the result, and instead leaves the address on the stack: in an important sense, `&` is a no-op in our compiler, since it doesn’t emit any code!

```python
if lexer.try_next("&"):
    meta = prefix()
    if not meta.is_place:
        die("cannot take reference to value", lexer.line)
    # type of &x is int* when x is int, hence more_ptr
    return ExprMeta(False, meta.type.more_ptr())
```
Note also that, despite being an address, the result of `&` isn’t a place! (The code returns an `ExprMeta` with `is_place=False`.) The result of `&` should be treated like a value, since `&x + 1` should add `1` (or rather, `sizeof(x)`) to the address.
That’s why we need the place/value distinction, since just “being an address” isn’t enough to know whether the result of an expression should be loaded.
OK, enough about helper classes.
Let’s move on to the meat of codegen!

Parsing and code generation
The general control flow of the compiler goes like this:
The blue rectangles represent the main functions of the compiler—`__main__`, `compile()`, `global_declaration()`, `statement()`, and `expression()`.