SIMT
The programming model of a GPU uses what has been coined “single instruction multiple thread.” The idea is that the programmer writes their program from the perspective of a single thread, using ordinary scalar variables. So, for example, a programmer might write something like:
int x = threadID;
int y = 6;
int z = x + y;
Straightforward, right? Then, they ask the system to run this program a million times, in parallel, with different threadIDs.
The system *could* simply schedule a million threads to do this, but GPUs do better. Instead, the compiler transparently rewrites the program to use vector registers and instructions in order to run multiple “threads” at the same time. So, imagine you have a vector register, where each item in the vector represents a scalar from a particular “thread.” In the above program, x corresponds to a vector of [0, 1, 2, 3, etc.] and y corresponds to a vector of [6, 6, 6, 6, etc.]. Then, the operation x + y is simply a single vector add of the two vectors. This way, performance can be dramatically improved, because a single vector operation is usually significantly faster than performing each scalar operation one-by-one.
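To make that concrete, here’s a rough sketch, in plain C, of the program the compiler effectively produces, with arrays standing in for vector registers. WIDTH – the number of “threads” packed into one vector – is a made-up parameter here; real hardware fixes it:

#include <stdint.h>
#define WIDTH 8

/* The three-line program above, transformed: each local variable
   becomes one vector, and each operation becomes one vector operation. */
void program_vectorized(uint64_t firstThreadID, uint64_t z[WIDTH]) {
    uint64_t x[WIDTH], y[WIDTH];
    for (int lane = 0; lane < WIDTH; lane++)
        x[lane] = firstThreadID + (uint64_t)lane; /* x = [0, 1, 2, 3, ...] */
    for (int lane = 0; lane < WIDTH; lane++)
        y[lane] = 6;                              /* y = [6, 6, 6, 6, ...] */
    for (int lane = 0; lane < WIDTH; lane++)
        z[lane] = x[lane] + y[lane];              /* one vector add */
}

Each C loop here stands for a single vector instruction; running the program a million times means calling this once per group of WIDTH thread IDs.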
(This is in contrast to SIMD, or “single instruction multiple data,” where the programmer explicitly uses vector types and operations in their program. The SIMD approach is suited for when you have a single program that has to process a lot of data, whereas SIMT is suited for when you have many programs and each one operates on its own data.)
SIMT gets complicated, though, when you have control flow. Imagine the program did something like:
if (threadID < 4) {
doSomethingObservable();
}
Here, the system has to behave as-if threads 0-3 executed the “then” block, but also behave as-if threads 4-n didn’t execute it. And, of course, threads 0-3 want to take advantage of vector operations – you don’t want to pessimize and run each thread serially. So, what do you do?
Well, the way that GPUs handle this is by using predicated instructions. There is a bitmask which indicates which “threads” are alive: within the above “then” block, that bitmask will have value 0xF. All the vector instructions then use this bitmask to determine which elements of the vector they should actually operate on. So, if the bitmask is 0xF and you execute a vector add, the add is only performed on the 0th-3rd items in the vector. (Or, at least, it behaves “as-if” the operation were only performed on those items, from an observability perspective.) So, the way control flow like this works is: all threads actually execute the “then” block, but every operation in the block is predicated on a bitmask which specifies which items of each vector operation should actually take effect. The “if” statement itself just modifies the bitmask.
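Here’s the same idea as a plain-C sketch, with an explicit mask variable standing in for the hardware bitmask, and a made-up increment standing in for doSomethingObservable():

#include <stdint.h>
#define WIDTH 8

void then_block_vectorized(const uint64_t threadID[WIDTH], uint64_t out[WIDTH]) {
    /* The "if" statement itself just computes the bitmask – 0xF here. */
    uint8_t mask = 0;
    for (int lane = 0; lane < WIDTH; lane++)
        if (threadID[lane] < 4)
            mask |= (uint8_t)(1u << lane);
    /* Every operation in the "then" block is predicated on the mask:
       lanes whose bit is 0 keep their old values, as if they never ran. */
    for (int lane = 0; lane < WIDTH; lane++)
        if (mask & (1u << lane))
            out[lane] = out[lane] + 1; /* stand-in for doSomethingObservable() */
}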
The Project
AVX-512 is an optional instruction set on some (fairly rare) x86_64 machines. The exciting thing about AVX-512 is that it adds support for this predication bitmask thing. It has a bunch of vector registers (512 bits wide, named zmm0 – zmm31) and it also adds a set of predication bitmask registers (k0 – k7). The instructions that act upon the vector registers can be predicated on the value of one of those predication registers, to achieve the effect of SIMT.
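As a minimal sketch (assuming AVX-512F, with an arbitrary comparison bound and an arbitrary constant), here is a predicated add written with compiler intrinsics. A __m512i value lives in a zmm register, a __mmask8 lives in a k register, and the masked intrinsic compiles down to a masked instruction along the lines of vpaddq zmm0{k1}, zmm1, zmm2:

#include <immintrin.h>

__m512i predicated_add(__m512i threadIDs, __m512i values) {
    /* k gets one bit per 64-bit lane: 1 where threadID < 4, 0 elsewhere. */
    __mmask8 k = _mm512_cmplt_epi64_mask(threadIDs, _mm512_set1_epi64(4));
    /* Masked add: lanes where k is 0 keep their old value from "values". */
    return _mm512_mask_add_epi64(values, k, values, _mm512_set1_epi64(1));
}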
It turns out I actually have a machine lying around in my home which supports AVX-512, so I thought I’d give it a go and implement a compiler for a toy language that performs the SIMT transformation, targeting the vector operations and predication registers. The purpose of this exercise isn’t to achieve incredible performance – there are lots of sophisticated compiler optimizations which I’m not interested in implementing – it’s a learning exercise. Hopefully, by implementing this transformation myself for a toy language, I can learn more about the kinds of things that real GPU compilers do.
The toy language is one I invented myself – it’s very similar to C, with some syntax that’s slightly easier to parse. Programs look like this:
function main(index: uint64): uint64 {
variable accumulator: uint64 = 0;
variable accumulatorPointer: pointer<uint64> = &accumulator;
for (variable i: uint64 = 0; i < index; i = i + 1) {
accumulator = *accumulatorPointer + i;
}
return accumulator;
}
It’s pretty straightforward. It doesn’t have things like ++ or +=. It also doesn’t have floating-point numbers (which is fine, because AVX-512 supports vector integer operations). It has pointers, for loops, continue & break statements, early returns… the standard stuff.
Tour
Let’s take a tour, and examine how each piece of a C-like language gets turned into AVX-512 SIMT. I implemented this so it can run real programs, and tested it somewhat rigorously – enough to be fairly convinced that it’s generally correct.
Variables and Simple Math
The most straightforward part of this system is variables and literal math. Consider:
variable accumulator: uint64;
This is a variable declaration. Each thread may store different values into the variable, so its storage needs to be a vector. No problem, right?
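Concretely, assuming eight 64-bit lanes (one zmm register’s worth), the storage for that declaration looks like this in C terms:

#include <stdint.h>

/* Scalar declaration: variable accumulator: uint64;
   SIMT storage: one 64-bit slot per thread, i.e. exactly one vector. */
uint64_t accumulator_storage[8]; /* slot i belongs to thread i */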
What if the variable’s type is a complex type? Consider:
struct Foo {
x: uint64;
y: uint64;
}
variable bar: Foo;
Here, we need to maintain the invariant that Foo.x has the same memory layout as any other uint64. This means that, rather than alternating x,y,x,y,x,y in memory, there instead has to be a vector of all the threads’ x values, followed by another vector of all the threads’ y values. This works recursively: if a struct has other structs inside it, the compiler will go through all the leaf types in the tree, turn each leaf type into a vector, and then lay them out in memory end-to-end.
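In C terms, the layout transformation looks roughly like this (eight lanes assumed; the FooForSIMT name is mine, for illustration):

#include <stdint.h>

struct Foo { /* one thread's view */
    uint64_t x;
    uint64_t y;
};

struct FooForSIMT { /* eight threads' views, laid out vector-by-vector */
    uint64_t x[8]; /* all eight threads' x values: one vector */
    uint64_t y[8]; /* all eight threads' y values: the next vector */
};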
Simple math is even more straightforward. Literal numbers have the same value no matter which thread you’re running, so they just turn into broadcast instructions. The program says “3” and the instruction that gets executed is “broadcast 3 to every item in a vector”. Easy peasy.
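In AVX-512 terms (the underlying instruction is vpbroadcastq):

#include <immintrin.h>

__m512i literal_three(void) {
    return _mm512_set1_epi64(3); /* [3, 3, 3, 3, 3, 3, 3, 3] */
}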
L-values and R-values
In a C-like language, every value is categorized as either an “l-value” or an “r-value”. An l-value has a location in memory; an r-value doesn’t. The value produced by the expression “2 + 3” is an r-value, but the value produced by the expression “*foo()” is an l-value: you dereferenced the pointer, so the thing the pointer points to is the resulting value’s location in memory. L-values can be assigned to; r-values cannot. So, you can say things like “foo = 3 + 4;” (because “foo” refers to a variable, which has a memory location) but you can’t say “3 + 4 = foo;”. That’s why they’re called “l-value” and “r-value” – l-values are legal on the left side of an assignment.
At runtime, every expression has to produce some value, which is consumed by its parent in the AST. E.g., in “3 * 4 + 5”, the “3 * 4” has to produce a “12” which the “+” will consume. The simplest way to handle l-values is to make them produce a pointer. This is so expressions like “&foo” work – the “foo” is an l-value and produces a pointer that points to the variable’s storage, and the & operator receives this pointer and produces that same pointer (unmodified!) as an r-value. The same thing happens in reverse for the unary * (“dereference”) operator: it accepts an r-value of pointer type and produces an l-value – which is just the pointer it received. This is how expressions like “*&*&*&*&*&foo = 7;” work (which is totally legal and valid C!): the “foo” produces a pointer, which the & operator accepts and passes through untouched to the *, which takes it and passes it through untouched, and so on, all the way to the final *, which produces the same pointer as an l-value pointing to the storage of foo.
The assignment operator knows that the thing on its left side must be an l-value and will therefore always produce a pointer, so that’s the storage the assignment stores into. The right side can be either an l-value or an r-value: if it’s an l-value, the assignment operator has to read from the storage it points to; if it’s an r-value, the assignment operator uses the value itself. This generalizes to every operation: it’s legal to say “foo + 3”, so the + operator needs to determine which of its operands are l-values (and therefore produced pointers instead of values) and react accordingly, loading from the storage those pointers point to.
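Here’s a sketch of how this might look inside a code generator. To be clear, none of these names come from my actual implementation – EmitResult, toRValue, and friends are invented for illustration:

#include <stdbool.h>

typedef int Register; /* stand-in for whatever names a temporary */

typedef struct {
    bool isLValue; /* true: reg holds a pointer to the value's storage */
    Register reg;  /* false: reg holds the value itself */
} EmitResult;

/* Stub: a real compiler would emit a load-through-pointer instruction
   here and return the register holding the loaded value. */
Register emitLoad(Register pointer) { return pointer; }

/* Normalize an operand to a plain value: l-values produced pointers, so
   they cost one extra load; r-values are used as-is. Operators like +
   and the right side of assignment call this on each operand. */
Register toRValue(EmitResult e) {
    return e.isLValue ? emitLoad(e.reg) : e.reg;
}

/* "&expr": take an l-value's pointer and relabel it, untouched, as an r-value. */
EmitResult emitAddressOf(EmitResult operand) {
    return (EmitResult){ .isLValue = false, .reg = operand.reg };
}

/* "*expr": take a pointer r-value and relabel it, untouched, as an l-value. */
EmitResult emitDereference(EmitResult operand) {
    return (EmitResult){ .isLValue = true, .reg = toRValue(operand) };
}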
All this stuff means that, even for simple programs where the author didn’t even spell the name “pointer” anywhere, the compiler still deals in pointers internally: every l-value is represented as one.