academic, developer, with an eye towards a brighter techno-social life
[prev]
[next]
2022-03-30
I wrote a peephole optimizer (for QBE), part 1
All source code for this blog post can be found here.
One of the things we did not tackle when writing our PL/0 compiler was an optimizer. Our code generator output C and then we let our C compiler handle the rest. If we wanted optimizations, we could turn them on in the C compiler. That’s great but it didn’t help us learn anything about optimizations.
I test a lot of C compilers when testing oksh. One of those compilers is called C11 compiler that compiles C into QBE and then has QBE transform its intermediate language into assembly, which then follows the usual pipeline of being run through the system assembler and ultimately the system linker.
We saw QBE before when we wrote our brainfuck compiler in QBE IL. When observing how QBE transforms its IL into assembly, we noticed that it missed a chance at optimization in the case where you moved a zero into a register. The generated assembly used a mov
instruction to move an immediate of zero into the register. But the fastest and smallest way to put a zero into a register (at least, on amd64 machines) is to xor
the register with itself.
I noticed the lastest cproc
, which uses the latest QBE, still fails to make this optimization. At one point in time, the author of cproc
sent me a diff for QBE that did make the optimization but I guess it was never applied to the QBE repository. As the savings for this optimization really can add up over a large project, and since it gave us an opportunity to look at peephole optimizers, I decided to write a small program that could be run in the cproc
pipeline in between QBE and the system assembler to fix up this optimization. Maybe we could find more optimizations as well.
A peephole optimizer
The simplest type of optimizer that we can write without having to dive into the vast literature on compiler optimizations is a peephole optimizer. In this technique, we read in some number of lines of assembly and perform basic pattern matching on those lines looking for instructions that we know can be replaced with better ones (or entirely eliminated). “Better” in this case brings up the first interesting question about optimization: better can be contextual based on your goals. Do you want instructions that take less time to execute? Do you want instructions that require fewer bytes to encode, making for a smaller program? Someting else? What is faster may not be smaller. What is smaller may not be faster. For us with our first optimization, we are lucky that the optimized version is both faster and smaller.
The peephole, or window, for our first optimization is just one line. There are some other peephole optimizations that can be performed with a peephole of just one line: removing useless instructions, such as moving a register into itself or adding zero to a register, immediately come to mind.
After we’ve completed the mov
to xor
optimization, we’ll tackle an additional optimization that requires a larger peephole to enable.
We will name our peephole optimizer O
in honor of the C compiler flag (-O
) that is commonly used to tell the C compiler to turn on its optimizer.
Choosing a language
I am going to write O
in C. Not because C is necessarily the best language for this task; it probably isn’t. But because if we write O
in C, then we can run cproc
on O
itself and determine the size savings on itself. I am also going to rely on direct knowledge of cproc
to make our lives easier when figuring out what strings to match on. There is also a little bit of argc
and argv
handling with that knowledge so that O
can be plugged into cproc
when we are finished.
With all of that said, our main
function looks like this:
static int usage(void) { (void) fputs("usage: O in.s [-o out.s]n", stderr); return 1; } int main(int argc, char *argv[]) { FILE *fp; if (argc == 4) { if (strcmp(argv[2], "-o") != 0) return usage(); if (freopen(argv[3], "w+", stdout) == NULL) { (void) fprintf(stderr, "O: error: couldn't open %sn", argv[3]); } } else if (argc != 2) { return usage(); } if (!strcmp(argv[1], "-")) { O(stdin); return 0; } if ((fp = fopen(argv[1], "r")) == NULL) { (void) fprintf(stderr, "O: error: couldn't open %sn", argv[1]); return 1; } O(fp); (void) fclose(fp); return 0; }
Because of how cproc
works, when the final stage of the compilation is to output assembly (-S
), cproc
automatically appends -o [file.s]
so O
needs to support that. If cproc
is going to write an object file or binary, then O
will write to stdout
and the assembler will read the output in via a pipe. Additionally, input from cproc
is always stdin
, demarcated as -
, so we need to support that too.
To avoid too much complexity, if we do have to output to a file, we can freopen(3)
that file to stdout
. Most of the time, we will be outputting to stdout
so it makes sense to make outputting to a file the special case.
Reading in one line at a time
We can use the getline(3)
function to read in one line at a time into a buffer. While the getline(3)
function is “new” with POSIX.1-2008, I think all modern Unix systems have the function. If your system does not, you can get a copy of getline(3)
here.
Reading in one line at a time might look something like this:
static void O(FILE *fp) { char *line = NULL; size_t size = 0; while (getline(&line, &size, fp) != -1) (void) fputs(line, stdout); free(line); }
Yes, you really do need the call to free(3)
at the end there.
This would read in one line at a time and then immediately print it out. Of course, we need to perform a check to match for potential lines to replace, so let’s improve this a bit:
static void one(const char *line) { if (xorq(line)) return; if (xorl(line)) return; (void) fputs(line, stdout); } static void O(FILE *fp) { char *line = NULL; size_t size = 0; while (getline(&line, &size, fp) != -1) one(line); free(line); }
Now we are reading in one line at a time, and checking if it’s a movq
line that can be replaced with an xorq
line. If not, perhaps it is a movl
line that can be replaced with an xorl
line. If not, then it is not a line that can be optimized this way so we should just print it out as-is.
Find and replace
Because we know exactly what QBE is going to output, we can use that to our advantage. We don’t need to tokenize the output; we can simply check for exact strings. That saves us a lot of logic. A movq
line that we want to match will always look like this:
movq $0, %r__
Where __
reflects any of the 64-bit registers. Therefore, we can match exactly that and since that’s the only form this line can be in, it is the only thing we need to check for. We want to replace that movq
line with this:
xorq %r__, %r__
If we were to write that as a function, it might look like this:
static int xorq(const char *line) { if (!strncmp("tmovq $0, %r", line, 12)) { (void) fprintf(stdout, "txorq %%r%c%c, %%r%c%cn", line[12], line[13], line[12], line[13]); return 1; } return 0; }
If you wanted to be paranoid, you could check that strlen(line) > 13
before calling fprintf
.
We check to see if the first twelve characters of the line match a movq
line that would be better if it was optimized into a xorq
line. If it does, we instead write the equivalent xorq
line. We are able to use the fact tha