why compile queries?
Database query engines used to be able to assume that disk latency was so high that the overhead of interpreting the query plan didn’t matter. Unfortunately these days a cheap nvme ssd can supply data much faster than a query interpreter can process it.
The problem is that most of the cpu time is spent on the overhead of the interpreter itself – figuring out which operator is next, moving data between operators, etc – rather than on the actual computation of the query. Depending on the query, removing the interpreter bottleneck can yield several orders of magnitude speedup.
For OLAP-ish queries, which tend to consist of summarizing data via full table scans over compressed column-oriented data, the standard solution is to ‘vectorize’ the interpreter – run each operator on a batch of rows instead of a single row. Then we only have to pay the interpreter overhead per batch rather than per row. This is a really well understood design and the industry is converging towards a standard set of data formats (eg arrow) and libraries (eg datafusion, velox) that work more or less out of the box.
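The core idea can be sketched in a few lines – these operator names and the list-based 'columns' are made up for illustration, not any particular engine's API:

```python
# Row-at-a-time vs vectorized interpretation of a single operator.

def add_row(a, b):
    # Row-at-a-time: the interpreter's dispatch overhead is paid once per row.
    return a + b

def interpret_rows(col_a, col_b):
    out = []
    for a, b in zip(col_a, col_b):
        out.append(add_row(a, b))  # one dynamic dispatch per row
    return out

def add_batch(col_a, col_b):
    # Vectorized: one dispatch per batch, then a tight inner loop
    # over a single operator (which a real engine would hand to
    # simd-friendly native code).
    return [a + b for a, b in zip(col_a, col_b)]

def interpret_batches(col_a, col_b, batch_size=1024):
    out = []
    for i in range(0, len(col_a), batch_size):
        out.extend(add_batch(col_a[i:i + batch_size],
                             col_b[i:i + batch_size]))
    return out
```

Both produce the same result; the difference is only where the interpreter overhead lands – per row in the first version, per batch in the second.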
For OLTP-ish queries, which tend to consist of searching and connecting sparse data using index lookups and hash-joins over row-oriented storage, vectorization doesn’t work as well because the inner loop of each operator still has to have dynamic parameters specifying how long a row is and how to extract the value we’re operating on. We can’t pull that interpreter overhead out because there are an infinite number of possible combinations of parameters.
Instead some OLTP databases (eg sqlserver, singlestore, redshift, oracle) compile queries to native code where those parameters can be hard-coded, allowing the values in each row to be kept in individual cpu registers throughout an entire pipeline.
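As a toy illustration of what 'hard-coding the parameters' means – here the generated code is just a Python string and the row layout is a hypothetical fixed-width encoding, where a real engine would emit native code via something like llvm:

```python
# Generic interpretation: the column's offset and width come from the
# schema, so every row pays for the indirection.
def interpret_filter(rows, schema, col, value):
    offset, width = schema[col]              # looked up once per query...
    out = []
    for row in rows:
        field = row[offset:offset + width]   # ...but applied per row
        if field == value:
            out.append(row)
    return out

# 'Compilation': bake the offset, width, and comparison value into the
# generated source as constants, so the inner loop has no dynamic
# parameters left.
def compile_filter(schema, col, value):
    offset, width = schema[col]
    src = (
        "def compiled(rows):\n"
        f"    return [r for r in rows if r[{offset}:{offset + width}] == {value!r}]\n"
    )
    env = {}
    exec(src, env)
    return env["compiled"]
```

The compiled version does the same work, but the constants are visible to the compiler, which is what lets values stay in registers through a whole pipeline.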
Compiling queries also makes it easier to integrate external functions, foreign data wrappers, and procedural languages like PL/pgSQL directly into the generated query code, rather than trying to figure out how to vectorize code you didn’t write.
But where vectorized interpreters are becoming a commoditized component, each query compiler is a beautiful snowflake written from scratch. They are also kinda going out of fashion for reasons we’ll get into below.
why not compile queries?
Compiling a query is a bit different from compiling a regular program. Typically a program is compiled once and then run many times, so it’s worth spending a lot of time on optimization at compile-time to save time/money/energy at runtime. But a query might only be run once, so it’s only worth spending time on optimizations at compile-time if doing so will save even more time at runtime.
This is hard to get right! We see various workarounds:
- Make the user manually specify which queries to compile.
- Try to estimate at planning time whether query compilation is worthwhile.
- Start running the query in an interpreter and switch to the compiled program when it’s ready.
But it’s hard to avoid the core dilemma that interpreters run really slowly and llvm compiles really slowly, so we’re navigating around a very sharp and abrupt tradeoff. This creates unpredictable performance cliffs.
For example, a few years ago postgres added a jit compiler for scalar expressions. This update crashed the uk coronavirus dashboard! Aws and azure now completely disable the jit compiler by default.
Databases can cache the result of compiling a query but it doesn’t fully solve the compile-time problem. Consider these two queries:
SELECT * FROM movies WHERE language = 'english' AND country = 'usa'
SELECT * FROM movies WHERE language = 'swahili' AND country = 'spain'
The first query will return a substantial fraction of the database and is probably best planned as a table scan. The second query will return a tiny number of results and is probably best planned as an index lookup on language. So even though the two queries have the same structure, different parameters can produce different plans, which means we have to compile different code.
So even for OLTP workloads many databases still opt for vectorized interpreters instead of compilers, especially in recent years. There is a general perception that vectorized interpreters are easier to build and debug, and produce more predictable performance.
meanwhile in browserland
The other big place where people care about the sum of compile-time and runtime is in the browser when executing javascript/wasm. All the browsers have settled on fairly similar architectures which look roughly like:
- An interpreter, which operates either directly over the ast or over a simple bytecode.
- A baseline compiler, which focuses on emitting code as quickly as possible in a single pass.
- An optimizing compiler, which performs traditional compiler optimizations.
Programs begin running in the interpreter and then switch to compiled code when it’s available.
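The tiering logic itself is simple – a toy sketch, with a synchronous stand-in for compilation where real engines compile in the background and swap the result in:

```python
class TieredFunction:
    """Run via the 'interpreter' until the call count crosses a
    threshold, then switch to the 'compiled' version."""

    def __init__(self, interpret, compile_fn, threshold=3):
        self.interpret = interpret
        self.compile_fn = compile_fn   # produces the compiled version
        self.threshold = threshold
        self.calls = 0
        self.compiled = None

    def __call__(self, *args):
        if self.compiled is not None:
            return self.compiled(*args)
        self.calls += 1
        if self.calls >= self.threshold:
            # Function is hot: compile it. (Real engines do this off
            # the critical path and may also switch mid-function.)
            self.compiled = self.compile_fn()
        return self.interpret(*args)
```

The interesting engineering is in everything this sketch elides: compiling without blocking execution, and transferring live state (an on-stack-replacement-style switch) so a long-running loop doesn't stay stuck in the interpreter.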
The details vary, of course. Some browsers can only switch to compiled code at function call boundaries whereas others are able to switch in the middle of a function (eg during a long-running loop). Javascript backends tend to only optimize hot functions, whereas wasm backends are more likely to run all functions through at least the baseline compiler.