
by Semih Salihoğlu, Feb 22nd, 2023

Joining sets of records is objectively the most expensive operation in DBMSs. In my previous post on factorization, I said that in the field of databases, once in a while you run into a very simple idea that deviates from the norm and gets you very excited. Today, I will discuss another such idea: worst-case optimal join (wcoj) algorithms. Wcoj algorithms and the theory around them, in one sentence, say this:

  • Queries involving complex “cyclic joins” over many-to-many relationships should be evaluated column at a time instead of table at a time, which is the norm. Wcoj algorithms find their best applications when searching for cyclic patterns in graphs, such as cliques or cycles, which is common in the workloads of fraud detection and recommendation applications. As such, they should be integrated into every graph DBMS (and possibly into RDBMSs), and I am convinced that they eventually will be.

Tldr: The key takeaways are:

  • History of Wcoj Algorithms: Research on wcoj algorithms started with a solution to an open question about the maximum sizes of join queries. This result made researchers realize this: the traditional “binary join plans” paradigm of generating query plans that join 2 tables at a time until all of the tables in the query are joined is provably suboptimal for some queries. Specifically, when join queries are cyclic, which in graph terms means when the searched graph pattern has cycles in it, and the relationships between records are many-to-many, this paradigm can generate unnecessarily large amounts of intermediate results.
  • Core Algorithmic Step of Wcoj Algorithms: Wcoj algorithms fix this sub-optimality by performing the joins one column at a time (instead of 2 tables at a time) using multiway intersections.
  • How Kùzu Integrates Wcoj Algorithms: Kùzu generates plans that seamlessly mix binary joins and wcoj-style multiway intersections. Multiway intersections are performed by an operator called “multiway HashJoin”, which has one or more build phases that create one or more hash tables storing sorted adjacency lists, and a probe phase that performs multiway intersections using the sorted lists (see the sketch right after this list).
  • Yes, the Term “Worst-case Optimal” Is Confusing Even to Don Knuth: I know, Don Knuth also found the term “worst-case optimal” a bit confusing. See my anecdote on this. It basically means that the worst-case runtimes of these algorithms are asymptotically optimal.
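
To make the column-at-a-time idea concrete, below is a minimal Python sketch of a wcoj-style evaluation of the triangle query that serves as the running example later in this post. This is my own illustration under simplifying assumptions (edges as an in-memory list of pairs), not Kùzu’s actual multiway HashJoin operator: the last join variable is bound by intersecting two sorted adjacency lists instead of joining whole tables.

Python (sketch):
from collections import defaultdict

def sorted_intersect(xs, ys):
    # Linear merge-style intersection of two sorted lists.
    i, j, out = 0, 0, []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i]); i += 1; j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out

def wcoj_triangles(edges):
    # Enumerate directed triangles a->b->c->a one join variable at a time.
    fwd, bwd = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        fwd[src].append(dst)   # users that src follows
        bwd[dst].append(src)   # users that follow dst
    for adj in list(fwd.values()) + list(bwd.values()):
        adj.sort()
    out = []
    for a in fwd:                  # bind variable a
        for b in fwd[a]:           # bind variable b: edge a->b exists
            # Bind variable c by intersecting two sorted adjacency lists:
            # c must be followed by b (c in fwd[b]) and must follow a (c in bwd[a]).
            # For larger cyclic patterns this becomes a k-way intersection.
            for c in sorted_intersect(fwd.get(b, []), bwd.get(a, [])):
                out.append((a, b, c))
    return out

# wcoj_triangles([(1, 2), (2, 3), (3, 1)]) returns [(1, 2, 3), (2, 3, 1), (3, 1, 2)]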

Joins, Running Example & Traditional Table-at-a-time Joins

Joins are objectively the most expensive and powerful operation in DBMSs. In SQL, you indicate them in the FROM clause by listing a set of table names; in Cypher, you indicate them in the MATCH clause, where you draw a graph pattern to describe how to join node records with each other. As a running example, consider a simple social network of users and followers, whose node-link diagram is shown below. I am also showing the tables that contain these records: a User table (ignore the name property for now) and a Follows table.

Consider finding triangles, one of the simplest forms of cycles and cliques, in this network. The SQL and Cypher versions of this query are shown below.

SQL:
SELECT *
FROM Follows f1, Follows f2, Follows f3
WHERE f1.dst = f2.src AND f2.dst = f3.src AND
      f3.dst = f1.src

Cypher:
MATCH (a:User)-[f1:Follows]->(b:User)-[f2:Follows]->(c:User)-[f3:Follows]->(a)
RETURN  *

That long MATCH clause “draws” a triangle, and for our case here it is equivalent to joining three copies of the Follows table.

Now ever since the System R days and Patricia Selinger’s 1979 seminal paper that described how System R compiled and optimized SQL queries, there has been an unchallenged dogma in DBMSs that the joins specified in the query would be evaluated pairwise, table at a time. Here’s a blurb from Selinger’s paper, where one can see this assumption: “In System R a user need not know how the tuples are physically stored … Nor does a user specify in what order joins are to be performed. The System R optimizer chooses both join order and …” To this day, this is the norm. DBMSs pick a “join order”, which is the order in which the tables should be joined iteratively, 2 at a time. In the above example, there are three possible join orders. One way to represent these orders is by writing different parenthesizations of the joins:

  • (i) $((F_1 \bowtie F_2) \bowtie F_3)$; (ii) $(F_1 \bowtie (F_2 \bowtie F_3))$; and (iii) $((F_1 \bowtie F_3) \bowtie F_2)$.

The optimization problem for a system is of course more complex than just ordering tables, because the system also has to choose which binary join algorithm to use when joining each pair of tables, e.g., hash joins vs. merge joins. But take any system you want, and it will follow the same paradigm of joining 2 base or intermediate tables iteratively until all tables are joined: hence the term binary joins to describe the plans of existing systems.
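
For contrast with the wcoj sketch above, here is an equally small Python illustration of what a binary join plan such as $((F_1 \bowtie F_2) \bowtie F_3)$ does for the triangle query. Again, this is my own sketch rather than any system’s actual plan: the first join materializes every 2-path before the cycle is closed.

Python (sketch):
def binary_join_triangles(follows):
    follows = set(follows)
    # F1 join F2 on f1.dst = f2.src: materializes all 2-paths a->b->c.
    two_paths = [(a, b, c)
                 for (a, b) in follows
                 for (b2, c) in follows
                 if b == b2]
    # (F1 join F2) join F3 on f2.dst = f3.src and f3.dst = f1.src:
    # keep only the 2-paths that close into a triangle.
    return [(a, b, c) for (a, b, c) in two_paths if (c, a) in follows]

If some user is followed by $k$ users and follows $k$ other users, the first step already produces $k^2$ 2-paths, even if none of them closes into a triangle; this is the kind of intermediate-result blow-up discussed next.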

A Math Puzzle That Started it All

So, what’s the problem with binary join plans? When join queries are cyclic and the relationships are many-to-many, they can generate provably large amounts of intermediate results that are, in a formal sense, unnecessary. First, cyclicity for join queries has formal (and a bit intimidating) definitions, but if you think of graph patterns, it simply means that the searched pattern’s undirected version has cycles. Why do binary joins generate unnecessarily large intermediate results? I’ll get to this below, but first a bit of history on the origins of this insight. The whole topic of “worst-case optimal joins” started with 2 papers, a 2007 SODA paper and a 2008 FOCS paper (SODA and FOCS are top venues in algorithms and theory), in which several theoreticians solved a fundamental open question about join queries. Suppose I give you:

  1. An arbitrary natural join query, say of $m$ relations. In DBMS literature we denote such queries as $Q = R_1(a_{11}, \ldots, a_{1r_1}) \bowtie \ldots \bowtie R_m(a_{m1}, \ldots, a_{mr_m})$.
  2. The sizes of $R_1, \ldots, R_m$, e.g., for simplicity assume they all contain $IN$ many tuples.

“Natural” here means that the join predicates are equality predicates on identical column names. You, as the second person in this puzzle, are allowed to set the values inside these relations. The open question was: how large can you make the final output? So for example, if I told you that there are $IN$ many tuples in the Follows table, what is the maximum number of triangle outputs there can be?1 Even more concretely for the triangle query, the question is: out of all possible graphs with $IN$ many edges, what is the maximum number of triangles they contain?

It still surprises me that the answer to this question was not known until 2008. It just looks like a fundamental question someone in databases must have answered before. Now excuse me for bombarding your brains with some necessary math definitions. These two papers showed that the answer is $IN^{\rho^*}$, where $\rho^*$ is a property of $Q$ called the fractional edge cover number of $Q$. This is the solution to an optimization problem and is best explained by thinking about the “join query graph”, which, for our purposes, is the triangle graph pattern (ignoring the edge directions), shown in Fig 2a and 2b.

The optimization problem is this: put a weight in $[0, 1]$ on each “query edge” such that each “query node” is “covered”, i.e., the sum of the weights of the query edges touching each query node is at least 1. Each such solution is called an edge cover. The problem is to find the edge cover with the minimum possible total weight, and $\rho^*$ is that minimum total weight.
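
To make this concrete with the triangle query (a small worked instance; I use $w_1, w_2, w_3$ for the weights of the three query edges $F_1, F_2, F_3$): node $a$ is touched by $F_1$ and $F_3$, node $b$ by $F_1$ and $F_2$, and node $c$ by $F_2$ and $F_3$, so the constraints are $w_1 + w_3 \geq 1$, $w_1 + w_2 \geq 1$, and $w_2 + w_3 \geq 1$. Putting a weight of $1/2$ on each query edge satisfies all three constraints with total weight $3/2$, and no cover can do better, so $\rho^* = 3/2$: a graph with $IN$ edges can contain at most $IN^{1.5}$ triangles.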