When a user submits a query to a database, the optimizer translates the query string to an intermediate representation (IR) and applies various transformations to find the optimal execution plan.
Apache Calcite uses relational operators as the intermediate representation. In this blog post, we discuss the design of common relational operators in Apache Calcite.
Syntax Tree
Query optimization starts with parsing when a query string is translated into a syntax tree, which defines the syntactic structure of the query.
Since every database has a parser, the syntax tree might look like a good candidate for the intermediate representation because it is readily available to the database.
There are two significant problems with syntax tree as query’s IR:
- AST has a highly complicated structure, thanks to the involved ANSI SQL syntax. For example, a `SELECT` node may have dedicated child nodes for `FROM`, `WHERE`, `ORDER BY`, `GROUP BY`, etc.
- AST models the syntactic structure but not relational semantics. It could be problematic to map some valid relational transformations to the syntax tree. For example, a semi-join cannot be expressed easily with ANSI SQL syntax.
Combined, this makes query optimization over syntax trees challenging and not flexible.
Relational Tree
An alternative IR is a relational operator tree. We may define common relational operators, such as `Project`, `Filter`, `Join`, `Aggregate`. The query represented in such a way is much simpler to optimize because relational operators have a well-defined scope and usually have only one input (except for joins and set operators). This dramatically simplifies common relational optimizations, such as operator transposition. Also, it gives implementors flexibility to model operators independently of the database syntax rules.
The main disadvantage is the need to translate the syntax tree into a relational tree, which is often non-trivial, especially with complex syntax constructs like subqueries or common table expressions. However, the simplicity and flexibility of relational operators usually outweigh by a high margin the additional efforts on translation.
Apache Calcite parses the query into a syntax tree. Then it performs the semantic validation of the syntax tree using the SqlValidatorImpl class, resolving involved data types along the way. Finally, the validated syntax tree is converted into a tree or relational operators using the SqlToRelConverter class. The subsequent optimizations are performed on the relational tree.
In this section, we discuss the design of Apache Calcite relational operators.
Terminology
We start with several simplified definitions, which are not precise but sufficient for this blog post.
An attribute is a pair of a name and a data type. An attribute value is defined by an attribute name and value from the attribute type domain. A tuple is an unordered set of attribute values. No two attribute values in the tuple may have the same attribute name. A relation is a set of tuples. Every tuple within the relation has the same set of attributes. Relational operators take zero, one, or more input relations and produce an output relation.
Operators
To construct a tree of relational operators, we need the ability to define operator inputs. Many operators need access to attributes of the input relations. Therefore we also need the ability to reference input attributes. These are two key requirements for the relational operator interface.
In Apache Calcite, the relational operator is represented by the RelNode interface. The operator may have zero, one, or more input operators. For example, `TableScan` is an 0-ary operator, `Filter` is a unary operator, and `Union` is an N-ary operator. Every operator exposes the `RelDataType`, which is an ordered list of operator attributes. This is sufficient to construct arbitrarily complex relational trees.
Row Expressions
Operators describe various transformations to tuples. A RexNode interface defines an operation that applies to some attribute values of a tuple and produces another value. Common `RexNode` types:
- `RexLiteral` – a constant.
- `RexInputRef` – a reference to operator’s input attribute.
- `RexCall` – a function call.
For example, the expression `name = “John”` would be represented as follows.
Notice that `RexInputRef` refer