The Ormolu project, led by Mark Karpov at Tweag, has been providing a reliable way to format Haskell source files for a few years now. It makes use of the parser of GHC itself, so as to prevent parsing issues that unfortunately some other Haskell formatters have.
Alexander Esgen detailed several improvements he made to Ormolu during his internship in a previous blog post. In turn, this blog post will be focusing on the challenge of formatting infix operator chains in Haskell, which is notoriously difficult and has been my primary goal for the last two months.
Operator chains
Almost all programming languages allow the use of infix operators, especially for the four basic numeric operators, +
, -
, *
and /
. Moreover, most of the time, these operators can be “chained” together without parentheses, and in that case, fixity rules determine the meaning of an expression. Indeed, it’s really neat to write 1 + 2 * 5 * 6
instead of 1 + ((2 * 5) * 6)
, which would be required without such rules. From now on, I will call such a sequence of operands interleaved with infix operators an operator chain.
An operator’s fixity is composed of
- A precedence, which can be seen as a priority level. An infix operator with a higher precedence level will bind more tightly than an operator with a lower precedence level (e.g.
*
has precedence level 7, whereas+
has precedence level 6 in Haskell). - An associativity direction: an infix operator can be either left associative (
infixl
), right associative (infixr
), or non-associative (infix
) and indicates, in the absence of parentheses, in which order should the computations be made inside a group of operators having the same precedence level.
When several operators in a chain share the same precedence level (e.g. 1 + 2 - 3 - 4
), no operator has higher priority than the other, but we still need to order the computations. In order to do so, we need to look at the associativity direction of each operator.
The associativity direction is different from the mathematical definition of associativity: +
and *
are associative in the mathematical sense (because (a + b) + c = a + (b + c)
and (a * b) * c = a * (b * c)
), whereas -
and /
are not. However, all of these four operators are left associative, which means that, in a chain of +
and -
, or in a chain of *
and /
, the operations should be computed left-to-right: 1 + 2 - 3 - 4 = ((1 + 2) - 3) - 4
. On the other hand, the mathematical exponentiation, ^
, is right associative: 2^3^3
means 2^(3^3)
.
A non-associative operator cannot be chained with operators of the same precedence level, and often indicates that the operator has a different return type than those of its left and right operands (e.g. /=
is non-associative, which is a sensible choice given its signature Ord a => a -> a -> Bool
: 1 /= 2 /= 3
would make no sense). Likewise, two operators with the same precedence level, but with different associativity directions cannot be chained together without explicit parentheses. Fortunately, that particular combination is uncommon in practice.
Formatting issues with the previous version of Ormolu
According to Ormolu’s core principles, any single-line expression should stay single-line, and as a result, Ormolu doesn’t need to do much work for a single-line operator chain, except from checking that spaces are effectively put around each infix operator.
For multi-line operator chains, however, it’s a completely different story. As soon as one line break is introduced in an operator chain by the programmer, Ormolu is allowed to reformat the chain in a multi-line fashion, as long as the AST remains the same. But then, what should we do to produce the most readable and least surprising output?
In the previous version of Ormolu, operator chains weren’t represented by a sequence of operands interleaved with infix operators, but rather as a binary tree data OpTree = OpNode | OpBranch OpTree Operator OpTree
, directly extracted from the AST returned by ghc-lib-parser
.
At this point, it is crucial to remember that Haskell allows its users to define custom infix operators, in any module. As a result, the fixity of each infix operator can’t be known when a source file is parsed, because this operator could have been defined in any other source file of the project or of its dependencies. As a result, any operator chain is made into a degenerated binary tree in the AST, as if all operators had the same precedence level and were left associative.
1 + 4 * 1 - 2
becomes at the parsing stage:
-
/
* 2
/
+ 1
/
1 4
Ormolu used to format operator chains recursively in their binary tree form. This strategy implied that the deeper we got in the tree, the less information we had about the whole chain and its multi-line/single-line context. The most visible consequence of this strategy was the inconsistent number of operands on each line in a multi-line operator chain:
chain1 =
1 + 2 + 3 + 4
+ 5 + 6
would become
chain1 =
1 + 2 + 3 + 4
+ 5
+ 6
whereas this chain:
chain2 =
1 +
2 + 3 + 4 + 5 + 6
would become
chain2 =
1
+ 2
+ 3
+ 4
+ 5
+ 6
How to improve the situation
Two general styles are often considered for multi-line constructs (their names are taken from the JetBrains IDEs’ formatter):
- The wrap strategy, for which line breaks are introduced so as to meet a line length goal (for exemple, no more than 80 characters per line of code). As a result, the number of operands and operators on each line is variable, and the strategy is not really diff-friendly.
- The chop down strategy, where every pair of operator and operand sits on its own line, without taking length into account.
Ormolu has no line goal system, so the first strategy is definitely