Intro
This document describes a stable bottom-up adaptive branchless merge sort named quadsort. A visualisation and benchmarks are available at the bottom.
The quad swap
At the core of quadsort is the quad swap. Traditionally most sorting
algorithms have been designed using the binary swap where two variables
are sorted using a third temporary variable. This typically looks as follows:
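```c
// A binary swap: two variables sorted using a third temporary variable.
if (val[0] > val[1])
{
    tmp = val[0];
    val[0] = val[1];
    val[1] = tmp;
}
```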
Instead, the quad swap sorts four variables at once. During the first
stage the four variables are partially sorted in the four
swap variables, in the second stage they are fully sorted back to the
original four variables.
[Diagram: in the first stage the four variables A are pair-sorted into the swap variables S; a single comparison of the middle pair then either finishes the swap immediately, detects reverse order, or leaves two comparisons to determine the final order F.]
After the first round of sorting, a single if check determines whether the four
swap variables are sorted in order; if that's the case, the swap
finishes up immediately. Next it checks whether the swap variables
are sorted in reverse order; if that's the case, the swap finishes up
immediately as well. If both checks fail, the arrangement is partially known
and two more checks determine the final order.
This eliminates 1 wasteful comparison for in-order sequences while creating
1 additional comparison for random sequences. However, in the real world we
are rarely comparing truly random data, so in any instance where data is
more likely to be orderly than disorderly this shift in probability will give
an advantage.
In C the basic quad swap looks as follows:
```c
// Stage one: sort each pair into the swap variables.
if (val[0] > val[1])
{
    swap[0] = val[1];
    swap[1] = val[0];
}
else
{
    swap[0] = val[0];
    swap[1] = val[1];
}

if (val[2] > val[3])
{
    swap[2] = val[3];
    swap[3] = val[2];
}
else
{
    swap[2] = val[2];
    swap[3] = val[3];
}

// Stage two: in-order check, reverse-order check, then two final checks.
if (swap[1] <= swap[2])
{
    val[0] = swap[0];
    val[1] = swap[1];
    val[2] = swap[2];
    val[3] = swap[3];
}
else if (swap[0] > swap[3])
{
    val[0] = swap[2];
    val[1] = swap[3];
    val[2] = swap[0];
    val[3] = swap[1];
}
else
{
    if (swap[0] <= swap[2])
    {
        val[0] = swap[0];
        val[1] = swap[2];
    }
    else
    {
        val[0] = swap[2];
        val[1] = swap[0];
    }

    if (swap[1] <= swap[3])
    {
        val[2] = swap[1];
        val[3] = swap[3];
    }
    else
    {
        val[2] = swap[3];
        val[3] = swap[1];
    }
}
```
In-place quad swap
There are however several problems with the simple quad swap above. If an array is already fully sorted it writes a lot of data back and forth from swap unnecessarily. If an array is fully in reverse order it will change 8 7 6 5 4 3 2 1 to 5 6 7 8 1 2 3 4 which reduces the degree of orderliness rather than increasing it.
To solve these problems the quad swap needs to be implemented in-place.
Reverse order handling
Reverse order data is typically handled using a simple reversal function, as follows:
```c
// Reverse the elements between start and end in a single pass.
void reverse(int array[], int start, int end, int swap)
{
    while (start < end)
    {
        swap = array[start];
        array[start++] = array[end];
        array[end--] = swap;
    }
}
```
While random data can only be sorted using n log n comparisons and
n log n moves, reverse-order data can be sorted using n comparisons
and n moves through run detection. Without run detection the best you
can do is sort it in n comparisons and n log n moves.
Run detection, as the name implies, comes with a detection cost. Thanks
to the laws of probability a quad swap can cheat however. The chance of
4 random numbers having the order 4 3 2 1 is 1 in 24. So when sorting
random data we'll only make a wasteful run check in 4.16% of cases.
What about run detection for in-order data? While we're turning
n log n moves into n moves with reverse order run detection, we'd be
turning 0 moves into 0 moves with forward run detection. So there's
no point in doing so.
The next optimization is to write the quad swap in such a way that we can
perform a simple check to see if the entire array was in reverse order,
if so, the sort is finished.
One final optimization: reverse order handling is only beneficial on
runs longer than 4 elements. When no reverse order run is detected
the next 4 elements are merged with the first 4 elements. This reduces
the chance of a wasteful run check to 2.08%.
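As an illustration (not quadsort's exact internals), a detected reverse run could be extended and then undone in one pass using the reverse routine above; the function and variable names below are hypothetical:

```c
// Extend a descending run that starts at `start` (the quad itself is
// already known to be descending) and reverse it in a single pass.
// Returns the index just past the run. Illustrative sketch only.
int handle_reverse_run(int *array, int start, int nmemb)
{
    int end = start + 4;

    while (end < nmemb && array[end - 1] > array[end])
    {
        end++;
    }
    reverse(array, start, end - 1, 0);

    return end;
}
```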
At the end of the loop the array has been turned into a series of ordered
blocks of 8 elements.
Ping-Pong Quad Merge
Most textbook mergesort examples merge two blocks to swap memory, then copy
them back to main memory.
```
main memory ┌────────┐┌────────┐
            └────────┘└────────┘
                 ↓ merge ↓
swap memory ┌──────────────────┐
            └──────────────────┘
                  ↓ copy ↓
main memory ┌──────────────────┐
            └──────────────────┘
```
This doubles the number of moves. We can fix this by merging 4 blocks at once
using a quad merge / ping-pong merge like so:
```
main memory ┌────────┐┌────────┐┌────────┐┌────────┐
            └────────┘└────────┘└────────┘└────────┘
                 ↓ merge ↓           ↓ merge ↓
swap memory ┌──────────────────┐┌──────────────────┐
            └──────────────────┘└──────────────────┘
                           ↓ merge ↓
main memory ┌──────────────────────────────────────┐
            └──────────────────────────────────────┘
```
It is possible to interleave the two merges to swap memory for increased memory-level parallelism, but this can both increase and decrease performance.
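A minimal non-interleaved sketch of the idea, assuming a plain two-way merge helper; the function names are illustrative, not quadsort's internals:

```c
// Merge two adjacent sorted runs of length `len` from `from` into `dest`.
void two_way_merge(const int *from, int *dest, int len)
{
    const int *l = from, *r = from + len;
    const int *l_end = from + len, *r_end = from + 2 * len;

    while (l < l_end && r < r_end)
    {
        *dest++ = *l <= *r ? *l++ : *r++;
    }
    while (l < l_end) *dest++ = *l++;
    while (r < r_end) *dest++ = *r++;
}

// Ping-pong merge: four sorted blocks of length `len` in main memory become
// one sorted block of 4 * len with a single round trip through swap memory.
void ping_pong_merge(int *main_mem, int *swap_mem, int len)
{
    two_way_merge(main_mem, swap_mem, len);                     // blocks 1+2 to swap
    two_way_merge(main_mem + 2 * len, swap_mem + 2 * len, len); // blocks 3+4 to swap
    two_way_merge(swap_mem, main_mem, 2 * len);                 // two halves back to main
}
```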
Skipping
Just like with the quad swap it is beneficial to check whether the 4 blocks
are in-order.
In the case of the 4 blocks being in-order the merge operation is skipped,
as this would be pointless. Because reverse order data is handled in the
quad swap there is no need to check for reverse order blocks.
This allows quadsort to sort in-order sequences using n comparisons instead
of n log n comparisons.
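A sketch of such a skip check for four consecutive sorted blocks of length `len`; the names are illustrative:

```c
#include <stddef.h>

// Returns 1 when four consecutive sorted blocks of length `len` are
// already in order: three boundary comparisons make the merge skippable.
int blocks_in_order(const int *block, size_t len)
{
    return block[len * 1 - 1] <= block[len * 1] &&
           block[len * 2 - 1] <= block[len * 2] &&
           block[len * 3 - 1] <= block[len * 3];
}
```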
Parity merge
A parity merge takes advantage of the fact that if you have two n length arrays,
you can fully merge the two arrays by performing n merge operations on the start
of each array, and n merge operations on the end of each array. The arrays must
be of exactly equal length.
The main advantage of a parity merge over a traditional merge is that the loop
of a parity merge can be fully unrolled.
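A sketch of the concept for two blocks of equal length, illustrative rather than quadsort's exact routine: because the loop runs exactly `len` times regardless of the data, it can be fully unrolled.

```c
// Parity merge: fully merge two sorted arrays of equal length `len` into
// `dest` using exactly `len` head operations and `len` tail operations.
void parity_merge(int *dest, const int *left, const int *right, int len)
{
    const int *l = left, *r = right;                       // heads
    const int *tl = left + len - 1, *tr = right + len - 1; // tails
    int *d = dest, *td = dest + 2 * len - 1;

    for (int i = 0; i < len; i++)
    {
        *d++  = *l <= *r ? *l++ : *r++;    // next smallest from the front
        *td-- = *tl > *tr ? *tl-- : *tr--; // next largest from the back
    }
}
```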
If the arrays are not of equal length a hybrid parity merge can be performed. One
way to do so is using n parity merges where n is the size of the smaller array,
before switching to a traditional merge.
Branchless parity merge
Since the parity merge can be unrolled it's very suitable for branchless
optimizations to speed up the sorting of random data. Another advantage
is that two separate memory regions are accessed in the same loop, allowing
memory-level parallelism. This makes the routine up to 2.5 times faster for
random data on most hardware.
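Under the same assumptions as the sketch above, a branchless variant materializes each comparison as 0 or 1 and uses it for the pointer bumps, so the only branch left is the fixed-count loop:

```c
// Branchless parity merge: the comparison results drive the pointer
// arithmetic, so no branch depends on the element values.
void parity_merge_branchless(int *dest, const int *left, const int *right, int len)
{
    const int *l = left, *r = right;
    const int *tl = left + len - 1, *tr = right + len - 1;
    int *d = dest, *td = dest + 2 * len - 1;

    for (int i = 0; i < len; i++)
    {
        int x = *l <= *r;      // 1: take from the left head, 0: from the right
        *d++ = x ? *l : *r;    // typically compiled to a conditional move
        l += x;
        r += !x;

        int y = *tl > *tr;     // 1: take from the left tail, 0: from the right
        *td-- = y ? *tl : *tr;
        tl -= y;
        tr -= !y;
    }
}
```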
Increasing the memory regions from two to four can result in both performance
gains and performance losses.
The following is a visualization of an array with 256 random elements getting
turned into sorted blocks of 32 elements using ping-pong parity merges.
Quad galloping merge
While a branchless parity merge sorts random data faster, it sorts ordered data
slower. One way to solve this problem is by using a method with a resemblance
to the galloping merge concept first introduced by timsort.
The quad galloping merge works in a similar way to the quad swap.
Instead of merging the ends of two arrays two items at a time, it merges
four items at a time.
[Diagram: merging the runs A B C and X Y Z. A B ≤ X check copies A and B to swap, an A > Y check copies X and Y, and otherwise a branchless merge handles the two remaining distributions.]
When merging ABC and XYZ it first checks if B is smaller than or equal to X. If
that's the case, A and B are copied to swap. If not, it checks if A is greater
than Y. If that's the case, X and Y are copied to swap.
If both checks fail it's known that the two remaining distributions
are A X and X A. This allows performing an optimal branchless merge. Since
it's known each block still has at least 1 item remaining (B and Y) and
there is a high probability of the data being random, another (sub-optimal)
branchless merge can be performed.
In C this looks as follows:
```c
while (l < l_size - 1 && r < r_size - 1)
{
    if (left[l + 1] <= right[r])
    {
        // B <= X: copy two items from the left block
        swap[s++] = left[l++];
        swap[s++] = left[l++];
    }
    else if (left[l] > right[r + 1])
    {
        // A > Y: copy two items from the right block
        swap[s++] = right[r++];
        swap[s++] = right[r++];
    }
    else
    {
        // mixed case: branchless merge of one item from each block
        a = left[l] > right[r];
        x = !a;
        swap[s + a] = left[l++];
        swap[s + x] = right[r++];
        s += 2;
    }
}
```
Overall the quad galloping merge gives a slight performance gain for both ordered and random data.
Merge strategy
Quadsort will merge blocks of 8 into blocks of 32, which it will merge into
blocks of 128, 512, 2048, 8192, etc.
For each ping-pong merge quadsort will perform two comparisons to see if it will be faster
to use a parity merge or a quad galloping merge, and pick the best option.
Tail merge
When sorting an array of 33 elements you end up with a sorted array of 32
elements and a sorted array of 1 element at the end. If a program sorts in
intervals it should pick an optimal array size (32, 128, 512, etc.) to do so.
To minimize the impact, the remainder of the array is sorted using a tail
merge.
The main advantage of a tail merge is that it allows reducing the swap
space of quadsort to n / 2 and that the galloping merge strategy works best
on arrays of different lengths. It also greatly simplifies the ping-pong
quad merge routine which only needs to work on arrays of equal length.
Rotate merge
By using rotations the swap space of quadsort is reduced further from n / 2
to n / 4. Rotations can be performed with minimal performance loss by using
monobound binary searches and trinity / bridge rotations.
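As an illustration of the first building block, here is a sketch of a monobound binary search (the exact details in quadsort may differ): it shrinks a single size variable instead of two converging bounds, which gives a fixed iteration count for a given array size.

```c
#include <stddef.h>

// Monobound binary search: returns the number of elements in the sorted
// array that are less than or equal to `key`, i.e. a stable insertion
// point suitable for a rotate merge.
size_t monobound(const int *array, size_t size, int key)
{
    size_t bot = 0, top = size;

    while (top > 1)
    {
        size_t mid = top / 2;

        if (key >= array[bot + mid])
        {
            bot += mid;
        }
        top -= mid; // the remaining range always covers the answer
    }
    if (size && key >= array[bot])
    {
        bot++;
    }
    return bot;
}
```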
Big O
```
                 ┌───────────────────────┐┌───────────────────────┐
                 │comparisons            ││swap memory            │
┌───────────────┐├───────┬───────┬───────┤├───────┬───────┬───────┤┌──────┐┌─────────┐┌─────────┐
│name           ││min    │avg    │max    ││min    │avg    │max    ││stable││partition││adaptive │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│mergesort      ││n log n│n log n│n log n││n      │n      │n      ││yes   ││no       ││no       │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│quadsort       ││n      │n log n│n log n││1      │n      │n      ││yes   ││no       ││yes      │
├───────────────┤├───────┼───────┼───────┤├───────┼───────┼───────┤├──────┤├─────────┤├─────────┤
│quicksort      ││n log n│n log n│n²     ││1      │1      │1      ││no    ││yes      ││no       │
└───────────────┘└───────┴───────┴───────┘└───────┴───────┴───────┘└──────┘└─────────┘└─────────┘
```
Quadsort makes n comparisons when the data is fully sorted or reverse sorted.
Data Types
The C implementation of quadsort supports long doubles and 8, 16, 32, and 64 bit data types. By using pointers it's possible to sort any other data type, like strings.
Interface
Quadsort uses the same interface as qsort, which is described in man qsort.
In addition to supporting `(l - r)` and `((l > r) - (l < r))` for the comparison
function, `(l > r)` is valid as well. Special note should be taken that C++ sorts
use `(l < r)` for the comparison function, which is incompatible with the C
standard. When porting quadsort to C++ or Rust, switch `(l, r)` to `(r, l)` for
every comparison.
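For illustration, a comparison function following the C convention above, together with a qsort-style call; the `quadsort(array, nmemb, size, cmp)` signature is assumed to mirror qsort's, as described:

```c
#include <stdio.h>
#include <stdlib.h>
#include "quadsort.h" // assumed single-header include from the repository

// Overflow-safe three-way comparison following the C convention:
// negative when l < r, zero when equal, positive when l > r.
int cmp_int(const void *a, const void *b)
{
    int l = *(const int *) a;
    int r = *(const int *) b;

    return (l > r) - (l < r);
}

int main(void)
{
    int array[] = { 5, 1, 4, 2, 3 };

    // same calling convention as qsort
    quadsort(array, 5, sizeof(int), cmp_int);

    for (int i = 0; i < 5; i++)
    {
        printf("%d ", array[i]);
    }
    printf("\n");

    return 0;
}
```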
Memory
By default quadsort uses between n and n / 4 swap memory. If memory allocation fails quadsort will switch to sorting in-place through rotations. The minimum memory requirement is 32 elements of stack memory.
Performance
Quadsort is one of the fastest merge sorts written to date. It is faster than quicksort for most data distributions, with the notable exception of generic data.
On arrays exceeding the L1 cache quicksort has an advantage due to its ability to partition. For small arrays quadsort has a significant advantage due to quicksort's inability to cost-effectively pick a reliable pivot.
To take full advantage of branchless operations the cmp macro needs to be uncommented in bench.c, which will increase the performance by 30% on primitive types.
Variants
- blitsort is a hybrid stable in-place rotate quicksort / quadsort.
- crumsort is a hybrid unstable in-place quicksort / quadsort.
- fluxsort is a hybrid stable quicksort / quadsort.
- gridsort is a hybrid stable cubesort / quadsort. Gridsort makes O(n) moves rather than the typical O(n log n) moves. It is an online sort and might be of interest to those interested in data structures and sorting very large arrays.
- twinsort is a simplified quadsort with a much smaller code size. Twinsort might be of use to people who want to port or understand quadsort; it does not use pointers or gotos.
- wolfsort is a hybrid stable radixsort / fluxsort with improved performance on random data. It's mostly a proof of concept that only works on unsigned 32 bit integers.
- Robin Hood Sort is a hybrid stable radixsort / dropsort with improved performance on random and generic data. It has a compilation option to use quadsort for its merging.
Credits
I personally invented the quad swap, quad galloping merge, parity merge, branchless parity merge,
monobound binary search, bridge rotation, and trinity rotation.
The ping-pong quad merge had been independently implemented in wikisort prior to quadsort, and
likely by others as well.
The monobound binary search has been independently implemented, often referred to as a branchless binary search.
Special kudos to Musiccombo and Co for getting me interested in rotations and branchless logic.
Visualization
In the visualization below nine tests are performed on 256 elements.
- Random order
- Ascending order
- Ascending Saw
- Generic random order
- Descending order
- Descending Saw
- Random tail
- Random half
- Ascending tiles
The upper half shows the swap memory and the bottom half shows the main memory.
Colors are used to differentiate various operations. Quad swaps are in cyan, reversals in magenta, skips in green, parity merges in orange, bridge rotations in yellow, and trinity rotations are in violet.
The visualization is available on YouTube and there's also a YouTube video of a Java port of quadsort in ArrayV on a wide variety of data distributions.
Benchmark: quadsort vs std::stable_sort vs timsort
The following benchmark was run on WSL 2 gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
using the wolfsort benchmark. The source code was compiled using
`g++ -O3 -w -fpermissive bench.c`. Stablesort is g++'s std::stable_sort function.
Each test was run 100 times on 100,000 elements. A table with the best and
average time in seconds is provided below.