You’re a busy person so I’ll first jump right to it. Here it is, the fastest general (and simple) binary search C++ implementation:
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound(
      ForwardIt first, ForwardIt last, const T& value, Compare comp) {
   auto length = last - first;
   while (length > 0) {
      auto rem = length % 2;
      length /= 2;
      if (comp(first[length], value)) {
         first += length + rem;
      }
   }
   return first;
}
Same function interface as std::lower_bound, but 2x faster, and shorter. “Branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump. We will explore compiler options, even faster versions, fully branchless, and caveats towards the end of the article. There’s a significant update too. Oh and I’m sorry about just thrusting C++ code on you. Rude, I know. You don’t really need to know C++ to understand this article, just iterators (first and last), basically pointers to elements in an array, though note they can point one past the last array entry. Ignore template, class, constexpr and &. If only there was a clean fast bare-metal language to write all this in.. [1] [2]
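Here’s a quick usage sketch (mine, not from the original benchmarks): you call it exactly like the comparator form of std::lower_bound.
#include <functional>
#include <iostream>
#include <vector>
// (assumes the sb_lower_bound definition from above)

int main() {
   std::vector<int> v{1, 3, 3, 5, 8, 13};
   // First position where 3 could be inserted while keeping v sorted
   auto it = sb_lower_bound(v.begin(), v.end(), 3, std::less<int>{});
   std::cout << (it - v.begin()) << '\n';   // prints 1
   // When value is larger than every element, we get the end iterator
   auto it2 = sb_lower_bound(v.begin(), v.end(), 99, std::less<int>{});
   std::cout << (it2 == v.end()) << '\n';   // prints 1 (true)
}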
Binary search intro
You have a sorted list and want to find where a value would fit. In C++ you’d use std::lower_bound, which returns the first position (iterator) for which the comparison (usually <) fails, or last if the comparison is true for all elements. Let’s write that, so surprise coding interview question! (In C++. Good luck.)
The binary in binary search comes from splitting up the list into two at some middle item and comparing the middle against the given value. Based on the comparison result we pick which of the two lists to keep looking in. We start with a list of some length = last - first that starts at the given iterator first. We need some loop that keeps going until the list is empty, i.e. while (length > 0). Pick the/a middle at length / 2, do a comparison, and update the current list, which we’ll do by updating first and length. Here goes:
// std_lower_bound()
auto length = last - first;
while (length > 0) {
   auto half = length / 2;
   if (comp(first[half], value)) {
      first += half + 1;
      length -= half + 1;
   } else {
      length = half;
   }
}
return first;
That was straightforward. What we got was a slightly refactored version of what gcc/libstdc++ uses or what llvm/libc++ uses. Roughly the same speed; it even (sometimes) compiles down to the same assembly in the loop.
Branch prediction
“What’s slow about this?” Not much but glad you asked, great question. Processors have gotten faster and faster over the years, and part of the reason why is branch prediction. Short explanation: for speed, the CPU attempts to execute instructions in parallel by dividing each instruction execution into multiple stages, say F-D-E-W (fetch, decode, execute, writeback). With a careful design it can make progress on multiple stages of different instructions at the same time (example: instruction 5 in stage W, 6 in E, 7 in D, 8 in F). The complication comes from branches in the code, i.e. conditional jump instructions, where depending on some result the next instruction is either X or Y. The CPU predicts one option, X, and starts running through the stages, eventually including the stages of say X+1 and X+2. When the result becomes available and it turns out it should have been Y, all the work on X, X+1 and X+2 is thrown away. Branch mispredictions are expensive because the CPU could have already made progress on Y, Y+1 and Y+2 instead.
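To see the effect in isolation, here’s a toy experiment (my sketch, not from this article’s benchmarks): the same loop over the same data, where the branch is unpredictable on shuffled input and nearly perfectly predictable once the data is sorted. (Depending on compiler and flags, the branch may itself get converted to a conditional move, which would hide the effect.)
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Sums only elements >= 128; the branch outcome drives the timing difference
std::int64_t sum_big(const std::vector<int>& v) {
   std::int64_t sum = 0;
   for (int x : v) {
      if (x >= 128) {  // predictable on sorted input, ~random otherwise
         sum += x;
      }
   }
   return sum;
}

int main() {
   std::vector<int> data(1 << 22);
   std::mt19937 rng(42);
   for (int& x : data) x = rng() % 256;

   auto time_it = [&](const char* label) {
      auto start = std::chrono::steady_clock::now();
      volatile std::int64_t sum = sum_big(data);
      auto stop = std::chrono::steady_clock::now();
      std::cout << label << ": "
                << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                << " ms (sum " << sum << ")\n";
   };

   time_it("shuffled");  // branch mispredicts roughly half the time
   std::sort(data.begin(), data.end());
   time_it("sorted");    // branch predicts almost perfectly
}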
Branch prediction is great, but not for binary search. It’s a search, and you only search for something if you don’t know exactly where it is, otherwise you’d just get it. Which means that if (comp(first[half], value)) is unpredictable in common use of std::lower_bound.
(Image credit: @practicemakespink)
Let’s help the processor.
-We:
// bstd_lower_bound()
auto length = last - first;
while (length > 0) {
   auto half = length / 2;
   bool compare = comp(first[half], value);
   // No more ifs
   first = compare ? first + half + 1 : first;
   length = compare ? length - half - 1 : half;
}
return first;
-Clang compiler: “That’s not how this works!”
-We: -mllvm -x86-cmov-converter=false
-Clang compiler: “Yes, boss.” [3]
The result is 25% faster as it uses 2 conditional move instructions. Not bad. But it turns out that -mllvm -x86-cmov-converter=false, which we’ll shorten to -cmov, speeds up std::lower_bound just as much, because clang is smart enough to figure out how to convert the if/else to conditional moves. gcc doesn’t have such an option and generally just doesn’t care about what you want.
Overall, there’s currently no good way to tell either clang [4] or gcc [5] to use a conditional move in just a certain situation. I’m trying to not make this article about finicky compilers, so let’s move on.
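For completeness, the loop bodies shown in this article all slot into the same interface as sb_lower_bound; here’s bstd_lower_bound written out as a full function (a straightforward wrapper, same template parameters as before):
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt bstd_lower_bound(
      ForwardIt first, ForwardIt last, const T& value, Compare comp) {
   auto length = last - first;
   while (length > 0) {
      auto half = length / 2;
      bool compare = comp(first[half], value);
      // Conditional selects instead of an if/else branch
      first = compare ? first + half + 1 : first;
      length = compare ? length - half - 1 : half;
   }
   return first;
}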
What started this
Why are we even talking about speeding up binary searches anyway? Why am I roping you into this? Cause someone else roped me into this.
(Image: Rick and Morty, “Meeseeks and Destroy”)
I saw Malte Skarupke’s translation of “Shar’s algorithm” into a C++ binary search (branchless_lower_bound) and I couldn’t help but think “it’s not optimal”. Malte’s version sometimes compares value against the same element multiple times. So I wondered, what is the optimal ‘branchless’ binary search? That led to sb_lower_bound, which is ~20% faster than Malte’s version, which he had in turn tested as 2x faster than GCC’s std::lower_bound.
“What’s an ‘optimal’ binary search anyway?” Good question. I think a binary search is ‘optimal’ if it completes by doing the minimum number of comparisons. This is very useful when you have a (relatively) slow comparison function. Malte noted his version is slower than std::lower_bound for binary searching a large number of strings.
Looking at std::lower_bound, it returns an iterator, which can point to any of the list elements but also one past the end. For a list of size n there are n+1 possible options. Thus for a list of size 2^k - 1 there are 2^k possible options. In this case the optimal number of comparisons is k. Provably so, as being able to distinguish between all 2^k options requires k bits of information, and each comparison gives us 1 bit of information (true vs false). (For example, a list of size 7 has 8 possible answers, and 3 comparisons yield exactly 3 bits, enough to pick one of 8.) Translating this case into code, we get:
// Really fast but only works when length is a power of 2 minus 1
// When not, it can result in out of bounds accesses, e.g. for
// array [0, 1, 2, 3, 4, 5] and value 6
auto length = last - first;
while (length > 0) {
   length /= 2;
   if (comp(first[length], value)) {
      first += length + 1;
   }
}
return first;
With clang -cmov the loop compiles down to 6 instructions, one of which is cmov. The reason this (and Malte’s code) is so fast is that only updating first depends on the comparison result.
sb_lower_bound
Now let’s look at sb_lower_bound (named after simple branchless). It actually took me longer to stumble upon than faster versions (yet to be presented) as it’s not ‘optimal’.
auto length = last - first;
while (length > 0) {
   auto rem = length % 2;
   length /= 2;
   if (comp(first[length], value)) {
      first += length + rem;
   }
}
return first;
Every time we split the list, if the number of elements happens to be even and comp returns true, then we don’t skip over enough elements. For n elements, where 2^k <= n < 2^(k+1), sb_lower_bound will always make k+1 comparisons, while std::lower_bound will either make k or k+1 comparisons. On average std::lower_bound will make about log2(n+1) comparisons. Overall sb_lower_bound is faster as it has significantly fewer instructions in the loop. The comparison function has to be really slow for the difference between k+1 and log2(n+1) comparisons to matter.
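You can check the comparison counts yourself with a counting comparator (a quick test sketch of mine, not from the article):
#include <algorithm>
#include <iostream>
#include <vector>
// (assumes the sb_lower_bound definition from above)

int main() {
   // 100 elements: 2^6 = 64 <= 100 < 128 = 2^7, so k = 6
   std::vector<int> v(100);
   for (int i = 0; i < 100; ++i) v[i] = i;

   long count = 0;
   auto counting_less = [&count](int a, int b) {
      ++count;
      return a < b;
   };

   long sb_total = 0, std_total = 0;
   for (int value = 0; value <= 100; ++value) {
      count = 0;
      sb_lower_bound(v.begin(), v.end(), value, counting_less);
      sb_total += count;
      count = 0;
      std::lower_bound(v.begin(), v.end(), value, counting_less);
      std_total += count;
   }
   // Expect sb_lower_bound to always use k+1 = 7 comparisons,
   // and std::lower_bound to use 6 or 7 depending on the value
   std::cout << "sb average:  " << sb_total / 101.0 << '\n';
   std::cout << "std average: " << std_total / 101.0 << '\n';
}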
A second caveat is that, currently, gcc does not emit branchless code for sb_lower_bound regardless of optimization level. It doesn’t emit branchless code for std::lower_bound either, so they end up about as fast. We can try to write some inline assembly to force gcc to use cmov, but there’s a tradeoff. The simple way results in more instructions than necessary. The alternative requires writing different assembly code for every possible type of value (int, float, etc.).
// asm_lower_bound, works for x86 only
auto length = last - first;
while (length > 0) {
   auto rem = length % 2;
   length /= 2;
   auto firstplus = first + length + rem;
   // Does a comparison which sets some x86 FLAGS like CF or ZF
   bool compare = comp(first[length], value);
   // Inline assembly doesn't support passing bools straight into FLAGS
   // so we ask the compiler to copy it from FLAGS into a register
   __asm__(
      // Then we test the register, which sets ZF=!compare and CF=0
      // Reference: https://www.felixcloutier.com/x86/test
      "test %[compare],%[compare]\n"
      // cmova copies firstplus into first if ZF=0 and CF=0
      // Reference: https://www.felixcloutier.com/x86/cmovcc
      "cmova %[firstplus],%[first]\n"
      : [first] "+r"(first)
      : [firstplus] "r"(firstplus), [compare] "r"(compare)
   );
}
return first;
2x faster than the gcc version of std::lower_bound, but slightly slower than sb_lower_bound with clang -cmov. Presented just for demonstration purposes.
More optimal
I promised an even faster version in the intro. Here it is:
// bb_lower_bound
auto length = last - first;
// while length isn't a power of 2 minus 1
while (length & (length + 1)) {
   auto step = length / 8 * 6 + 1; // MAGIC
   if (comp(first[step], value)) {
      first += step + 1;
      length -= step + 1;
   } else {
      length = step;
      break;
   }
}
while (length != 0) {
   length /= 2;
   if (comp(first[length], value)) {
      first += length + 1;
   }
}
return first;
Let’s quickly go over length & (length + 1). Example: length is 110011, length+1 is 110100, length & (length+1) is 110000. Note how they always share their most significant 1, except for when length+1 carries over all set bits.
carries over all set bits. That case is when
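As a quick sanity check of the bit trick (my snippet, not from the article): length & (length + 1) is zero exactly for lengths of the form 2^k - 1.
#include <iostream>

int main() {
   // Prints only 0, 1, 3, 7, 15: exactly the values of form 2^k - 1
   for (long length = 0; length <= 16; ++length) {
      if ((length & (length + 1)) == 0) {
         std::cout << length << ' ';
      }
   }
   std::cout << '\n';
}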