You’re a busy person so I’ll first jump right to it. Here it is, the fastest general (and simple) binary search C++ implementation:
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound(
      ForwardIt first, ForwardIt last, const T& value, Compare comp) {
   auto length = last - first;
   while (length > 0) {
      auto rem = length % 2;
      length /= 2;
      if (comp(first[length], value)) {
         first += length + rem;
      }
   }
   return first;
}
Same function interface as std::lower_bound, but 2x faster, and shorter. “Branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump. We will explore compiler options, even faster versions, fully branchless, and caveats towards the end of the article. There’s a significant update too. Oh and I’m sorry about just thrusting C++ code on you. Rude, I know. You don’t really need to know C++ to understand this article, just iterators (first and last), basically pointers to elements in an array, though note they can point one past the last array entry. Ignore template, class, constexpr and &. If only there was a clean fast bare-metal language to write all this in.. [1] [2]
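Here’s a quick usage sketch (mine, not from the original benchmarks): you call it exactly like the comparator form of std::lower_bound.
#include <functional>
#include <iostream>
#include <vector>
// (assumes the sb_lower_bound definition from above)

int main() {
   std::vector<int> v{1, 3, 3, 5, 8, 13};
   // First position where 3 could be inserted while keeping v sorted
   auto it = sb_lower_bound(v.begin(), v.end(), 3, std::less<int>{});
   std::cout << (it - v.begin()) << '\n';   // prints 1
   // When value is larger than every element, we get the end iterator
   auto it2 = sb_lower_bound(v.begin(), v.end(), 99, std::less<int>{});
   std::cout << (it2 == v.end()) << '\n';   // prints 1 (true)
}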
Binary search intro
You have a sorted list and want to find where a value would fit. In C++ you’d use std::lower_bound, which returns the first position (iterator) for which the comparison (usually <) fails, or last if the comparison is true for all elements. Let’s write that, so surprise coding interview question! (In C++. Good luck.)
The binary in binary search comes from splitting up the list into two at some middle item and comparing the middle against the given value. Based on the comparison result we pick which of the two lists to keep looking in. We start with a list of some length = last - first that starts at the given iterator first. We need some loop that keeps going until the list is empty, i.e. while (length > 0). Pick the/a middle at length / 2, do a comparison, and update the current list, which we’ll do by updating first and length. Here goes:
// std_lower_bound()
auto length = last - first;
while (length > 0) {
   auto half = length / 2;
   if (comp(first[half], value)) {
      first += half + 1;
      length -= half + 1;
   } else {
      length = half;
   }
}
return first;
That was straightforward. What we got was a slightly refactored version of what gcc/libstdc++ uses or what llvm/libc++ uses. Roughly the same speed; it even (sometimes) compiles down to the same assembly in the loop.
Branch prediction
“What’s slow about this?” Not much but glad you asked, great question. Processors have gotten faster and faster over the years, and part of the reason why is branch prediction. Short explanation: for speed, the CPU attempts to execute instructions in parallel by dividing each instruction execution into multiple stages, say F-D-E-W (fetch, decode, execute, writeback). With a careful design it can make progress on multiple stages of different instructions at the same time (example: instruction 5 in stage W, 6 in E, 7 in D, 8 in F). The complication comes from branches in the code, i.e. conditional jump instructions, where depending on some result the next instruction is either X or Y. The CPU predicts one option, X, and starts running through the stages, eventually including the stages of say X+1 and X+2. When the result becomes available and it turns out it should have been Y, all the work on X, X+1 and X+2 is thrown away. Branch mispredictions are expensive because the CPU could have already made progress on Y, Y+1 and Y+2 instead.
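To see the effect in isolation, here’s a toy experiment (my sketch, not from this article’s benchmarks): the same loop over the same data, where the branch is unpredictable on shuffled input and nearly perfectly predictable once the data is sorted. (Depending on compiler and flags, the branch may itself get converted to a conditional move, which would hide the effect.)
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Sums only elements >= 128; the branch outcome drives the timing difference
std::int64_t sum_big(const std::vector<int>& v) {
   std::int64_t sum = 0;
   for (int x : v) {
      if (x >= 128) {  // predictable on sorted input, ~random otherwise
         sum += x;
      }
   }
   return sum;
}

int main() {
   std::vector<int> data(1 << 22);
   std::mt19937 rng(42);
   for (int& x : data) x = rng() % 256;

   auto time_it = [&](const char* label) {
      auto start = std::chrono::steady_clock::now();
      volatile std::int64_t sum = sum_big(data);
      auto stop = std::chrono::steady_clock::now();
      std::cout << label << ": "
                << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                << " ms (sum " << sum << ")\n";
   };

   time_it("shuffled");  // branch mispredicts roughly half the time
   std::sort(data.begin(), data.end());
   time_it("sorted");    // branch predicts almost perfectly
}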
Branch prediction is great, but not for binary search. It’s a search, and you only search for something if you don’t know exactly where it is, otherwise you’d just get it. Which means that if (comp(first[half], value)) is unpredictable in common use of std::lower_bound.
(Image credit: @practicemakespink)
Let’s help the processor.
-We:
// bstd_lower_bound()
auto length = last - first;
while (length > 0) {
   auto half = length / 2;
   bool compare = comp(first[half], value);
   // No more ifs
   first = compare ? first + half + 1 : first;
   length = compare ? length - half - 1 : half;
}
return first;
-Clang compiler: “That’s not how this works!”
-We: -mllvm -x86-cmov-converter=false
-Clang compiler: “Yes, boss.” [3]
The result is 25% faster as it uses 2 conditional move instructions. Not bad. But it turns out that -mllvm -x86-cmov-converter=false, which we’ll shorten to -cmov, speeds up std::lower_bound just as much, because clang is smart enough to figure out how to convert the if/else to conditional moves. gcc doesn’t have such an option and generally just doesn’t care about what you want.
Overall, there’s currently no good way to tell either clang [4] or gcc [5] to use a conditional move in just a certain situation. I’m trying to not make this article about finicky compilers, so let’s move on.
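For completeness, the loop bodies shown in this article all slot into the same interface as sb_lower_bound; here’s bstd_lower_bound written out as a full function (a straightforward wrapper, same template parameters as before):
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt bstd_lower_bound(
      ForwardIt first, ForwardIt last, const T& value, Compare comp) {
   auto length = last - first;
   while (length > 0) {
      auto half = length / 2;
      bool compare = comp(first[half], value);
      // Conditional selects instead of an if/else branch
      first = compare ? first + half + 1 : first;
      length = compare ? length - half - 1 : half;
   }
   return first;
}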
What started this
Why are we even talking about speeding up binary searches anyway? Why am I roping you into this? Cause someone else roped me into this.
(Image: Rick and Morty, “Meeseeks and Destroy”)
I saw Malte Skarupke’s translation of “Shar’s algorithm” into a C++ binary search (branchless_lower_bound) and I couldn’t help but think “it’s not optimal”. Malte’s version sometimes compares value against the same element multiple times. So I wondered, what is the optimal ‘branchless’ binary search? That led to sb_lower_bound, which is ~20% faster than Malte’s version, which he had in turn tested as 2x faster than GCC’s std::lower_bound.
“What’s an ‘optimal’ binary search anyway?” Good question. I think a binary search is ‘optimal’ if it completes by doing the minimum number of comparisons. This is very useful when you have a (relatively) slow comparison function. Malte noted his version is slower than std::lower_bound for binary searching a large number of strings.
Looking at std::lower_bound, it returns an iterator, which can point to any of the list elements but also one past the end. For a list of size n there are n+1 possible options. Thus for a list of size 2^k - 1 there are 2^k possible options. In this case the optimal number of comparisons is k. Provably so, as being able to distinguish between all 2^k options requires k bits of information, and each comparison gives us 1 bit of information (true vs false). (For example, a list of size 7 has 8 possible answers, and 3 comparisons yield exactly 3 bits, enough to pick one of 8.) Translating this case into code, we get:
// Really fast but only works when length is a power of 2 minus 1
// When not, it can result in out of bounds accesses, e.g. for
// array [0, 1, 2, 3, 4, 5] and value 6
auto length = last - first;
while (length > 0) {
   length /= 2;
   if (comp(first[length], value)) {
      first += length + 1;
   }
}
return first;
With clang -cmov the loop compiles down to 6 instructions, one of which is cmov. The reason this (and Malte’s code) is so fast is that only updating first depends on the comparison result.
sb_lower_bound
Now let’s look at sb_lower_bound (named after simple branchless). It actually took me longer to stumble upon than faster versions (yet to be presented) as it’s not ‘optimal’.
auto length = last - first;
while (length > 0) {
   auto rem = length % 2;
   length /= 2;
   if (comp(first[length], value)) {
      first += length + rem;
   }
}
return first;
Every time we split the list, if the number of elements happens to be even and comp returns true, then we don’t skip over enough elements. For n elements, where 2^k <= n < 2^(k+1), sb_lower_bound will always make k+1 comparisons, while std::lower_bound will either make k or k+1 comparisons. On average std::lower_bound will make about log2(n+1) comparisons. Overall sb_lower_bound is faster as it has significantly fewer instructions in the loop. The comparison function has to be really slow for the difference between k+1 and log2(n+1) comparisons to matter.
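You can check the comparison counts yourself with a counting comparator (a quick test sketch of mine, not from the article):
#include <algorithm>
#include <iostream>
#include <vector>
// (assumes the sb_lower_bound definition from above)

int main() {
   // 100 elements: 2^6 = 64 <= 100 < 128 = 2^7, so k = 6
   std::vector<int> v(100);
   for (int i = 0; i < 100; ++i) v[i] = i;

   long count = 0;
   auto counting_less = [&count](int a, int b) {
      ++count;
      return a < b;
   };

   long sb_total = 0, std_total = 0;
   for (int value = 0; value <= 100; ++value) {
      count = 0;
      sb_lower_bound(v.begin(), v.end(), value, counting_less);
      sb_total += count;
      count = 0;
      std::lower_bound(v.begin(), v.end(), value, counting_less);
      std_total += count;
   }
   // Expect sb_lower_bound to always use k+1 = 7 comparisons,
   // and std::lower_bound to use 6 or 7 depending on the value
   std::cout << "sb average:  " << sb_total / 101.0 << '\n';
   std::cout << "std average: " << std_total / 101.0 << '\n';
}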
A second caveat is that, currently, gcc does not emit branchless code for sb_lower_bound regardless of optimization level. It doesn’t emit branchless code for std::lower_bound either, so they end up about as fast. We can try to write some inline assembly to force gcc to use cmov, but there’s a tradeoff. The simple way results in more instructions than necessary. The alternative requires writing different assembly code for every possible type of value (int, float, etc.).
// asm_lower_bound, works for x86 only
auto length = last - first;
while (length > 0) {
   auto rem = length % 2;
   length /= 2;
   auto firstplus = first + length + rem;
   // Does a comparison which sets some x86 FLAGS like CF or ZF
   bool compare = comp(first[length], value);
   // Inline assembly doesn't support passing bools straight into FLAGS
   // so we ask the compiler to copy it from FLAGS into a register
   __asm__(
      // Then we test the register, which sets ZF=!compare and CF=0
      // Reference: https://www.felixcloutier.com/x86/test
      "test %[compare],%[compare]\n"
      // cmova copies firstplus into first if ZF=0 and CF=0
      // Reference: https://www.felixcloutier.com/x86/cmovcc
      "cmova %[firstplus],%[first]\n"
      : [first] "+r"(first)
      : [firstplus] "r"(firstplus), [compare] "r"(compare)
   );
}
return first;
2x faster than the gcc version of std::lower_bound, but slightly slower than sb_lower_bound with clang -cmov. Presented just for demonstration purposes.
More optimal
I promised an even faster version in the intro. Here it is:
// bb_lower_bound
auto length = last - first;
// while length isn't a power of 2 minus 1
while (length & (length + 1)) {
   auto step = length / 8 * 6 + 1; // MAGIC
   if (comp(first[step], value)) {
      first += step + 1;
      length -= step + 1;
   } else {
      length = step;
      break;
   }
}
while (length != 0) {
   length /= 2;
   if (comp(first[length], value)) {
      first += length + 1;
   }
}
return first;
Let’s quickly go over length & (length + 1). Example: length is 110011, length+1 is 110100, length & (length+1) is 110000. Note how they always share their most significant 1, except for when length+1 carries over all set bits.
carries over all set bits. That case is when
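As a quick sanity check of the bit trick (my snippet, not from the article): length & (length + 1) is zero exactly for lengths of the form 2^k - 1.
#include <iostream>

int main() {
   // Prints only 0, 1, 3, 7, 15: exactly the values of form 2^k - 1
   for (long length = 0; length <= 16; ++length) {
      if ((length & (length + 1)) == 0) {
         std::cout << length << ' ';
      }
   }
   std::cout << '\n';
}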