btree: Reduce opportunities for branch mispredictions in binary search
We currently use a textbook binary search algorithm. This is known to suffer from branch misprediction penalties. Speculative execution down a mispredicted path can also clobber cache lines, which is detrimental to performance.

I recently read about a branchless binary search algorithm published by Knuth called Shar's Algorithm (not to be confused with Shor's algorithm). It is well known to outperform the textbook binary search algorithm, even though it does an extra comparison. It is typically presented for power of 2 array sizes, and adapting it to support non-power of 2 array sizes is difficult to do in a way that is convincingly correct. Adapting it to fill out zfs_btree_index_t is even more complex.

Therefore, I invented my own algorithm by refactoring the textbook algorithm using a few tricks:

1. x = (y < z) ? a : b is equivalent to x = a * (y < z) + b * (y >= z)

2. x = (y > z) ? a : b is equivalent to x = a * (y > z) + b * (y <= z)

3. The maximum number of iterations will be highbit(size), so we can iterate on that.

4. Ensuring that we get the same results means that we need to handle early matches. This means we must avoid changes to the values of comp and idx when comp is 0, which we can do by computing idx = !!comp * (min + max) / 2 + !comp * idx. This will make us repeat the previous comparison.

5. If we delete the equal to case from the equivalencies used in calculating min and max, we can cause them to be 0 when we have an early match. This allows us to drop !!comp, since (0 + 0) / 2 is 0.

6. There is still the matter of maintaining behavior when min >= max, where the original algorithm will exit the loop. We achieve this by modifying the idx assignment to avoid changes to the value whenever (min >= max). We multiply the first term by (min < max) and replace !comp with (!comp | (min >= max)) so that idx will remain unchanged whenever the original algorithm would terminate early. min will be allowed to increment under these conditions, but max will remain the same, such that the function will always return as if it were the original.

The result is that we avoid both branch misprediction penalties in the loop and cache pollution, because we only access the memory locations that the binary search actually needs. This comes at the expense of some additional computations, but we are likely to stall waiting on memory accesses otherwise, so the additional computations should be effectively free.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
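
As a rough illustration of tricks 1 to 3 and the idea behind tricks 4 to 6, the sketch below applies them to a plain int array and keeps a textbook version alongside for comparison. It is not the code from this commit: search_result_t, cmp_int, search_textbook, search_fixed and count_mismatches are hypothetical names, highbit is reimplemented locally, a single go mask stands in for the exact idx expression in item 6, and min and max are held rather than zeroed on a match. Whether a given compiler emits branch-free instructions for it is also an assumption that would need checking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; loosely plays the role of zfs_btree_index_t. */
typedef struct {
	size_t offset;	/* match position, or insertion point if absent */
	int before;	/* 1 if the key is absent and belongs before offset */
} search_result_t;

/* Three-way compare written without if/else: yields -1, 0 or 1. */
int
cmp_int(int a, int b)
{
	return ((a > b) - (a < b));
}

/* 1-based position of the highest set bit; 0 when n == 0. */
unsigned
highbit(size_t n)
{
	unsigned h = 0;
	while (n != 0) {
		h++;
		n >>= 1;
	}
	return (h);
}

/* Textbook binary search, kept as the reference behavior. */
search_result_t
search_textbook(const int *a, size_t n, int key)
{
	size_t min = 0, max = n;
	while (max > min) {
		size_t idx = (min + max) / 2;
		int comp = cmp_int(a[idx], key);
		if (comp < 0)
			min = idx + 1;
		else if (comp > 0)
			max = idx;
		else
			return ((search_result_t){ idx, 0 });
	}
	return ((search_result_t){ max, 1 });
}

/* Fixed-iteration variant: every select is arithmetic, not if/else. */
search_result_t
search_fixed(const int *a, size_t n, int key)
{
	size_t min = 0, max = n, idx = 0;
	int comp = 1;	/* nonzero means nothing has matched yet */

	/* Trick 3: highbit(n) iterations always suffice. */
	for (unsigned i = highbit(n); i > 0; i--) {
		/* Move idx only while the textbook loop would still run. */
		size_t go = (size_t)((min < max) & (comp != 0));
		idx = go * ((min + max) / 2) + (1 - go) * idx;
		/* When go is 0 this repeats the previous comparison. */
		comp = cmp_int(a[idx], key);
		/* Tricks 1 and 2: x = a * (cond) + b * (!cond). */
		min = (size_t)(comp < 0) * (idx + 1) + (size_t)(comp >= 0) * min;
		max = (size_t)(comp > 0) * idx + (size_t)(comp <= 0) * max;
	}

	size_t found = (size_t)(comp == 0);
	return ((search_result_t){ found * idx + (1 - found) * max, comp != 0 });
}

/* Counts (length, key) pairs where the two searches disagree. */
int
count_mismatches(void)
{
	int a[17], bad = 0;
	for (size_t i = 0; i < 17; i++)
		a[i] = (int)(2 * i + 1);	/* sorted odd values 1..33 */
	for (size_t n = 0; n <= 17; n++) {
		for (int key = 0; key <= 34; key++) {
			search_result_t t = search_textbook(a, n, key);
			search_result_t f = search_fixed(a, n, key);
			bad += (t.offset != f.offset) | (t.before != f.before);
		}
	}
	return (bad);
}
```

Calling count_mismatches() compares the two searches over every prefix length from 0 to 17 with both present and absent keys. The fixed loop count trades a few redundant comparisons at an already-loaded element for a loop whose trip count does not depend on the data.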