btree: Reduce opportunities for branch mispredictions in binary search

We currently use a textbook binary search algorithm. This is known to
suffer from branch misprediction penalties. The branch mispredictions
can also clobber cache lines, which is detrimental to performance.

I recently read about a branchless binary search algorithm published by
Knuth called Shar's Algorithm (not to be confused with Shor's
algorithm). It is well known to outperform the textbook binary search
algorithm, at the cost of an extra comparison. It is typically presented
for power of 2 array sizes, and adapting it to support non-power of 2
array sizes is difficult to do in a way that is convincingly correct.
Adapting it to fill out zfs_btree_index_t is even more complex.

Therefore, I invented my own algorithm by refactoring the textbook
algorithm using a few tricks:

1. x = (y < z) ? a : b is equivalent to
   x = a * (y < z) + b * (y >= z), since a comparison evaluates to
   0 or 1.

2. x = (y > z) ? a : b is equivalent to
   x = a * (y > z) + b * (y <= z).

3. The maximum number of iterations will be highbit(size), so we can
   iterate on that instead of on (max > min).

4. Ensuring that we get the same results means that we need to handle
   early matches. We must avoid changing the values of comp and idx
   once comp is 0, which we can do by setting
   idx = !!comp * (min + max) / 2 + !comp * idx.
   This makes us repeat the previous comparison.

5. If we delete the equal-to case from the equivalencies used in
   calculating min and max, both become 0 when we have an early match.
   This allows us to drop !!comp, since (0 + 0) / 2 is 0.

6. There is still the matter of maintaining behavior when min >= max,
   where the original algorithm exits the loop. We achieve this by
   modifying the idx assignment so that the value does not change
   whenever (min >= max). We multiply the first term by (min < max)
   and replace !comp with (!comp | (min >= max)), so that idx remains
   unchanged whenever the original algorithm would have terminated
   early. min will be allowed to increment under these conditions, but
   max will remain the same, such that the function always returns as
   if it were the original.

The result is that we avoid both branch misprediction penalties in the
loop and cache pollution, because we only access the memory locations
that a binary search needs to access. This comes at the expense of some
additional computations, but we are likely to stall waiting on memory
accesses otherwise, so the additional computations should be
effectively free.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
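The six tricks in the commit message can be exercised outside the kernel. The following is a minimal userland sketch, not the code from module/zfs/btree.c: it searches a plain int array instead of calling tree->bt_compar, and highbit(), cmp_int(), search_textbook(), search_branchless() and count_mismatches() are hypothetical names introduced only for this illustration. highbit() is assumed to behave like highbit64(), that is, to return the 1-based index of the highest set bit and 0 for an input of 0.

```c
#include <assert.h>
#include <stdint.h>

#define	MAX_N	64

/* Assumed stand-in for highbit64(): 1-based index of highest set bit. */
static uint32_t
highbit(uint64_t v)
{
	uint32_t h = 0;

	while (v != 0) {
		h++;
		v >>= 1;
	}
	return (h);
}

/* Three-way comparison returning -1, 0 or 1. */
static int
cmp_int(int a, int b)
{
	return ((a > b) - (a < b));
}

/* Textbook search: returns 1 on a match; *off is the match or insert index. */
static int
search_textbook(const int *buf, uint32_t nelems, int value, uint32_t *off)
{
	uint32_t min = 0, max = nelems;

	while (max > min) {
		uint32_t idx = (min + max) / 2;
		int comp = cmp_int(buf[idx], value);

		if (comp < 0) {
			min = idx + 1;
		} else if (comp > 0) {
			max = idx;
		} else {
			*off = idx;
			return (1);
		}
	}
	*off = max;
	return (0);
}

/* Same search with the loop body rewritten as described in tricks 1-6. */
static int
search_branchless(const int *buf, uint32_t nelems, int value, uint32_t *off)
{
	uint32_t min = 0, max = nelems, idx = 0;
	uint32_t i = highbit(nelems);	/* trick 3: fixed iteration count */
	int comp = 1;

	while (i--) {
		/* tricks 4-6: idx is frozen after a match or once min >= max */
		idx = (min < max) * (min + max) / 2 +
		    (!comp | (min >= max)) * idx;
		comp = cmp_int(buf[idx], value);
		/* tricks 1, 2, 5: select without the equal-to case */
		min = (idx + 1) * (comp < 0) + min * (comp > 0);
		max = idx * (comp > 0) + max * (comp < 0);
	}
	*off = (comp == 0) ? idx : max;
	return (comp == 0);
}

/* Compare both searches for every array size up to max_n. */
static uint32_t
count_mismatches(uint32_t max_n)
{
	int buf[MAX_N];
	uint32_t bad = 0;

	assert(max_n <= MAX_N);
	for (uint32_t i = 0; i < MAX_N; i++)
		buf[i] = 2 * (int)i;	/* even keys, so odd probes miss */
	for (uint32_t n = 0; n <= max_n; n++) {
		for (int v = -1; v <= 2 * (int)n; v++) {
			uint32_t a, b;
			int fa = search_textbook(buf, n, v, &a);
			int fb = search_branchless(buf, n, v, &b);

			bad += (fa != fb) || (a != b);
		}
	}
	return (bad);
}
```

Calling count_mismatches(64) compares the two functions on every array size from 0 to 64, probing each key and each gap between keys. The keys are unique here, so the sketch says nothing about duplicate elements, and whether a compiler actually emits branch-free code for the loop body depends on the target and optimization level.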
openzfs · May 14, 2023 · feeb3f2
1 parent 7381ddf
Showing 1 changed file with 21 additions and 14 deletions.
diff --git a/module/zfs/btree.c b/module/zfs/btree.c
@@ -216,27 +216,34 @@ zfs_btree_create_custom(zfs_btree_t *tree,
 }
 
 /*
- * Find value in the array of elements provided. Uses a simple binary search.
+ * Find value in the array of elements provided. Uses a "branchless" binary
+ * search derived by refactoring a simple binary search to avoid branch
+ * misprediction penalties by not branching within the loop.
  */
 static void *
 zfs_btree_find_in_buf(zfs_btree_t *tree, uint8_t *buf, uint32_t nelems,
     const void *value, zfs_btree_index_t *where)
 {
 	uint32_t max = nelems;
 	uint32_t min = 0;
-	while (max > min) {
-		uint32_t idx = (min + max) / 2;
-		uint8_t *cur = buf + idx * tree->bt_elem_size;
-		int comp = tree->bt_compar(cur, value);
-		if (comp < 0) {
-			min = idx + 1;
-		} else if (comp > 0) {
-			max = idx;
-		} else {
-			where->bti_offset = idx;
-			where->bti_before = B_FALSE;
-			return (cur);
-		}
+	uint32_t idx = 0;
+	uint8_t *cur;
+	uint32_t i = highbit64(nelems);
+	int comp = 1;
+
+	while (i--) {
+		idx = (min < max) * (min + max) / 2 +
+		    (!comp | (min >= max)) * idx;
+		cur = buf + idx * tree->bt_elem_size;
+		comp = tree->bt_compar(cur, value);
+		min = (idx + 1) * (comp < 0) + min * (comp > 0);
+		max = idx * (comp > 0) + max * (comp < 0);
+	}
+
+	if (comp == 0) {
+		where->bti_offset = idx;
+		where->bti_before = B_FALSE;
+		return (cur);
 	}
 
 	where->bti_offset = max;