Make DiskBtreeReader::{visit, get} async #4863

arpad-m · 2023-08-01T13:14:48Z

Problem

DiskBtreeReader::get and DiskBtreeReader::visit both call read_blk internally, which we would like to make async in the future. This PR focuses on making the interface of these two functions async. There is further work to be done in forms of making visit to not be recursive any more, similar to #4838. For that, see #4884.

Builds on top of #4839, part of #4743

Summary of changes

Make DiskBtreeReader::get and DiskBtreeReader::visit async functions and await in the places that call these functions.

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

github-actions · 2023-08-01T13:19:23Z

1264 tests run: 1213 passed, 0 failed, 51 skipped (full report)

koivunej

Pre-emptive approval since this is about changing the API, not the impl (yet).

## Problem The `DiskBtreeReader::visit` function calls `read_blk` internally, and while #4863 converted the API of `visit` to async, the internal function is still recursive. So, analogously to #4838, we turn the recursive function into an iterative one. ## Summary of changes First, we prepare the change by moving the for loop outside of the case switch, so that we only have one loop that calls recursion. Then, we switch from using recursion to an approach where we store the search path inside the tree on a stack on the heap. The caller of the `visit` function can control when the search over the B-Tree ends, by returning `false` from the closure. This is often used to either only find one specific entry (by always returning `false`), but it is also used to iterate over all entries of the B-tree (by always returning `true`), or to look for ranges (mostly in tests, but `get_value_reconstruct_data` also has such a use). Each stack entry contains two things: the block number (aka the block's offset), and a children iterator. The children iterator is constructed depending on the search direction, and with the results of a binary search over node's children list. It is the only thing that survives a spilling/push to the stack, everything else is reconstructed. In other words, each stack spill, will, if the search is still ongoing, cause an entire re-parsing of the node. Theoretically, this would be a linear overhead in the number of leaves the search visits. However, one needs to note: * the workloads to look for a specific entry are just visiting one leaf, ever, so this is mostly about workloads that visit larger ranges, including ones that visit the entire B-tree. * the requests first hit the page cache, so often the cost is just in terms of node deserialization * for nodes that only have leaf nodes as children, no spilling to the stack-on-heap happens (outside of the initial request where the iterator is `None`). In other words, for balanced trees, the spilling overhead is $\Theta\left(\frac{n}{b^2}\right)$, where `b` is the branching factor and `n` is the number of nodes in the tree. The B-Trees in the current implementation have a branching factor of roughly `PAGE_SZ/L` where `PAGE_SZ` is 8192, and `L` is `DELTA_KEY_SIZE = 26` or `KEY_SIZE = 18` in production code, so this gives us an estimate that we'd be re-loading an inner node for every 99000 leaves in the B-tree in the worst case. Due to these points above, I'd say that not fully caching the inner nodes with inner children is reasonable, especially as we also want to be fast for the "find one specific entry" workloads, where the stack content is never accessed: any action to make the spilling computationally more complex would contribute to wasted cycles here, even if these workloads "only" spill one node for each depth level of the b-tree (which is practically always a low single-digit number, Kleppmann points out on page 81 that for branching factor 500, a four level B-tree with 4 KB pages can store 250 TB of data). But disclaimer, this is all stuff I thought about in my head, I have not confirmed it with any benchmarks or data. Builds on top of #4863, part of #4743

koivunej approved these changes Aug 1, 2023

View reviewed changes

arpad-m force-pushed the arpad/pageserver_io_async_btree_visit branch 2 times, most recently from ce6b04c to 9a906f6 Compare August 2, 2023 23:40

arpad-m mentioned this pull request Aug 3, 2023

Make the implementation of DiskBtreeReader::visit non-recursive #4884

Merged

5 tasks

arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch from dc9889b to 10d2740 Compare August 3, 2023 12:15

Base automatically changed from arpad/pageserver_io_async_kmerge_slice to main August 3, 2023 13:30

Make DiskBtreeReader::get async

a9efa97

arpad-m force-pushed the arpad/pageserver_io_async_btree_visit branch from 9a906f6 to 7f96bb3 Compare August 3, 2023 13:39

arpad-m marked this pull request as ready for review August 3, 2023 13:44

arpad-m requested review from a team as code owners August 3, 2023 13:44

arpad-m requested review from knizhnik and shanyp and removed request for a team August 3, 2023 13:44

arpad-m added 2 commits August 3, 2023 16:16

Make DiskBtreeReader::visit async

e7b9237

Merge search_node and search_recurse functions

4106397

arpad-m force-pushed the arpad/pageserver_io_async_btree_visit branch from 7f96bb3 to 4106397 Compare August 3, 2023 14:16

arpad-m merged commit a241c8b into main Aug 3, 2023

arpad-m deleted the arpad/pageserver_io_async_btree_visit branch August 3, 2023 15:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make DiskBtreeReader::{visit, get} async #4863

Make DiskBtreeReader::{visit, get} async #4863

arpad-m commented Aug 1, 2023 •

edited

Loading

github-actions bot commented Aug 1, 2023 •

edited

Loading

koivunej left a comment

Make DiskBtreeReader::{visit, get} async #4863

Make DiskBtreeReader::{visit, get} async #4863

Conversation

arpad-m commented Aug 1, 2023 • edited Loading

Problem

Summary of changes

Checklist before requesting a review

Checklist before merging

github-actions bot commented Aug 1, 2023 • edited Loading

1264 tests run: 1213 passed, 0 failed, 51 skipped (full report)

koivunej left a comment

Choose a reason for hiding this comment

arpad-m commented Aug 1, 2023 •

edited

Loading

github-actions bot commented Aug 1, 2023 •

edited

Loading