-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make DiskBtreeReader::{visit, get} async #4863
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
1264 tests run: 1213 passed, 0 failed, 51 skipped (full report) |
koivunej
approved these changes
Aug 1, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-emptive approval since this is about changing the API, not the impl (yet).
arpad-m
force-pushed
the
arpad/pageserver_io_async_btree_visit
branch
2 times, most recently
from
August 2, 2023 23:40
ce6b04c
to
9a906f6
Compare
5 tasks
arpad-m
force-pushed
the
arpad/pageserver_io_async_kmerge_slice
branch
from
August 3, 2023 12:15
dc9889b
to
10d2740
Compare
arpad-m
force-pushed
the
arpad/pageserver_io_async_btree_visit
branch
from
August 3, 2023 13:39
9a906f6
to
7f96bb3
Compare
arpad-m
requested review from
knizhnik and
shanyp
and removed request for
a team
August 3, 2023 13:44
arpad-m
force-pushed
the
arpad/pageserver_io_async_btree_visit
branch
from
August 3, 2023 14:16
7f96bb3
to
4106397
Compare
arpad-m
added a commit
that referenced
this pull request
Aug 10, 2023
## Problem The `DiskBtreeReader::visit` function calls `read_blk` internally, and while #4863 converted the API of `visit` to async, the internal function is still recursive. So, analogously to #4838, we turn the recursive function into an iterative one. ## Summary of changes First, we prepare the change by moving the for loop outside of the case switch, so that we only have one loop that calls recursion. Then, we switch from using recursion to an approach where we store the search path inside the tree on a stack on the heap. The caller of the `visit` function can control when the search over the B-Tree ends, by returning `false` from the closure. This is often used to either only find one specific entry (by always returning `false`), but it is also used to iterate over all entries of the B-tree (by always returning `true`), or to look for ranges (mostly in tests, but `get_value_reconstruct_data` also has such a use). Each stack entry contains two things: the block number (aka the block's offset), and a children iterator. The children iterator is constructed depending on the search direction, and with the results of a binary search over node's children list. It is the only thing that survives a spilling/push to the stack, everything else is reconstructed. In other words, each stack spill, will, if the search is still ongoing, cause an entire re-parsing of the node. Theoretically, this would be a linear overhead in the number of leaves the search visits. However, one needs to note: * the workloads to look for a specific entry are just visiting one leaf, ever, so this is mostly about workloads that visit larger ranges, including ones that visit the entire B-tree. * the requests first hit the page cache, so often the cost is just in terms of node deserialization * for nodes that only have leaf nodes as children, no spilling to the stack-on-heap happens (outside of the initial request where the iterator is `None`). In other words, for balanced trees, the spilling overhead is $\Theta\left(\frac{n}{b^2}\right)$, where `b` is the branching factor and `n` is the number of nodes in the tree. The B-Trees in the current implementation have a branching factor of roughly `PAGE_SZ/L` where `PAGE_SZ` is 8192, and `L` is `DELTA_KEY_SIZE = 26` or `KEY_SIZE = 18` in production code, so this gives us an estimate that we'd be re-loading an inner node for every 99000 leaves in the B-tree in the worst case. Due to these points above, I'd say that not fully caching the inner nodes with inner children is reasonable, especially as we also want to be fast for the "find one specific entry" workloads, where the stack content is never accessed: any action to make the spilling computationally more complex would contribute to wasted cycles here, even if these workloads "only" spill one node for each depth level of the b-tree (which is practically always a low single-digit number, Kleppmann points out on page 81 that for branching factor 500, a four level B-tree with 4 KB pages can store 250 TB of data). But disclaimer, this is all stuff I thought about in my head, I have not confirmed it with any benchmarks or data. Builds on top of #4863, part of #4743
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
DiskBtreeReader::get
andDiskBtreeReader::visit
both callread_blk
internally, which we would like to make async in the future. This PR focuses on making the interface of these two functionsasync
. There is further work to be done in forms of makingvisit
to not be recursive any more, similar to #4838. For that, see #4884.Builds on top of #4839, part of #4743
Summary of changes
Make
DiskBtreeReader::get
andDiskBtreeReader::visit
async functions andawait
in the places that call these functions.Checklist before requesting a review
Checklist before merging