Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DiskBtreeReader::{visit, get} async #4863

Merged
merged 3 commits into from
Aug 3, 2023

Conversation

arpad-m
Copy link
Member

@arpad-m arpad-m commented Aug 1, 2023

Problem

DiskBtreeReader::get and DiskBtreeReader::visit both call read_blk internally, which we would like to make async in the future. This PR focuses on making the interface of these two functions async. There is further work to be done in forms of making visit to not be recursive any more, similar to #4838. For that, see #4884.

Builds on top of #4839, part of #4743

Summary of changes

Make DiskBtreeReader::get and DiskBtreeReader::visit async functions and await in the places that call these functions.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@github-actions
Copy link

github-actions bot commented Aug 1, 2023

1264 tests run: 1213 passed, 0 failed, 51 skipped (full report)


Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-emptive approval since this is about changing the API, not the impl (yet).

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_btree_visit branch 2 times, most recently from ce6b04c to 9a906f6 Compare August 2, 2023 23:40
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch from dc9889b to 10d2740 Compare August 3, 2023 12:15
Base automatically changed from arpad/pageserver_io_async_kmerge_slice to main August 3, 2023 13:30
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_btree_visit branch from 9a906f6 to 7f96bb3 Compare August 3, 2023 13:39
@arpad-m arpad-m marked this pull request as ready for review August 3, 2023 13:44
@arpad-m arpad-m requested review from a team as code owners August 3, 2023 13:44
@arpad-m arpad-m requested review from knizhnik and shanyp and removed request for a team August 3, 2023 13:44
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_btree_visit branch from 7f96bb3 to 4106397 Compare August 3, 2023 14:16
@arpad-m arpad-m merged commit a241c8b into main Aug 3, 2023
@arpad-m arpad-m deleted the arpad/pageserver_io_async_btree_visit branch August 3, 2023 15:36
arpad-m added a commit that referenced this pull request Aug 10, 2023
## Problem

The `DiskBtreeReader::visit` function calls `read_blk` internally, and
while #4863 converted the API of `visit` to async, the internal function
is still recursive. So, analogously to #4838, we turn the recursive
function into an iterative one.

## Summary of changes

First, we prepare the change by moving the for loop outside of the case
switch, so that we only have one loop that calls recursion. Then, we
switch from using recursion to an approach where we store the search
path inside the tree on a stack on the heap.

The caller of the `visit` function can control when the search over the
B-Tree ends, by returning `false` from the closure. This is often used
to either only find one specific entry (by always returning `false`),
but it is also used to iterate over all entries of the B-tree (by always
returning `true`), or to look for ranges (mostly in tests, but
`get_value_reconstruct_data` also has such a use).

Each stack entry contains two things: the block number (aka the block's
offset), and a children iterator. The children iterator is constructed
depending on the search direction, and with the results of a binary
search over node's children list. It is the only thing that survives a
spilling/push to the stack, everything else is reconstructed. In other
words, each stack spill, will, if the search is still ongoing, cause an
entire re-parsing of the node. Theoretically, this would be a linear
overhead in the number of leaves the search visits. However, one needs
to note:

* the workloads to look for a specific entry are just visiting one leaf,
ever, so this is mostly about workloads that visit larger ranges,
including ones that visit the entire B-tree.
* the requests first hit the page cache, so often the cost is just in
terms of node deserialization
* for nodes that only have leaf nodes as children, no spilling to the
stack-on-heap happens (outside of the initial request where the iterator
is `None`). In other words, for balanced trees, the spilling overhead is
$\Theta\left(\frac{n}{b^2}\right)$, where `b` is the branching factor
and `n` is the number of nodes in the tree. The B-Trees in the current
implementation have a branching factor of roughly `PAGE_SZ/L` where
`PAGE_SZ` is 8192, and `L` is `DELTA_KEY_SIZE = 26` or `KEY_SIZE = 18`
in production code, so this gives us an estimate that we'd be re-loading
an inner node for every 99000 leaves in the B-tree in the worst case.

Due to these points above, I'd say that not fully caching the inner
nodes with inner children is reasonable, especially as we also want to
be fast for the "find one specific entry" workloads, where the stack
content is never accessed: any action to make the spilling
computationally more complex would contribute to wasted cycles here,
even if these workloads "only" spill one node for each depth level of
the b-tree (which is practically always a low single-digit number,
Kleppmann points out on page 81 that for branching factor 500, a four
level B-tree with 4 KB pages can store 250 TB of data).

But disclaimer, this is all stuff I thought about in my head, I have not
confirmed it with any benchmarks or data.

Builds on top of #4863, part of #4743
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants