Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

for #5004: microbenchmark & improvement #5011

Conversation

problame
Copy link
Contributor

As asked-for in #5004 (comment) this PR adds a throw-away microbenchmark and implements one improvement exposed by the microbenchmarks.

Read the two commits individually!

(This PR is on top of #5004 , I'll throw away the microbenchmarks and cherry-pick the improvements).

Results:

cs@devvm:[~/src/neon-work-2]: rm -r target/criterion/
cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
   Compiling utils v0.1.0 (/home/cs/src/neon-work-2/libs/utils)
    Finished bench [optimized + debuginfo] target(s) in 2.50s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [18.001 ns 18.229 ns 18.487 ns]
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe
write_blob/pr-5004-writer
                        time:   [38.320 ns 38.798 ns 39.291 ns]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
write_blob/pr-4994-nopagecache
                        time:   [6.2369 ns 6.3093 ns 6.3926 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
Before, each push_bytes would call get_buf_for_write.
By memoizing it, we bring the number of get_buf_for_write calls to the
same number as before this PR.

Benchmarks results show the improvement (write_blob/pr-5004-writer) but
there's still some overhead over the pre-PR situation (write_blob/old).

I don't know where it's from, but I'm ok with that overhead (17.3ns vs 25.6ns).

cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
   Compiling utils v0.1.0 (/home/cs/src/neon-work-2/libs/utils)
    Finished bench [optimized + debuginfo] target(s) in 2.43s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [17.335 ns 17.584 ns 17.853 ns]
                        change: [-4.3206% -1.9089% +0.7827%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  5 (5.00%) high mild
  5 (5.00%) high severe
write_blob/pr-5004-writer
                        time:   [25.637 ns 25.959 ns 26.361 ns]
                        change: [-34.513% -32.589% -30.577%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe
write_blob/pr-4994-nopagecache
                        time:   [6.2263 ns 6.2959 ns 6.3760 ns]
                        change: [-2.1835% -0.3094% +1.6064%] (p = 0.76 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
@github-actions
Copy link

github-actions bot commented Aug 16, 2023

1588 tests run: 1512 passed, 1 failed, 75 skipped (full report)


Failures on Posgres 14

  • test_get_tenant_size_with_multiple_branches: debug
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_get_tenant_size_with_multiple_branches[debug-pg14]"
The comment gets automatically updated with the latest test results
46a80a8 at 2023-08-17T08:16:48.325Z :recycle:

@hlinnaka
Copy link
Contributor

Am I reading this correctly, that #4994 will make this much faster in any case? Doesn't matter much what we do before that then, assuming we will merge that soonish.

shaves off a few nanoseconds in 5004, let's have it

cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
    Finished bench [optimized + debuginfo] target(s) in 0.23s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [17.930 ns 18.222 ns 18.518 ns]
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
write_blob/pr-5004-writer
                        time:   [22.153 ns 22.538 ns 22.942 ns]
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe
write_blob/pr-4994-nopagecache
                        time:   [6.2702 ns 6.3597 ns 6.4649 ns]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
with the previous commit, we now no longer need the Option

Shaves off another 2ns

cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
    Finished bench [optimized + debuginfo] target(s) in 0.22s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [18.051 ns 18.301 ns 18.552 ns]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
write_blob/pr-5004-writer
                        time:   [19.766 ns 20.019 ns 20.314 ns]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
write_blob/pr-4994-nopagecache
                        time:   [6.3191 ns 6.3981 ns 6.4814 ns]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
@problame
Copy link
Contributor Author

problame commented Aug 16, 2023

Am I reading this correctly, that #4994 will make this much faster in any case? Doesn't matter much what we do before that then, assuming we will merge that soonish.

Yeah, one might argue though that the 4994 variant of the benchmark isn't completely comparable to the other two variants because it doesn't model the pre-warming of the page cache that I do in 4994

@problame
Copy link
Contributor Author

With the latest optimizations, we're at 18ns (old) vs 20ns (pr-5004).

It's not noise, I get that delta quite consistently.

But I'm ok with that hit for the improved readability / reduced clunkiness.

Everyone on board with that decision?

Please review the ephemeral_file.rs changes in this PR in their totality, then I'll merge them back and discard the microbenchmark.

@hlinnaka
Copy link
Contributor

+1. It's probably fast enough anyway in the grand scheme of things, and 4994 will change it again anyway.

Comment on lines 206 to 212
blknum: u32,
off: usize,
buf: MemoizedPageWriteGuard,
}
struct MemoizedPageWriteGuard {
guard: page_cache::PageWriteGuard<'static>,
blknum: u32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a comment to explain why we have two 'blknum' fields

Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the changes in ephemeral_file.rs look good to me.

cs@devvm:[~/src/neon-work-2]: cargo bench -p utils --bench write_blob
    Finished bench [optimized + debuginfo] target(s) in 0.22s
     Running benches/write_blob.rs (target/release/deps/write_blob-f10f94bd2a09aa2a)
Gnuplot not found, using plotters backend
write_blob/old          time:   [17.859 ns 18.089 ns 18.279 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low severe
write_blob/pr-5004-writer
                        time:   [17.886 ns 18.040 ns 18.221 ns]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
write_blob/pr-4994-nopagecache
                        time:   [6.5882 ns 6.6204 ns 6.6529 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
@problame
Copy link
Contributor Author

forced inlining brings us on par with write_blob/old

Will merge with forced inlining.

@problame problame merged commit 46a80a8 into problame/ephemeral-write-blob-refactor Aug 17, 2023
@problame problame deleted the problame/ephemeral-write-blob-refactor--microbenchmark branch August 17, 2023 09:21
problame added a commit that referenced this pull request Aug 18, 2023
(This PR is the successor of #4984 )

## Summary

The current way in which `EphemeralFile` uses `PageCache` complicates
the Pageserver code base to a degree that isn't worth it.
This PR refactors how we cache `EphemeralFile` contents, by exploiting
the append-only nature of `EphemeralFile`.

The result is that `PageCache` only holds `ImmutableFilePage` and
`MaterializedPage`.
These types of pages are read-only and evictable without write-back.
This allows us to remove the writeback code from `PageCache`, also
eliminating an entire failure mode.

Futher, many great open-source libraries exist to solve the problem of a
read-only cache,
much better than our `page_cache.rs` (e.g., better replacement policy,
less global locking).
With this PR, we can now explore using them.

## Problem & Analysis

Before this PR, `PageCache` had three types of pages:

* `ImmutableFilePage`: caches Delta / Image layer file contents
* `MaterializedPage`: caches results of Timeline::get (page
materialization)
* `EphemeralPage`: caches `EphemeralFile` contents

`EphemeralPage` is quite different from `ImmutableFilePage` and
`MaterializedPage`:

* Immutable and materialized pages are for the acceleration of (future)
reads of the same data using `PAGE_CACHE_SIZE * PAGE_SIZE` bytes of
DRAM.
* Ephemeral pages are a write-back cache of `EphemeralFile` contents,
i.e., if there is pressure in the page cache, we spill `EphemeralFile`
contents to disk.

`EphemeralFile` is only used by `InMemoryLayer`, for the following
purposes:
* **write**: when filling up the `InMemoryLayer`, via `impl BlobWriter
for EphemeralFile`
* **read**: when doing **page reconstruction** for a page@lsn that isn't
written to disk
* **read**: when writing L0 layer files, we re-read the `InMemoryLayer`
and put the contents into the L0 delta writer
(**`create_delta_layer`**). This happens every 10min or when
InMemoryLayer reaches 256MB in size.

The access patterns of the `InMemoryLayer` use case are as follows:

* **write**: via `BlobWriter`, strictly append-only
* **read for page reconstruction**: via `BlobReader`, random
* **read for `create_delta_layer`**: via `BlobReader`, dependent on
data, but generally random. Why?
* in classical LSM terms, this function is what writes the
memory-resident `C0` tree into the disk-resident `C1` tree
* in our system, though, the values of InMemoryLayer are stored in an
EphemeralFile, and hence they are not guaranteed to be memory-resident
* the function reads `Value`s in `Key, LSN` order, which is `!=` insert
order

What do these `EphemeralFile`-level access patterns mean for the page
cache?

* **write**:
* the common case is that `Value` is a WAL record, and if it isn't a
full-page-image WAL record, then it's smaller than `PAGE_SIZE`
* So, the `EphemeralPage` pages act as a buffer for these `< PAGE_CACHE`
sized writes.
* If there's no page cache eviction between subsequent
`InMemoryLayer::put_value` calls, the `EphemeralPage` is still resident,
so the page cache avoids doing a `write` system call.
* In practice, a busy page server will have page cache evictions because
we only configure 64MB of page cache size.
* **reads for page reconstruction**: read acceleration, just as for the
other page types.
* **reads for `create_delta_layer`**:
* The `Value` reads happen through a `BlockCursor`, which optimizes the
case of repeated reads from the same page.
* So, the best case is that subsequent values are located on the same
page; hence `BlockCursor`s buffer is maximally effective.
* The worst case is that each `Value` is on a different page; hence the
`BlockCursor`'s 1-page-sized buffer is ineffective.
* The best case translates into `256MB/PAGE_SIZE` page cache accesses,
one per page.
    * the worst case translates into `#Values` page cache accesses
* again, the page cache accesses must be assumed to be random because
the `Value`s aren't accessed in insertion order but `Key, LSN` order.

## Summary of changes

Preliminaries for this PR were:

- #5003
- #5004 
- #5005 
  - uncommitted microbenchmark in #5011 

Based on the observations outlined above, this PR makes the following
changes:

* Rip out `EphemeralPage` from `page_cache.rs`
* Move the `block_io::FileId` to `page_cache::FileId`
* Add a `PAGE_SIZE`d buffer to the `EphemeralPage` struct.
  It's called `mutable_tail`.
* Change `write_blob` to use `mutable_tail` for the write buffering
instead of a page cache page.
* if `mutable_tail` is full, it writes it out to disk, zeroes it out,
and re-uses it.
* There is explicitly no double-buffering, so that memory allocation per
`EphemeralFile` instance is fixed.
* Change `read_blob` to return different `BlockLease` variants depending
on `blknum`
* for the `blknum` that corresponds to the `mutable_tail`, return a ref
to it
* Rust borrowing rules prevent `write_blob` calls while refs are
outstanding.
  * for all non-tail blocks, return a page-cached `ImmutablePage`
* It is safe to page-cache these as ImmutablePage because EphemeralFile
is append-only.

## Performance 

How doe the changes above affect performance?
M claim is: not significantly.

* **write path**:
* before this PR, the `EphemeralFile::write_blob` didn't issue its own
`write` system calls.
* If there were enough free pages, it didn't issue *any* `write` system
calls.
* If it had to evict other `EphemeralPage`s to get pages a page for its
writes (`get_buf_for_write`), the page cache code would implicitly issue
the writeback of victim pages as needed.
* With this PR, `EphemeralFile::write_blob` *always* issues *all* of its
*own* `write` system calls.
* Also, the writes are explicit instead of implicit through page cache
write back, which will help #4743
* The perf impact of always doing the writes is the CPU overhead and
syscall latency.
* Before this PR, we might have never issued them if there were enough
free pages.
* We don't issue `fsync` and can expect the writes to only hit the
kernel page cache.
* There is also an advantage in issuing the writes directly: the perf
impact is paid by the tenant that caused the writes, instead of whatever
tenant evicts the `EphemeralPage`.
* **reads for page reconstruction**: no impact.
* The `write_blob` function pre-warms the page cache when it writes the
`mutable_tail` to disk.
* So, the behavior is the same as with the EphemeralPages before this
PR.
* **reads for `create_delta_layer`**: no impact.
  * Same argument as for page reconstruction.
  * Note for the future:
* going through the page cache likely causes read amplification here.
Why?
* Due to the `Key,Lsn`-ordered access pattern, we don't read all the
values in the page before moving to the next page. In the worst case, we
might read the same page multiple times to read different `Values` from
it.
    * So, it might be better to bypass the page cache here.
    * Idea drafts:
      * bypass PS page cache + prefetch pipeline + iovec-based IO
* bypass PS page cache + use `copy_file_range` to copy from ephemeral
file into the L0 delta file, without going through user space
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants