pageserver: batch InMemoryLayer puts, remove need to sort items by LSN during ingest #8591

Merged (14 commits) on Aug 22, 2024

@jcsp (Collaborator) commented Aug 2, 2024

Problem/Solution

TimelineWriter::put_batch is simply a loop over individual puts. Each put acquires and releases locks, and checks for potentially starting a new layer. Batching these is more efficient, but more importantly unlocks future changes where we can pre-build serialized buffers much earlier in the ingest process, potentially even on the safekeeper (imagine a future model where some variant of DatadirModification lives on the safekeeper).
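
To make the locking cost concrete, here is a minimal sketch of the difference between looping over individual puts and taking the lock once per batch. `ToyLayer`, `LayerInner`, `put_batch_looped` and the `lock_acquisitions` counter are hypothetical stand-ins invented for this illustration; they are not the pageserver's `TimelineWriter` or `InMemoryLayer`.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};

// Hypothetical stand-ins, not the pageserver's real types.
type Key = u64;
type Lsn = u64;

#[derive(Default)]
struct LayerInner {
    buf: Vec<u8>,                       // value bytes, appended back to back
    index: BTreeMap<(Key, Lsn), usize>, // (key, lsn) -> offset into `buf`
}

#[derive(Default)]
struct ToyLayer {
    inner: Mutex<LayerInner>,
    lock_acquisitions: AtomicUsize, // counts how often the lock was taken
}

impl ToyLayer {
    fn lock(&self) -> MutexGuard<'_, LayerInner> {
        self.lock_acquisitions.fetch_add(1, Ordering::Relaxed);
        self.inner.lock().unwrap()
    }

    /// Per-value path: one lock round-trip per put.
    fn put(&self, key: Key, lsn: Lsn, value: &[u8]) {
        let mut inner = self.lock();
        let offset = inner.buf.len();
        inner.buf.extend_from_slice(value);
        inner.index.insert((key, lsn), offset);
    }

    /// "Batch" that is really a loop over individual puts.
    fn put_batch_looped(&self, batch: &[(Key, Lsn, Vec<u8>)]) {
        for (key, lsn, value) in batch {
            self.put(*key, *lsn, value);
        }
    }

    /// Batched path: take the lock once and write the whole batch.
    fn put_batch(&self, batch: &[(Key, Lsn, Vec<u8>)]) {
        let mut inner = self.lock();
        for (key, lsn, value) in batch {
            let offset = inner.buf.len();
            inner.buf.extend_from_slice(value);
            inner.index.insert((*key, *lsn), offset);
        }
    }
}
```

Both paths leave the layer in the same state; the batched one acquires the lock once instead of once per value.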

Ensuring that the values in put_batch are written to one layer also enables a simplification upstream, where we no longer need to write values in LSN-order. This saves us a sort, but also simplifies follow-on refactors to DatadirModification: we can store metadata keys and data keys separately at that level without needing to zip them together in LSN order later.
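
One way to see why arrival order can stop mattering once a whole batch lands in a single layer: if the layer's index is an ordered map keyed by key and LSN, the index comes out the same regardless of the order in which the values are inserted. The sketch below assumes such an index purely for illustration; `Index`, `build_index` and `get_at` are hypothetical names and do not describe the real in-memory layer.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for illustration only.
type Key = u64;
type Lsn = u64;
type Index = BTreeMap<Key, BTreeMap<Lsn, &'static str>>;

/// Build a per-key, LSN-ordered index from a batch in whatever order it arrives.
fn build_index(batch: &[(Key, Lsn, &'static str)]) -> Index {
    let mut index = Index::new();
    for &(key, lsn, value) in batch {
        index.entry(key).or_default().insert(lsn, value);
    }
    index
}

/// Return the newest value for `key` at or below `lsn`, if any.
fn get_at(index: &Index, key: Key, lsn: Lsn) -> Option<&'static str> {
    index.get(&key)?.range(..=lsn).next_back().map(|(_, v)| *v)
}
```

Because the ordered map does the ordering, the caller does not have to sort the batch before handing it over.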

Why?

In this PR, these changes are simply optimizations, but they are motivated by evolving the ingest path toward disentangling DatadirModification from Timeline. It may not be obvious how right now, but the general idea is that we'll end up with three phases of ingest:

  • A) Decode WAL records and build a DatadirModification with all the simple data contents already in a big serialized buffer, ready to write to an ephemeral layer <-- this part can be pipelined and parallelized, and done on a safekeeper!
  • B) Let that DatadirModification see a Timeline, so that it can also generate all the metadata updates that require a read-modify-write of existing pages
  • C) Dump the results of B into an ephemeral layer.
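
The three phases above can be sketched roughly as follows. Everything here is a toy model under stated assumptions: `PageWrite`, `ToyModification`, `ToyTimeline`, `decode`, `resolve_metadata` and `commit` are hypothetical names, and relation size stands in for "metadata that needs a read-modify-write". It is not the real DatadirModification or Timeline API.

```rust
use std::collections::HashMap;

// Every type here is a hypothetical stand-in for illustration.
type RelId = u32;
type BlockNo = u32;
type Lsn = u64;

/// A decoded WAL record reduced to a single page write.
struct PageWrite {
    rel: RelId,
    blkno: BlockNo,
    lsn: Lsn,
    image: Vec<u8>,
}

#[derive(Default)]
struct ToyModification {
    data_buf: Vec<u8>,                             // phase A: serialized page data
    data_index: Vec<(RelId, BlockNo, Lsn, usize)>, // phase A: offsets into data_buf
    new_rel_sizes: HashMap<RelId, BlockNo>,        // phase B: metadata updates
}

#[derive(Default)]
struct ToyTimeline {
    rel_sizes: HashMap<RelId, BlockNo>,
    layer_buf: Vec<u8>,
    layer_index: Vec<(RelId, BlockNo, Lsn, usize)>,
}

/// Phase A: needs no Timeline, so it could run in parallel or elsewhere.
fn decode(records: &[PageWrite]) -> ToyModification {
    let mut m = ToyModification::default();
    for r in records {
        m.data_index.push((r.rel, r.blkno, r.lsn, m.data_buf.len()));
        m.data_buf.extend_from_slice(&r.image);
    }
    m
}

/// Phase B: read existing state from the Timeline to derive metadata updates.
fn resolve_metadata(m: &mut ToyModification, timeline: &ToyTimeline) {
    for &(rel, blkno, _, _) in &m.data_index {
        let known = m
            .new_rel_sizes
            .get(&rel)
            .or_else(|| timeline.rel_sizes.get(&rel))
            .copied()
            .unwrap_or(0);
        if blkno + 1 > known {
            m.new_rel_sizes.insert(rel, blkno + 1); // relation grew
        }
    }
}

/// Phase C: dump the prepared buffer and metadata into the layer.
fn commit(m: ToyModification, timeline: &mut ToyTimeline) {
    let base = timeline.layer_buf.len();
    timeline.layer_buf.extend_from_slice(&m.data_buf);
    for (rel, blkno, lsn, offset) in m.data_index {
        timeline.layer_index.push((rel, blkno, lsn, base + offset));
    }
    timeline.rel_sizes.extend(m.new_rel_sizes);
}
```

Only `resolve_metadata` and `commit` touch the timeline; `decode` depends on nothing but its input records, which is the property that would let it move earlier in the pipeline.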

Related: #8452

Caveats

Doing a big monolithic buffer of values to write to disk is ordinarily an anti-pattern: we prefer nice streaming I/O. However:

  • In future, when we do this first decode stage on the safekeeper, it would be inefficient to serialize a Vec of Value, and then later deserialize it just to add blob size headers while writing into the ephemeral layer format. The idea is that for bulk write data, we will serialize exactly once.
  • The monolithic buffer is a stepping stone to pipelining more of this: by serializing earlier (rather than at the final put_value), we will be able to parallelize the WAL decoding and bulk serialization of data page writes.
  • The ephemeral layer's buffered writer already stalls writes while it waits to flush. So yes, we'll stall for a couple of milliseconds to write a couple of megabytes, but we already have stalls like this, just distributed across smaller writes.
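
A minimal sketch of the "serialize exactly once" idea: values are written into a staging buffer with their size headers at decode time, so the later write into the layer is a plain byte copy plus an offset rebase. The 4-byte little-endian length header and the names `append_blob`, `read_blob` and `write_batch` are assumptions made for this example, not the actual ephemeral layer format.

```rust
/// Append `value` to `buf` behind a 4-byte little-endian length header.
/// (The header format is an assumption for this sketch.)
/// Returns the offset of the header within `buf`.
fn append_blob(buf: &mut Vec<u8>, value: &[u8]) -> usize {
    let offset = buf.len();
    buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
    buf.extend_from_slice(value);
    offset
}

/// Read back the blob whose header starts at `offset`.
fn read_blob(buf: &[u8], offset: usize) -> &[u8] {
    let mut header = [0u8; 4];
    header.copy_from_slice(&buf[offset..offset + 4]);
    let len = u32::from_le_bytes(header) as usize;
    &buf[offset + 4..offset + 4 + len]
}

/// Append an already-serialized batch to the layer with a single copy.
/// Returns the blob offsets rebased to their position in `layer`.
fn write_batch(layer: &mut Vec<u8>, staged: &[u8], offsets: &[usize]) -> Vec<usize> {
    let base = layer.len();
    layer.extend_from_slice(staged);
    offsets.iter().map(|off| base + off).collect()
}
```

Since the staged bytes are already in their final layout, nothing needs to be deserialized and re-encoded at write time.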

Benchmarks

This PR is primarily a stepping stone to safekeeper ingest filtering, but it also provides a modest efficiency improvement to the wal_recovery part of test_bulk_insert: 18.981 s with this PR vs. 23.586 s on main, roughly a 20% reduction.

test_bulk_insert with this PR:

test_bulk_insert[neon-release-pg16].insert: 23.659 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 626 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.981 s
test_bulk_insert[neon-release-pg16].compaction: 0.055 s

vs. tip of main:
test_bulk_insert[neon-release-pg16].insert: 24.001 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 604 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 23.586 s
test_bulk_insert[neon-release-pg16].compaction: 0.054 s

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Aug 2, 2024

github-actions bot commented Aug 2, 2024

2198 tests run: 2134 passed, 0 failed, 64 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_hot_standby_gc[True]: release
  • test_ondemand_wal_download_in_replication_slot_funcs: release

Code coverage* (full report)

  • functions: 32.4% (7241 of 22331 functions)
  • lines: 50.4% (58546 of 116115 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
67ef0e4 at 2024-08-22T10:14:09.856Z :recycle:

@jcsp jcsp changed the title from "pageserver: avoid a spurious sort during ingest" to "pageserver: batch InMemoryLayer puts, remove need to sort items by LSN during ingest" Aug 2, 2024
@jcsp jcsp force-pushed the jcsp/ingest-refactor-pt0 branch from 933cf8f to 30bae03 Compare August 2, 2024 18:12
@jcsp jcsp force-pushed the jcsp/ingest-refactor-pt0 branch 3 times, most recently from d559ee2 to cb393b1 Compare August 6, 2024 14:58
@jcsp jcsp marked this pull request as ready for review August 14, 2024 09:57
@jcsp jcsp requested a review from a team as a code owner August 14, 2024 09:57
@jcsp jcsp requested review from skyzh, problame and VladLazar and removed request for skyzh August 14, 2024 09:57
@jcsp jcsp force-pushed the jcsp/ingest-refactor-pt0 branch from cb393b1 to 1c0264d Compare August 14, 2024 10:41
pageserver/src/tenant/storage_layer/inmemory_layer.rs (review thread: outdated, resolved)
pageserver/src/tenant/ephemeral_file.rs (review thread: resolved)
pageserver/src/tenant/ephemeral_file.rs (review thread: outdated, resolved)
pageserver/src/pgdatadir_mapping.rs (review thread: outdated, resolved)
pageserver/src/tenant/storage_layer/inmemory_layer.rs (review thread: outdated, resolved)
@jcsp jcsp requested a review from VladLazar August 15, 2024 19:20
@jcsp (Collaborator, Author) commented Aug 16, 2024

This should be good to go, but I plan on merging it after Monday's release branch so that it gets a week in staging.

@VladLazar (Contributor) left a comment

Looks good to me - just nits

pageserver/src/pgdatadir_mapping.rs (review thread: resolved)
pageserver/src/pgdatadir_mapping.rs (review thread: resolved)
pageserver/src/tenant/storage_layer/inmemory_layer.rs (review thread: outdated, resolved)
@jcsp jcsp enabled auto-merge (squash) August 22, 2024 09:45
@jcsp jcsp merged commit 7c74112 into main Aug 22, 2024
63 checks passed
@jcsp jcsp deleted the jcsp/ingest-refactor-pt0 branch August 22, 2024 10:04
jcsp added a commit that referenced this pull request Sep 3, 2024
…8621)

## Problem

Currently, DatadirModification keeps a key-indexed map of all pending
writes, even though we (almost) never need to read back dirty pages for
anything other than metadata pages (e.g. relation sizes).

Related: #6345

## Summary of changes

- commit() modifications before ingesting database creation wal records,
so that they are guaranteed to be able to get() everything they need
directly from the underlying Timeline.
- Split dirty pages in DatadirModification into pending_metadata_pages
and pending_data_pages. The data ones don't need to be in a
key-addressable format, so they just go in a Vec instead.
- Special case handling of zero-page writes in DatadirModification,
putting them in a map which is flushed at the end of a WAL record. This
handles the case where during ingest, we might first write a zero page,
and then ingest a postgres write to that page. We used to do this via
the key-indexed map of writes, but in this PR we change the data page
write path to not bother indexing these by key.
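
A rough sketch of the split described in the list above, under stated assumptions: `pending_metadata_pages`, `pending_data_pages` and `on_record_end` are names taken from this description, while `ToyModification`, `pending_zero_data_pages`, the method signatures and the way a real write supersedes a staged zero page are hypothetical choices made for illustration.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration only.
type Key = u64;
type Lsn = u64;
const PAGE_SZ: usize = 8192; // illustrative page size

#[derive(Default)]
struct ToyModification {
    /// Metadata must be readable back by key, so it stays in a map.
    pending_metadata_pages: HashMap<Key, Vec<u8>>,
    /// Bulk data pages are never read back, so a plain Vec is enough.
    pending_data_pages: Vec<(Key, Lsn, Vec<u8>)>,
    /// Zero pages staged during the current WAL record (hypothetical field).
    pending_zero_data_pages: HashMap<Key, Lsn>,
}

impl ToyModification {
    fn put_metadata(&mut self, key: Key, value: Vec<u8>) {
        self.pending_metadata_pages.insert(key, value);
    }

    fn get_metadata(&self, key: Key) -> Option<&[u8]> {
        self.pending_metadata_pages.get(&key).map(|v| v.as_slice())
    }

    fn put_zero_page(&mut self, key: Key, lsn: Lsn) {
        self.pending_zero_data_pages.insert(key, lsn);
    }

    fn put_data(&mut self, key: Key, lsn: Lsn, image: Vec<u8>) {
        // A real write to the page supersedes a zero page staged earlier
        // in the same record (one plausible policy, assumed here).
        self.pending_zero_data_pages.remove(&key);
        self.pending_data_pages.push((key, lsn, image));
    }

    /// Flush zero pages that were not overwritten during this record.
    fn on_record_end(&mut self) {
        for (key, lsn) in self.pending_zero_data_pages.drain() {
            self.pending_data_pages.push((key, lsn, vec![0u8; PAGE_SZ]));
        }
    }
}
```

The small per-record map is the only place where data pages are still addressed by key; everything else on the data path is append-only.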

My least favorite thing about this PR is that I needed to change the
DatadirModification interface to add the on_record_end call. This is not
very invasive because there's really only one place we use it, but it
changes the object's behaviour from being clearly an aggregation of many
records to having some per-record state. I could avoid this by
implicitly doing the work when someone calls set_lsn or commit -- I'm
open to opinions on whether that's cleaner or dirtier.

## Performance

There may be some efficiency improvement here, but the primary
motivation is to enable an earlier stage of ingest to operate without
access to a Timeline. The `pending_data_pages` part is the "fast path"
bulk write data that can in principle be generated without a Timeline,
in parallel with other ingest batches, and ultimately on the safekeeper.

`test_bulk_insert` on AX102 shows approximately the same results as in
the previous PR #8591:

```
------------------------------ Benchmark results -------------------------------
test_bulk_insert[neon-release-pg16].insert: 23.577 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 637 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.264 s
test_bulk_insert[neon-release-pg16].compaction: 0.052 s
```