pageserver: call maybe_freeze_ephemeral_layer from a dedicated task #7594

Merged: 3 commits from jcsp/issue-6939-background-loop into main on May 6, 2024

Conversation

jcsp (Collaborator) commented May 2, 2024

Problem

In testing of the earlier fix for OOMs under heavy write load (#7218), we saw that the limit on ephemeral layer size wasn't being reliably enforced. That was diagnosed as being due to overwhelmed compaction loops: most tenants were waiting on the semaphore for background tasks, so the function that proactively rolls layers wasn't being run often enough.

Related: #6939

Summary of changes

  • Create a new per-tenant background loop for "ingest housekeeping", which invokes maybe_freeze_ephemeral_layer() without taking the background task semaphore.
  • Downgrade a log line in maybe_freeze_ephemeral_layer from INFO to DEBUG: it turns out to be pretty common in the field.

There's some discussion on the issue (#6939 (comment)) about alternatives for calling maybe_freeze_ephemeral_layer periodically without it getting stuck behind compaction. A whole task just for this feels like kind of a big hammer, but we may in future find that there are other pieces of lightweight housekeeping that we want to do here too.
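For illustration, here is a minimal sketch of the shape such a dedicated loop could take. All names and signatures are made up for this example (and tokio/tokio-util are assumed); it is not the actual pageserver code, just the structure: a per-tenant task that sleeps, runs the freeze check, and never touches the background-task semaphore.

```rust
use std::time::Duration;

use tokio_util::sync::CancellationToken;

// Hypothetical stand-in for the real per-timeline check, which freezes the
// open ephemeral layer once it grows past its rolling threshold.
async fn maybe_freeze_ephemeral_layer() {
    // The real logic lives in the pageserver's timeline code.
}

// One lightweight "ingest housekeeping" task per tenant: wake up on a timer,
// run the freeze check, and exit cleanly on shutdown.
async fn ingest_housekeeping_loop(cancel: CancellationToken, period: Duration) {
    loop {
        tokio::select! {
            // Exit promptly when the tenant shuts down or is detached.
            _ = cancel.cancelled() => return,
            _ = tokio::time::sleep(period) => {}
        }

        // Deliberately not gated by the background-task semaphore: the check
        // is cheap, and it must not queue behind a long compaction pass.
        maybe_freeze_ephemeral_layer().await;
    }
}
```

The point of the dedicated task is simply that there is no shared semaphore anywhere on this path, so a backed-up compaction queue can no longer delay layer freezing.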

Why is it okay to call maybe_freeze_ephemeral_layer outside of the background tasks semaphore?

  • This is the same work we would do anyway if we receive writes from the safekeeper, just done a bit sooner.
  • The period of the new task is generously jittered (+/- 5%), so when the ephemeral layer size tips over the threshold we shouldn't see an excessively aggressive thundering herd of layer freezes, and only layers larger than the mean layer size will be frozen (see the sketch after this list).
  • All that said, this is an imperfect approach that relies on having a generous amount of RAM to dip into when we need to freeze somewhat urgently. It would be nice in future to also block compaction/GC when we recognize resource stress and need to do other work (like layer freezing) to reduce memory footprint.
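To make the jitter and the mean-size cutoff from the second bullet concrete, here is a small illustrative sketch; the function names and the exact formula are assumptions for the example, not the pageserver's real code.

```rust
use std::time::Duration;

use rand::Rng;

// Jitter a base period by +/- 5% so per-tenant housekeeping wakeups spread
// out instead of all firing (and freezing layers) at the same instant.
fn jittered(base: Duration) -> Duration {
    let factor = rand::thread_rng().gen_range(0.95..=1.05);
    base.mul_f64(factor)
}

// Freeze decision under global size pressure: only layers bigger than the
// mean open-layer size get frozen, so the largest offenders are trimmed
// first rather than every open layer freezing at once.
fn should_freeze(layer_size: u64, open_layer_sizes: &[u64]) -> bool {
    if open_layer_sizes.is_empty() {
        return false;
    }
    let mean = open_layer_sizes.iter().sum::<u64>() / open_layer_sizes.len() as u64;
    layer_size > mean
}
```

Jitter plus the mean-size cutoff is what keeps a global threshold breach from turning into a synchronized freeze across every tenant.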

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist.

github-actions bot commented May 2, 2024

2886 tests run: 2765 passed, 0 failed, 121 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 28.0% (6598 of 23554 functions)
  • lines: 46.7% (46856 of 100434 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
94ee8cb at 2024-05-03T14:13:17.872Z

@jcsp changed the title from "pageserver: move test_sharding_split_failures to its own task" to "pageserver: call maybe_freeze_ephemeral_layer from a dedicated task" on May 3, 2024
@jcsp marked this pull request as ready for review May 3, 2024 16:59
@jcsp requested a review from a team as a code owner May 3, 2024 16:59
@jcsp requested a review from arssher May 3, 2024 16:59
@jcsp merged commit 3764dd2 into main May 6, 2024
55 checks passed
@jcsp deleted the jcsp/issue-6939-background-loop branch May 6, 2024 13:07
conradludgate pushed a commit that referenced this pull request May 8, 2024