pageserver: use a single tokio runtime #6555
Merged
Conversation
This was referenced Jan 31, 2024
Base automatically changed from problame/integrate-tokio-epoll-uring/benchmarking/2024-01-31-prs/refactor-signals to main on January 31, 2024 23:25
problame added a commit that referenced this pull request on Jan 31, 2024
This refactoring makes it easier to experimentally replace BACKGROUND_RUNTIME with a single-threaded runtime. Found this useful [during benchmarking](#6555).
2718 tests run: 2581 passed, 0 failed, 137 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results.
a7fd969 at 2024-03-23T16:28:43.276Z :recycle:
problame force-pushed the problame/one-runtime branch from ac88917 to 3426680 on March 22, 2024 18:14
problame changed the title from "[DO NOT MERGE] only one tokio runtime in pageserver" to "only one tokio runtime in pageserver" on Mar 22, 2024
problame changed the title from "only one tokio runtime in pageserver" to "only have one tokio runtime in pageserver" on Mar 22, 2024
problame force-pushed the problame/one-runtime branch from 3426680 to 9e2a805 on March 22, 2024 18:56
(I suspect something about shutdown signalling is the root cause of the test_lazy_attach_activation failure in CI, which I can't reproduce locally.)
This reverts commit b546031.
problame force-pushed the problame/one-runtime branch from 02f851f to 945f3b5 on March 23, 2024 13:50
Before this patch, we would leave the `timeline-calculate-logical-size-pause` failpoint in `pause` mode at the end of the test. With the switch to a single runtime, somehow we'd end up in a place where the pageserver was half shut down while the failpoint's spawn_blocking thread was waiting for the `off` event that never arrived.

Failures were reproducible quite well in CI: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6555/8396322235/index.html#suites/c39429f093f87547b2a3b0943e2522d9/4dacb1efb232b98/

I couldn't repro it locally. I managed to repro it once on an i3en.3xlarge, where I then attached gdb to capture the backtrace. For posterity: https://www.notion.so/neondatabase/debug-test_lazy_attach_activation-teardown-hang-as-part-of-PR-6555-421cb61dc45d4d4e90220c86567f50da?pvs=4
This reverts commit 53e9173.
Possibly the last commit was a red herring (still a good cleanup though). However, this one here is a definite executor stall at the failpoint. And it makes sense that it would cause the shutdown timeouts with the unified runtime, because the walreceiver code previously ran on a separate runtime.
Methodology:
1. search for `"pause"` and fix as appropriate
2. search for `=pause` => no further fixes found

Future work: audit all uses of `fail::fail_point!` / auto-convert them all to use `pausable_failpoint!` (or implement our own failpoints library). A sketch of the idea is shown below.
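For illustration, here is a minimal sketch of the idea behind a pausable failpoint, assuming the `fail` crate's `fail_point!` macro; the actual `pausable_failpoint!` macro in the codebase may differ in its details.

```rust
// Sketch only: a failpoint configured with "pause" blocks the thread it runs on
// until the failpoint is switched off. Executing it directly inside an async task
// would stall a tokio executor thread; hopping to spawn_blocking keeps the
// executor free while the failpoint is paused.
async fn pausable_failpoint_sketch(name: &'static str) {
    tokio::task::spawn_blocking(move || {
        // If `name` is configured as "pause", this blocks a blocking-pool thread,
        // not an executor thread.
        fail::fail_point!(name);
    })
    .await
    .expect("spawn_blocking task panicked");
}
```

A test can then flip such a failpoint with `fail::cfg("<name>", "pause")` and later `fail::cfg("<name>", "off")` without risking a stalled executor thread.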
problame force-pushed the problame/one-runtime branch from 945f3b5 to 6dca7c6 on March 23, 2024 14:00
…ional" This reverts commit 0fccba2.
problame changed the title from "only have one tokio runtime in pageserver" to "only use a single tokio runtime in pageserver" on Mar 23, 2024
problame changed the title from "only use a single tokio runtime in pageserver" to "pageserver: use a single tokio runtime" on Mar 23, 2024
problame requested review from VladLazar and koivunej and removed the review request for VladLazar on March 23, 2024 16:44
koivunej approved these changes on Mar 23, 2024
Nightly pgbench runs in staging look unchanged 👍
problame added a commit that referenced this pull request on Mar 26, 2024
The nightly benchmarks looked unchanged, but the production-like cloudbench in eu-west-1 is showing a bunch of not-yet-fully-understood symptoms: https://neondb.slack.com/archives/C06K38EB05D/p1711447656987039?thread_ts=1711445451.915969&cid=C06K38EB05D

So, revert the commit and ship a release without it.

This reverts commit 3220f83.
problame added a commit that referenced this pull request on Mar 26, 2024
problame added a commit that referenced this pull request on Apr 8, 2024
This PR is an off-by-default revision v2 of the (since-reverted) PR #6555 / commit `3220f830b7fbb785d6db8a93775f46314f10a99b`. See that PR for details on why running with a single runtime is desirable and why we should be ready.

We reverted #6555 because it showed regressions in prodlike cloudbench; see the revert commit message `ad072de4209193fd21314cf7f03f14df4fa55eb1` for more context.

This PR makes it an opt-in choice via an env var. The default is to use the 4 separate runtimes that we have today, so there shouldn't be any performance change.

I tested manually that the env var & added metric work.

```
# undefined env var => no change to before this PR, uses 4 runtimes
./target/debug/neon_local start
# defining the env var enables one-runtime mode, value defines that one runtime's configuration
NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:1 ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:2 ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:default ./target/debug/neon_local start
```

I want to use this change to do more manual testing and potentially testing in staging.

Future Work
-----------

Testing / deployment ergonomics would be better if this were a variable in `pageserver.toml`. It can be done, but I don't need it right now, so let's stick with the env var.
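As a sketch of how such an env var value could be mapped to a tokio runtime builder — the value grammar (`current_thread`, `multi_thread:<n>`, `multi_thread:default`) is taken from the examples above, but the parsing code is illustrative and not the actual pageserver implementation:

```rust
use tokio::runtime::{Builder, Runtime};

// Illustrative parser for NEON_PAGESERVER_USE_ONE_RUNTIME values, based on the
// commit message examples; not the real pageserver code.
fn build_one_runtime(value: &str) -> std::io::Result<Runtime> {
    let mut builder = match value {
        "current_thread" => Builder::new_current_thread(),
        "multi_thread:default" => Builder::new_multi_thread(),
        v if v.starts_with("multi_thread:") => {
            let n: usize = v["multi_thread:".len()..]
                .parse()
                .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidInput, e))?;
            let mut b = Builder::new_multi_thread();
            b.worker_threads(n);
            b
        }
        other => {
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidInput,
                format!("unsupported runtime config: {other}"),
            ))
        }
    };
    builder.enable_all().build()
}
```

An unset env var would skip this path entirely and keep today's four runtimes.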
Before this PR, each core had 3 executor threads from 3 different runtimes. With this PR, we have just one runtime, with one thread per core. Switching to a single tokio runtime should reduce that effective over-commit of CPU and, in theory, help with tail latencies -- iff all tokio tasks are well-behaved and yield to the runtime regularly.
Are All Tasks Well-Behaved? Are We Ready?
Sadly, there doesn't seem to be good out-of-the-box tokio tooling to answer this question.
We believe all tasks are well-behaved in today's code base, as of the switch to `virtual_file_io_engine = "tokio-epoll-uring"` in production (https://github.com/neondatabase/aws/pull/1121). The only remaining executor-thread-blocking code is walredo and some filesystem namespace operations.

Work on filesystem namespace operations is being tracked in #6663; they are not considered likely to actually block at this time.
Regarding walredo, it currently does a blocking `poll` for read/write to the pipe file descriptors we use for IPC with the walredo process. There is an ongoing experiment to make walredo async (#6628), but it needs more time because there are surprisingly tricky trade-offs that are articulated in that PR's description (which itself is still WIP). What's relevant for this PR is that the observed walredo durations (`pageserver_wal_redo_seconds_bucket`) are mostly below 1ms, with some stalls in the tens of milliseconds. The ones below 1ms are below our current threshold for when we start thinking about yielding to the executor.

The tens-of-milliseconds stalls aren't great, but, not least because of the implicit overcommit of CPU by the three runtimes, we can't be sure whether these tens of milliseconds are inherently necessary to do the walredo work or whether we could be faster if there was less contention for CPU.
Since walredo is always CPU-bound work, walredo processes will always compete with the executor threads.
We could yield, using async walredo, but then we hit the trade-offs explained in that PR.
tl;dr: the risk of stalling executor threads through blocking walredo seems low, and switching to one runtime cleans up one potential source for higher-than-necessary stall times (explained in the previous paragraphs).
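To make the stall mode concrete, here is a hedged sketch of what such a blocking pipe `poll` looks like; the fd names are hypothetical and this is not the actual walredo IPC code, which may be structured differently.

```rust
use std::os::unix::io::RawFd;

// Sketch: a synchronous poll() on the walredo child's pipe fds (requires the
// `libc` crate). When called from an async task, the executor thread is occupied
// for the full duration of the wait. Fd names are illustrative.
fn wait_for_walredo_pipes(stdin_fd: RawFd, stdout_fd: RawFd) -> std::io::Result<()> {
    let mut fds = [
        libc::pollfd { fd: stdin_fd, events: libc::POLLOUT, revents: 0 },
        libc::pollfd { fd: stdout_fd, events: libc::POLLIN, revents: 0 },
    ];
    // Blocks the calling (executor) thread until one of the fds becomes ready.
    let ret = unsafe { libc::poll(fds.as_mut_ptr(), fds.len() as libc::nfds_t, -1) };
    if ret < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```

As noted above, these waits are usually well below 1ms, which is why keeping them blocking is considered acceptable for now.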
Code Changes
- Replace the separate runtimes with a single `THE_RUNTIME`.
- Adjust `task_mgr` accordingly.
- Fix failpoints where `pausable_failpoint!` should have been used. We encountered some actual failures because of this, e.g., hung `get_metric()` calls during test teardown that would client-timeout after 300s.

As indicated by the comment above `THE_RUNTIME`, we could take this clean-up further. But before we create so much churn, let's first validate that there's no perf regression.
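For orientation, a minimal sketch of what a single process-wide runtime can look like (assuming `once_cell`; the actual `THE_RUNTIME` definition in the pageserver may differ):

```rust
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

// Sketch: one shared runtime with tokio's default of one worker thread per core,
// replacing the previous per-purpose runtimes. Illustrative, not the real definition.
pub static THE_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .thread_name("pageserver-runtime")
        .build()
        .expect("failed to build the single pageserver runtime")
});
```

Call sites then spawn onto `THE_RUNTIME` (directly or via `task_mgr`) instead of choosing among several runtimes.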
Performance
We will test this in staging using the various nightly benchmark runs.
However, the worst-case impact of this change is likely compaction (=> image layer creation) competing with compute requests.
Image layer creation work can't be easily generated & repeated quickly by pagebench.
So, we'll simply watch getpage & basebackup tail latencies in staging.
Additionally, I have done manual benchmarking using pagebench.
Report: https://neondatabase.notion.site/2024-03-23-oneruntime-change-benchmarking-22a399c411e24399a73311115fb703ec?pvs=4
Tail latencies and throughput are marginally better (no regression = good).
Except in a workload with 128 clients against one tenant.
There, the p99.9 and p99.99 getpage latency is about 2x worse (at slightly lower throughput).
A dip in throughput every 20s (`compaction_period`) is clearly visible, and is probably responsible for that worse tail latency.
This has potential to improve with async walredo, and is an edge case workload anyway.
Future Work
Take the clean-up around `THE_RUNTIME` indicated in Code Changes further, once we've validated that there's no perf regression.