tokio-epoll-uring: root-cause hung pageserver rust unit tests #6373
Initial conclusions were posted in the thread where the first hang happened: #6355 (comment). When it later hung again, @koivunej and I spent more time investigating it on the live system. We attached a debugger to the hung processes. We see one thread from the Rust test harness (fn main), one for the running test (a single-threaded runtime due to tokio::main), and many for the BACKGROUND_RUNTIME.
Both the single-threaded runtime and the multi-threaded runtime were looping in the linked code. When break-pointing on that code in the single-threaded runtime and then looking at the backtrace, the result is weird because we're pretty sure (should probably confirm this again) that the only task executing on the runtimes was the poller task. But the poller task is a spawned task, not the future being block_on'd.
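To make that structural point concrete, here is a minimal, hedged sketch (illustrative names only, not the actual tokio-epoll-uring or pageserver code): on a current-thread runtime, block_on drives the test body, while the poller is a tokio::spawn'd task, so a breakpoint in the runtime's poll loop never shows the poller as the block_on'd future.

```rust
// Minimal, hedged sketch (illustrative names only): the future passed to block_on
// is the test body, while the poller is a *spawned* task, so a backtrace taken in
// the runtime's park/poll loop never shows it as the block_on'd future. This
// program hangs by construction, mirroring the observed state.
fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();
    rt.block_on(async {
        // stand-in for the poller task: spawned, not what block_on is driving
        let poller: tokio::task::JoinHandle<()> = tokio::spawn(async {
            loop {
                // stand-in for "wait for io_uring completions that never arrive"
                tokio::time::sleep(std::time::Duration::from_millis(100)).await;
            }
        });
        // stand-in for the test body, waiting on work the poller never finishes
        poller.await.unwrap();
    });
}
```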
|
We decided that the way forward is to
|
Progress update:
|
Status update: the Rust unit test failures are due to OOMs. Thanks @koivunej for #6389. Example output from a cherry-pick of #6389: https://github.com/neondatabase/neon/actions/runs/7583055464/job/20663745994
I'll investigate that. Questions to answer:
|
I hijacked a
=> 64 KiB of locked memory per process.
We're allocating about 1 KiB per tokio-epoll-uring thread-local system. Also, here's the set of links I read:
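Putting the two numbers above together gives a rough back-of-the-envelope bound on how many thread-local systems one test process can create before io_uring setup fails with ENOMEM on a kernel that still accounts the rings against RLIMIT_MEMLOCK. This is only an estimate derived from the figures quoted above, not a measurement:

```rust
// Rough estimate using the two numbers above (64 KiB default RLIMIT_MEMLOCK,
// ~1 KiB of locked memory per thread-local system); the real per-ring accounting
// depends on ring sizes and on the kernel version, as discussed below.
fn main() {
    let rlimit_memlock_bytes: u64 = 64 * 1024;
    let approx_bytes_per_system: u64 = 1024;
    let approx_systems_before_enomem = rlimit_memlock_bytes / approx_bytes_per_system;
    println!("~{approx_systems_before_enomem} systems per process before io_uring setup hits ENOMEM");
}
```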
(@koivunej will investigate why the panics cause the test suite to hang instead of exiting.)
I was looking into this further.

**Reproducing the test hang**

In the tokio-epoll-uring crate, the hang can be reproduced with the following patch, which makes every System launch after the first one panic:

```diff
diff --git a/tokio-epoll-uring/src/system/lifecycle.rs b/tokio-epoll-uring/src/system/lifecycle.rs
index 82d6d22..0ff6ce8 100644
--- a/tokio-epoll-uring/src/system/lifecycle.rs
+++ b/tokio-epoll-uring/src/system/lifecycle.rs
@@ -50,6 +50,10 @@ impl System {
     pub(crate) async fn launch_with_testing(testing: Option<PollerTesting>) -> SystemHandle {
         let id = SYSTEM_ID.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
+        if id != 0 {
+            panic!("foobar");
+        }
+
         // TODO: this unbounded channel is the root of all evil: unbounded queue for IOPS; should provie app option to back-pressure instead.
         let (submit_side, poller_ready_fut) = {
             let (slots_submit_side, slots_completion_side, slots_poller) = super::slots::new(id);
```

Given that there are many flushes in neon/pageserver/src/tenant/timeline.rs (line 2639 at f003dd6), hanging while waiting for the response from the flush task is in line with what we were observing in our debugging sessions. Sometimes we saw more runtimes running the 100ms tasks; I think this could be reproduced by allowing creation of two systems. We should of course use move semantics on the channel used for flush requests/completions, so that a channel-closed error is propagated on panics -- this sort of confirms my earlier recollection that the flush task does not behave well with panics, from back when the idea of lock poisoning via asserts was thrown around. But apart from that, there is nothing except the mlock limit to fix on the io_uring side.
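A minimal sketch of the move-semantics idea (hypothetical channel and task names, not the actual pageserver flush code): once the sender half is moved into the flush task, a panic in that task drops the sender, and the waiting side sees a channel-closed error instead of pending forever.

```rust
use tokio::sync::oneshot;

// Sketch with hypothetical names (not the actual pageserver flush-task code):
// the flush task owns the moved-in sender; if it panics before sending, the
// sender is dropped and the waiter gets a channel-closed error instead of
// hanging forever.
#[tokio::main]
async fn main() {
    let (completion_tx, completion_rx) = oneshot::channel::<()>();
    let flush_task: tokio::task::JoinHandle<()> = tokio::spawn(async move {
        let _owned = completion_tx; // sender moved into the task
        // stand-in for the flush work; panicking here drops `_owned`
        panic!("flush task died");
    });
    match completion_rx.await {
        Ok(()) => println!("flush completed"),
        Err(_recv_error) => println!("flush task went away; propagating an error instead of hanging"),
    }
    // the JoinError carries the panic if we want to inspect or log it
    let _ = flush_task.await;
}
```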
I raised the ulimit during test runs (commit in the integration branch). Some processing of the build-debug test output:
The walredo test failures are because we actually invoke walredo in those tests and it produces errors, probably because we don't clean up the shmem. Now running without the raised ulimit to confirm that raising it is indeed what fixes the issue. Commit in the integration branch.
Weird, it's the same set of tests that's failing.
Logging further discussion regarding the remaining test failures: I noticed I have dangling shmem segments on my workstation after repeated test runs. Since we have only a few walcraft tests, it seems not all of them leak, but at least these leak:
v15 and v16 do not seem to leak every time, but repeated runs show they leak as well.
|
I tried to repro the mlock OOM on my local Debian bullseye system, using the following test case, without success.

```rust
#[tokio::test]
async fn hitting_memlock_limit_does_not_panic() {
    let (soft, hard) =
        nix::sys::resource::getrlimit(nix::sys::resource::Resource::RLIMIT_MEMLOCK).unwrap();
    let low_enough_soft_limit_so_the_test_is_fast_bytes = 64 * 1024;
    nix::sys::resource::setrlimit(
        nix::sys::resource::Resource::RLIMIT_MEMLOCK,
        low_enough_soft_limit_so_the_test_is_fast_bytes,
        hard,
    )
    .unwrap();
    scopeguard::defer!({
        nix::sys::resource::setrlimit(nix::sys::resource::Resource::RLIMIT_MEMLOCK, soft, hard)
            .unwrap();
    });
    let mut systems = Vec::new();
    loop {
        match System::launch().await {
            Ok(system) => {
                // use the uring in case the memory is allocated lazily
                let ((), res) = system.nop().await;
                res.unwrap();
                systems.push(system); // keep alive until end of test
            }
            Err(e) => match e {
                crate::system::lifecycle::LaunchResult::IoUringBuild(e) => {
                    assert_eq!(e.kind(), std::io::ErrorKind::OutOfMemory);
                    // run this test with --test-threads=1 or nextest to get predictable results for systems.len() under a given ulimit
                    println!("hit limit after {} iterations", systems.len());
                    return;
                }
            },
        }
    }
}
```

That sent me down a rabbit hole, spelunking in the kernel git history and liburing to see whether anything changed about mlock: https://github.com/axboe/liburing/blob/63342497bec87b72d4b415c8bed930c15ff83b0d/README#L37-L39
OK, interesting: from 5.12 on, the io_uring CQ and SQ themselves aren't accounted as locked memory anymore. After some spelunking, I found the relevant commit.
Now, on bullseye, we have kernel 5.10. (For posterity, the spelunking started from the liburing README linked above and followed the kernel history from there.) On my system, where I can't repro the mlock OOM, I'm running one 5.10.X kernel; the system where we saw the OOM is a runner on a different 5.10.X kernel.
So, these systems are running different kernels (the architecture difference doesn't matter; the relevant parts of io_uring are architecture-independent code). The version numbers work as follows: 5.10.X is the X-th stable/LTS patch release on top of the 5.10 mainline release, and stable releases are supposed to contain only backported fixes. Let's ignore that Debian might apply additional patches, and just compare the upstream LTS versions.
WTF. Look at the io_uring changes in the Debian package changelog:
OK, we know that LTS kernels sometimes don't just cherry-pick patches but deviate significantly from the original X.Y release. Let's go through all the commits on the 5.10 LTS branch and search for the relevant change.
The only commit I found is:
That's interesting.
Let's see which LTS releases contain that commit
NICE!
Will draw conclusions after lunch. |
…eatures (#34): I looked into `dontfork()` because of neondatabase/neon#6373; I don't think it matters for that particular issue, but it's still unsafe to share the memory mappings with another process, because the synchronization primitives that tokio-epoll-uring wraps around them wouldn't be shared across processes. Also take the opportunity to review io_uring feature flags and make sure they're there, or produce some paper trail explaining why we don't care.
- control via env var PAGESERVER_VIRTUAL_FILE_IO_ENGINE
- if an io engine other than std-fs is used, it shows up in the test name; this is so that we can continue to use the flaky-tests database
- raise the memlock limit and, while at it, also raise the shmem limit for the Rust tests. It's needed on our older runners that use an older 5.10.X LTS kernel, where the io_uring SQ and CQ still counted towards the rlimit; see #6373 (comment) for details.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
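For illustration, a hedged sketch of what env-var-driven engine selection can look like; the `IoEngineKind` enum, its variants, and the std-fs default are assumptions for this example, not the pageserver's actual types:

```rust
// Hedged sketch, not the pageserver implementation: pick the io engine from
// PAGESERVER_VIRTUAL_FILE_IO_ENGINE, defaulting to the std-fs path when unset.
#[derive(Debug, Clone, Copy)]
enum IoEngineKind {
    StdFs,
    TokioEpollUring,
}

fn io_engine_from_env() -> IoEngineKind {
    match std::env::var("PAGESERVER_VIRTUAL_FILE_IO_ENGINE").as_deref() {
        Ok("tokio-epoll-uring") => IoEngineKind::TokioEpollUring,
        Ok("std-fs") | Err(_) => IoEngineKind::StdFs,
        Ok(other) => panic!("unknown PAGESERVER_VIRTUAL_FILE_IO_ENGINE value: {other}"),
    }
}

fn main() {
    println!("selected io engine: {:?}", io_engine_from_env());
}
```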
Conclusions:
|
refs #7136

Problem
-------
Before this PR, we were using `tokio_epoll_uring::thread_local_system()`, which panics on tokio_epoll_uring::System::launch() failure. As we've learned in the past (#6373 (comment)), some older Linux kernels account io_uring instances as locked memory. And while we've raised the limit in prod considerably, we did hit it once on 2024-03-11 16:30 UTC. That was after we enabled tokio-epoll-uring fleet-wide, but before we had shipped release-5090 (c6ed86d), which did away with the last mass-creation of tokio-epoll-uring instances, as per commit 3da410c ("tokio-epoll-uring: use it on the layer-creating code paths (#6378)", Christian Schwarz, Tue Mar 5 2024). Nonetheless, it highlighted that panicking in this situation is probably not ideal, as it can leave the pageserver process in a semi-broken state. Further, due to the low sampling rate of Prometheus metrics, we don't know much about the circumstances of this failure instance.

Solution
--------
This PR implements a custom thread_local_system() that is pageserver-aware and will do the following on failure:
- dump relevant stats to `tracing!`; hopefully they will be useful to understand the circumstances better
- if it's the locked-memory failure (or any other ENOMEM): abort() the process
- otherwise: retry with exponential back-off, capped at 3s
- add metric counters so we can create an alert

This makes sense in the production environment, where we know that _usually_ there's ample locked-memory allowance available, and we know the failure rate is rare.
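A hedged sketch of the failure-handling policy described above; `launch_fake_system` is a stand-in for `tokio_epoll_uring::System::launch()` (always succeeding here so the example terminates), and the real pageserver code additionally emits tracing output and metric counters:

```rust
use std::time::Duration;

// Hedged sketch of the described policy, with a stand-in launch function:
// abort the process if the launch failure is ENOMEM (locked-memory exhaustion),
// otherwise retry with exponential back-off capped at 3 seconds.
async fn launch_fake_system() -> Result<(), std::io::Error> {
    // stand-in for tokio_epoll_uring::System::launch(); always succeeds in this sketch
    Ok(())
}

async fn launch_with_policy() {
    let mut backoff = Duration::from_millis(10);
    let cap = Duration::from_secs(3);
    loop {
        match launch_fake_system().await {
            Ok(_system) => return,
            Err(e) if e.kind() == std::io::ErrorKind::OutOfMemory => {
                // locked memory exhausted: crash loudly rather than limp along half-broken
                eprintln!("io_uring launch hit ENOMEM: {e}");
                std::process::abort();
            }
            Err(e) => {
                eprintln!("io_uring launch failed, retrying in {backoff:?}: {e}");
                tokio::time::sleep(backoff).await;
                backoff = (backoff * 2).min(cap);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    launch_with_policy().await;
    println!("launched");
}
```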
While working on #5824 (sub-pr #6355 specifically), we found that sometimes the Rust unit tests would hang in CI.
The problem isn't reproducible locally.
It has reproduced 3 times in CI so far.