tokio-epoll-uring: root-cause hung pageserver rust unit tests #6373

Closed · 1 task done · Tracked by #4744
problame opened this issue Jan 17, 2024 · 12 comments

Labels: c/storage/pageserver (Component: storage: pageserver)

problame commented Jan 17, 2024

While working on #5824 (sub-PR #6355 specifically), we found that the Rust unit tests would sometimes hang in CI.
The problem isn't reproducible locally.
It has reproduced three times in CI so far.

Tasks

  1. a/test

problame commented Jan 17, 2024

Initial conclusions were posted in the thread where the first hang happened: #6355 (comment)

When it later hung again, @koivunej and I spent more time investigating it on the live system.

We attached to the hung processes using nsenter ... + gdb.

We see one thread for the Rust test harness (fn main), one for the running test (a single-threaded runtime, due to the #[tokio::test] macro), and many for the BACKGROUND_RUNTIME.

root@large-06:/__w/neon/neon# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nonroot        1  0.0  0.0   1048     4 ?        Ss   13:43   0:00 /sbin/docker-init -- tail -f /dev/null
nonroot        7  0.0  0.0   2424   504 ?        S    13:43   0:00 tail -f /dev/null
nonroot   116276  0.0  0.0   3896  2888 ?        Ss   13:48   0:00 bash -euxo pipefail /__w/_temp/43fba0b2-ccac-45b4-9d66-716eb78caef6.sh
nonroot   120401  0.0  0.0 2200300 30852 ?       Sl   13:49   0:00 /home/nonroot/.cargo/bin/cargo-nextest nextest run --locked --release --features testing
nonroot   121778  0.0  0.0 2530800 105800 ?      SLl  13:49   0:01 /__w/neon/neon/target/release/deps/pageserver-84319a60e7b86403 --exact tenant::remote_timeline_client::tests::bytes_unfinished_gauge_for_layer_file_uploads --nocapture
nonroot   121786  0.0  0.0 2530800 108132 ?      SLl  13:49   0:01 /__w/neon/neon/target/release/deps/pageserver-84319a60e7b86403 --exact tenant::remote_timeline_client::tests::upload_scheduling --nocapture
root      121945  0.0  0.0   6052  3676 ?        S    14:50   0:00 -bash
root      121951  0.0  0.0   8648  3312 ?        R+   14:51   0:00 ps aux




(gdb) info threads
  Id   Target Id                                            Frame
* 1    Thread 0x7f67d6a92500 (LWP 121778) "pageserver-8431" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7f67d6a90700 (LWP 121802) "tenant::remote_" 0x00007f67d6b8ed56 in epoll_wait (epfd=3, events=0x7f67d0000c20, maxevents=1024, timeout=69) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  3    Thread 0x7f67d6687700 (LWP 121863) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7f67d6486700 (LWP 121864) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  5    Thread 0x7f67d6282700 (LWP 121865) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  6    Thread 0x7f67d607e700 (LWP 121866) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7f67d5e7a700 (LWP 121867) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  8    Thread 0x7f67d5c76700 (LWP 121868) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  9    Thread 0x7f67d5a72700 (LWP 121869) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  10   Thread 0x7f67d586e700 (LWP 121870) "background op w" 0x00007f67d6b8ed56 in epoll_wait (epfd=11, events=0x7f67d00962b0, maxevents=1024, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  11   Thread 0x7f67d566a700 (LWP 121871) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  12   Thread 0x7f67d5466700 (LWP 121872) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  13   Thread 0x7f67d5262700 (LWP 121873) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  14   Thread 0x7f67d505e700 (LWP 121874) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  15   Thread 0x7f67d4e5a700 (LWP 121875) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  16   Thread 0x7f67d4c56700 (LWP 121876) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  17   Thread 0x7f67d4a52700 (LWP 121877) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  18   Thread 0x7f67d484e700 (LWP 121878) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  19   Thread 0x7f67d464a700 (LWP 121879) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  20   Thread 0x7f67d4446700 (LWP 121880) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  21   Thread 0x7f67d423f700 (LWP 121881) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  22   Thread 0x7f6787fff700 (LWP 121882) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  23   Thread 0x7f6787dfe700 (LWP 121883) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  24   Thread 0x7f6787bfd700 (LWP 121884) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  25   Thread 0x7f67879fc700 (LWP 121885) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  26   Thread 0x7f67877fb700 (LWP 121886) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  27   Thread 0x7f67875fa700 (LWP 121887) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  28   Thread 0x7f67873f9700 (LWP 121888) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  29   Thread 0x7f67871f8700 (LWP 121889) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  30   Thread 0x7f6786ff7700 (LWP 121890) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  31   Thread 0x7f6786df6700 (LWP 121891) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  32   Thread 0x7f6786bf5700 (LWP 121892) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  33   Thread 0x7f67869f4700 (LWP 121893) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  34   Thread 0x7f67867f3700 (LWP 121894) "background op w" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

Both the single-threaded runtime and the multi-threaded runtime were looping here:
https://github.com/neondatabase/tokio-epoll-uring/blob/0dd3a2f8bf3239d34a19719ef1a74146c093126f/tokio-epoll-uring/src/system/completion.rs#L469-L471

This means the oneshot_nonconsuming::Sender inside the SystemHandle hadn't been dropped yet; otherwise, the shutdown_rx.recv() arm further up in the tokio::select! statement would have been taken.
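
For illustration, here is a minimal sketch of that shutdown pattern (not the actual tokio-epoll-uring poller code; it uses tokio's standard oneshot channel as a stand-in for oneshot_nonconsuming): the loop only exits once the sender half is dropped or used, which is why a still-alive SystemHandle keeps the poller spinning on the other select! arm.

use std::time::Duration;

// Stand-in for the poller task: keep handling completions until the shutdown
// channel's sender (held by the SystemHandle stand-in) goes away.
async fn poller_sketch(mut shutdown_rx: tokio::sync::oneshot::Receiver<()>) {
    loop {
        tokio::select! {
            // Taken once the sender is dropped (or sends a value).
            _ = &mut shutdown_rx => break,
            // Stand-in for waiting on io_uring completions / periodic wakeups.
            _ = tokio::time::sleep(Duration::from_millis(100)) => {
                // process completions, then loop again
            }
        }
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (shutdown_tx, shutdown_rx) = tokio::sync::oneshot::channel::<()>();
    let poller = tokio::spawn(poller_sketch(shutdown_rx));
    // As long as shutdown_tx is alive, the poller keeps looping; dropping it
    // lets the shutdown arm win and the task exit.
    drop(shutdown_tx);
    poller.await.unwrap();
}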

When break-pointing on the code above in the single-threaded runtime and then looking at the backtrace, the rust-gdb output suggests we're still executing the tokio::test. I think I remember it showed that we're still in the block_on() call that is inserted by the tokio::test macro.

That is weird because we're pretty sure (we should probably confirm this again) that the only task executing on the runtimes was the poller task.[1]

But the poller task is a spawned task, not what's being block_on'ed. The block_on'ed future is the test, and we believe it was complete.

Footnotes

  1. It's easy to confirm this on the single-threaded runtime: just add a breakpoint in the tokio single-threaded-runtime scheduler/dispatcher code.

problame commented:

We decided that the way forward is to

problame commented:

Progress update:


problame commented Jan 19, 2024

Status update: the Rust unit test failures are due to OOMs. Thanks @koivunej for #6389.

Example output from a cherry-pick of #6389: https://github.com/neondatabase/neon/actions/runs/7583055464/job/20663745994

FAIL [   0.215s] pageserver tenant::ephemeral_file::tests::test_ephemeral_blobs

--- STDOUT:              pageserver tenant::ephemeral_file::tests::test_ephemeral_blobs ---

running 1 test
test tenant::ephemeral_file::tests::test_ephemeral_blobs ... FAILED

failures:

failures:
    tenant::ephemeral_file::tests::test_ephemeral_blobs

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 106 filtered out; finished in 0.18s


--- STDERR:              pageserver tenant::ephemeral_file::tests::test_ephemeral_blobs ---
thread 'tenant::ephemeral_file::tests::test_ephemeral_blobs' panicked at .cargo/git/checkouts/tokio-epoll-uring-f670d5b919cd2b86/0dd3a2f/tokio-epoll-uring/src/system/lifecycle.rs:56:82:
called `Result::unwrap()` on an `Err` value: Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" }
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/result.rs:1653:5
   3: <core::result::Result<io_uring::IoUring, std::io::error::Error>>::unwrap
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/result.rs:1077:23
   4: <tokio_epoll_uring::system::lifecycle::System>::launch_with_testing::{closure#0}
             at .cargo/git/checkouts/tokio-epoll-uring-f670d5b919cd2b86/0dd3a2f/tokio-epoll-uring/src/system/lifecycle.rs:56:48
   5: <tokio_epoll_uring::system::lifecycle::System>::launch::{closure#0}
             at .cargo/git/checkouts/tokio-epoll-uring-f670d5b919cd2b86/0dd3a2f/tokio-epoll-uring/src/system/lifecycle.rs:47:41
   6: <tokio::sync::once_cell::OnceCell<tokio_epoll_uring::system::lifecycle::handle::SystemHandle>>::get_or_init::<<tokio_epoll_uring::system::lifecycle::System>::launch, <tokio_epoll_uring::system::lifecycle::System>::launch::{closure#0}>::{closure#0}
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/sync/once_cell.rs:368:37
   7: tokio_epoll_uring::system::lifecycle::thread_local::thread_local_system::{closure#0}
             at .cargo/git/checkouts/tokio-epoll-uring-f670d5b919cd2b86/0dd3a2f/tokio-epoll-uring/src/system/lifecycle/thread_local.rs:14:45
   8: <pageserver::virtual_file::open_options::OpenOptions>::open::{closure#0}
             at pageserver/src/virtual_file/open_options.rs:100:71
   9: <pageserver::virtual_file::VirtualFile>::open_with_options::{closure#0}
             at pageserver/src/virtual_file.rs:375:51
  10: <pageserver::tenant::ephemeral_file::EphemeralFile>::create::{closure#0}
             at pageserver/src/tenant/ephemeral_file.rs:55:10
  11: pageserver::tenant::ephemeral_file::tests::test_ephemeral_blobs::{closure#0}
             at pageserver/src/tenant/ephemeral_file.rs:314:76
  12: <core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
  13: <core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>> as core::future::future::Future>::poll
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125:9
  14: <tokio::runtime::scheduler::current_thread::CoreGuard>::block_on::<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}::{closure#0}::{closure#0}
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:665:57
  15: tokio::runtime::coop::with_budget::<core::task::poll::Poll<core::result::Result<(), std::io::error::Error>>, <tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}::{closure#0}::{closure#0}>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
  16: tokio::runtime::coop::budget::<core::task::poll::Poll<core::result::Result<(), std::io::error::Error>>, <tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}::{closure#0}::{closure#0}>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
  17: <tokio::runtime::scheduler::current_thread::CoreGuard>::block_on::<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}::{closure#0}
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:665:25
  18: <tokio::runtime::scheduler::current_thread::Context>::enter::<core::task::poll::Poll<core::result::Result<(), std::io::error::Error>>, <tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}::{closure#0}>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:410:19
  19: <tokio::runtime::scheduler::current_thread::CoreGuard>::block_on::<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:664:36
  20: <tokio::runtime::scheduler::current_thread::CoreGuard>::enter::<<tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}, core::option::Option<core::result::Result<(), std::io::error::Error>>>::{closure#0}
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:743:68
  21: <tokio::runtime::context::scoped::Scoped<tokio::runtime::scheduler::Context>>::set::<<tokio::runtime::scheduler::current_thread::CoreGuard>::enter<<tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}, core::option::Option<core::result::Result<(), std::io::error::Error>>>::{closure#0}, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core>, core::option::Option<core::result::Result<(), std::io::error::Error>>)>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/scoped.rs:40:9
  22: tokio::runtime::context::set_scheduler::<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core>, core::option::Option<core::result::Result<(), std::io::error::Error>>), <tokio::runtime::scheduler::current_thread::CoreGuard>::enter<<tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}, core::option::Option<core::result::Result<(), std::io::error::Error>>>::{closure#0}>::{closure#0}
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context.rs:176:26
  23: <std::thread::local::LocalKey<tokio::runtime::context::Context>>::try_with::<tokio::runtime::context::set_scheduler<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core>, core::option::Option<core::result::Result<(), std::io::error::Error>>), <tokio::runtime::scheduler::current_thread::CoreGuard>::enter<<tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}, core::option::Option<core::result::Result<(), std::io::error::Error>>>::{closure#0}>::{closure#0}, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core>, core::option::Option<core::result::Result<(), std::io::error::Error>>)>
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/thread/local.rs:270:16
  24: <std::thread::local::LocalKey<tokio::runtime::context::Context>>::with::<tokio::runtime::context::set_scheduler<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core>, core::option::Option<core::result::Result<(), std::io::error::Error>>), <tokio::runtime::scheduler::current_thread::CoreGuard>::enter<<tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}, core::option::Option<core::result::Result<(), std::io::error::Error>>>::{closure#0}>::{closure#0}, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core>, core::option::Option<core::result::Result<(), std::io::error::Error>>)>
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/thread/local.rs:246:9
  25: tokio::runtime::context::set_scheduler::<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core>, core::option::Option<core::result::Result<(), std::io::error::Error>>), <tokio::runtime::scheduler::current_thread::CoreGuard>::enter<<tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}, core::option::Option<core::result::Result<(), std::io::error::Error>>>::{closure#0}>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context.rs:176:9
  26: <tokio::runtime::scheduler::current_thread::CoreGuard>::enter::<<tokio::runtime::scheduler::current_thread::CoreGuard>::block_on<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>::{closure#0}, core::option::Option<core::result::Result<(), std::io::error::Error>>>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:743:27
  27: <tokio::runtime::scheduler::current_thread::CoreGuard>::block_on::<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:652:19
  28: <tokio::runtime::scheduler::current_thread::CurrentThread>::block_on::<core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>::{closure#0}
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:175:28
  29: tokio::runtime::context::runtime::enter_runtime::<<tokio::runtime::scheduler::current_thread::CurrentThread>::block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>::{closure#0}, core::result::Result<(), std::io::error::Error>>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  30: <tokio::runtime::scheduler::current_thread::CurrentThread>::block_on::<core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/current_thread/mod.rs:167:9
  31: <tokio::runtime::runtime::Runtime>::block_on::<core::pin::Pin<&mut dyn core::future::future::Future<Output = core::result::Result<(), std::io::error::Error>>>>
             at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:348:47
  32: pageserver::tenant::ephemeral_file::tests::test_ephemeral_blobs
             at pageserver/src/tenant/ephemeral_file.rs:366:9
  33: pageserver::tenant::ephemeral_file::tests::test_ephemeral_blobs::{closure#0}
             at pageserver/src/tenant/ephemeral_file.rs:311:40
  34: <pageserver::tenant::ephemeral_file::tests::test_ephemeral_blobs::{closure#0} as core::ops::function::FnOnce<()>>::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
  35: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I'll investigate that.

Questions to answer:

  1. Why does the panic make the test hang?
  2. The io_uring ring buffers are mlocked; could the Docker environment set that limit to a low value, or could the high degree of concurrency on the large-* runners (32 cores => 32 concurrent nextest processes) be the limiting factor?

problame commented:

I hijacked a large-* runner. Inside the docker container (docker exec), here's the ulimit output:

nonroot@1c45f8fd824d:/__w/neon/neon$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 514764
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

=> 64 KiB of locked memory per process.


problame commented Jan 19, 2024

We're allocating about 1 KiB of locked memory per tokio-epoll-uring thread-local system.
The limit is per-process, and tokio-epoll-uring only gets used from executor threads, so locked memory usage would be num_executor_threads x 1 KiB.
The default limit is 64 KiB.
The large runners have 32 cores.
We have, I believe, 33 runtime threads.
64 KiB should be enough, but maybe my 1 KiB number is wrong.
I'll work with the CI team to raise the mlock limit on the CI runners.
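
As a quick sanity check of that arithmetic, here is a minimal sketch (a hypothetical helper, not code from the repo) that compares the estimate against the process's RLIMIT_MEMLOCK soft limit, using the same nix getrlimit API as the repro test further down. The 1 KiB per-system figure is the rough estimate from this comment, not a measured value.

// Hypothetical helper: estimate locked-memory usage of the thread-local
// tokio-epoll-uring systems and compare it against the soft RLIMIT_MEMLOCK.
fn memlock_headroom(num_executor_threads: u64) {
    let (soft, _hard) =
        nix::sys::resource::getrlimit(nix::sys::resource::Resource::RLIMIT_MEMLOCK).unwrap();
    let per_system_bytes: u64 = 1024; // rough estimate; may be wrong
    let estimated = num_executor_threads * per_system_bytes;
    // e.g. 33 threads x 1 KiB = 33 KiB, which should fit into the default 64 KiB
    println!("estimated locked bytes: {estimated}, soft limit: {soft}");
}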

Also, here's a set of links I read:

problame commented:

(@koivunej will investigate why the panics cause the test suite to hang instead of exiting)


koivunej commented Jan 20, 2024

I was looking at tenant::tests::test_images, which was the only test I remembered from the debugging sessions. Reproducing the hang condition is trivial, and I don't think there is anything to fix in tokio-epoll-uring apart from the config fix related to the mlock limit.

Regarding tests other than test_images: for example, tenant::ephemeral_file::tests::test_ephemeral_blobs fails fast rather than hanging. bytes_unfinished_gauge_for_layer_file_uploads does a similar timeline setup and flush-task usage as test_images, similar to the other RemoteTimelineClient hangs.


Reproducing the test hang

In test_images, Tenant::create_test_timeline calls save_metadata, which starts a tokio-epoll-uring system on the #[tokio::test] (current_thread) runtime. A panic there would always fail the test, so I had the idea of panicking only on subsequent tokio-epoll-uring launches. Those launches are caused by running the layer flushing task, so we get to reproduce the hang.

diff --git a/tokio-epoll-uring/src/system/lifecycle.rs b/tokio-epoll-uring/src/system/lifecycle.rs
index 82d6d22..0ff6ce8 100644
--- a/tokio-epoll-uring/src/system/lifecycle.rs
+++ b/tokio-epoll-uring/src/system/lifecycle.rs
@@ -50,6 +50,10 @@ impl System {
     pub(crate) async fn launch_with_testing(testing: Option<PollerTesting>) -> SystemHandle {
         let id = SYSTEM_ID.fetch_add(1, std::sync::atomic::Ordering::Relaxed);

+        if id != 0 {
+            panic!("foobar");
+        }
+
         // TODO: this unbounded channel is the root of all evil: unbounded queue for IOPS; should provie app option to back-pressure instead.
         let (submit_side, poller_ready_fut) = {
             let (slots_submit_side, slots_completion_side, slots_poller) = super::slots::new(id);

Given that there are many flushes in test_images, I cannot see how the test could have reached Ok(()), as we mistakenly thought we were seeing while debugging a stuck test run. After the layer flushing task has panicked, the test's calls to Timeline::freeze_and_flush will always hang on

rx_e = rx.changed() => {

Hanging while waiting for the response from the flush task is in line with our debugging sessions, because we were observing the #[tokio::test] runtime waiting on the outermost (block_on) future while executing other work (the 100ms sleep from tokio-epoll-uring). We just never saw anything of the test future because it was waiting on the flush channel for a completion message, which would never arrive. This sort of "lost wakeup" cannot be seen in a debugger. The panic happened in the spawned flushing task on the BACKGROUND_RUNTIME, which is why the test does not "react" to the panic.

Sometimes we saw more runtimes running the 100ms tasks. I think this could be reproduced by allowing the creation of two tokio-epoll-uring systems but having the layer flush task panic on the third system launch, because it executed on one of the 32 threads.

We should of course use move semantics on the channel used for flush requests/completions, so that a panic propagates to the waiting side as a closed channel -- this more or less confirms my earlier recollection that the flushing task does not behave well with panics, which came up when the idea of lock poisoning via asserts was thrown around.
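
To illustrate the move-semantics point: if the spawned flush task owns the sender half, a panic drops it and the waiting side gets a "channel closed" error promptly instead of hanging. A minimal, hypothetical sketch with tokio::sync::watch (not the actual flush-task code):

use tokio::sync::watch;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, mut rx) = watch::channel(0u64);

    // The spawned "flush task" owns (moves) the sender; when it panics, the
    // sender is dropped during unwinding and the channel closes.
    let flush_task = tokio::spawn(async move {
        let _tx = tx;
        panic!("simulated tokio-epoll-uring launch failure");
    });

    // Counterpart of the `rx_e = rx.changed()` arm in freeze_and_flush: with
    // move semantics this returns Err(...) shortly after the panic, instead of
    // waiting forever for a completion message that will never arrive.
    assert!(rx.changed().await.is_err());

    // The panic also surfaces as a JoinError on the task handle.
    assert!(flush_task.await.is_err());
}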

But apart from that, there is nothing except the mlock limit to fix on the io_uring side.


problame commented Jan 22, 2024

I raised the ulimit during test runs (commit in the integration branch)

Some processing of the build-debug test output:

cs@cs-neon-mbp:[~]: cat /Users/cs/Downloads/39.txt | grep 'Cannot allocate memory'
<no output>
cs@cs-neon-mbp:[~]: cat /Users/cs/Downloads/39.txt | grep -E '470 tests run|FAIL \[' | grep FAIL | gawk 'match($0,/.*\] (.*)/,m) { count[m[1]]++ } END { for (test in count) { print test,count[test]; } }'
pageserver walredo::tests::short_v14_redo 14
wal_craft v16::xlog_utils_test::test_find_end_of_wal_crossing_segment_followed_by_small_one 98
wal_craft v14::xlog_utils_test::test_find_end_of_wal_last_crossing_segment 18
wal_craft v15::xlog_utils_test::test_find_end_of_wal_simple 86
wal_craft v14::xlog_utils_test::test_find_end_of_wal_simple 18
wal_craft v15::xlog_utils_test::test_find_end_of_wal_crossing_segment_followed_by_small_one 76
pageserver walredo::tests::short_v14_fails_for_wrong_key_but_returns_zero_page 10
wal_craft v14::xlog_utils_test::test_find_end_of_wal_crossing_segment_followed_by_small_one 18
wal_craft v16::xlog_utils_test::test_find_end_of_wal_simple 98
wal_craft v16::xlog_utils_test::test_find_end_of_wal_last_crossing_segment 94
wal_craft v15::xlog_utils_test::test_find_end_of_wal_last_crossing_segment 84

The WAL test failures are because we actually invoke walredo in those tests, and they produce the error

could not resize shared memory segment "/PostgreSQL.3282941536" to 26976 bytes: No space left on device

probably because we don't clean up the shmem.


Now running without the raised ulimit to ensure it indeed fixes the issue.
I would expect to see the out-of-memory test failures there.

Commit in the integration branch.

cs@cs-neon-mbp:[~]: cat /Users/cs/Downloads/35.txt | grep 'Cannot allocate memory'
<no output>
cs@cs-neon-mbp:[~]: cat /Users/cs/Downloads/35.txt  | grep -E '470 tests run|FAIL \[' | grep FAIL | gawk 'match($0,/.*\] (.*)/,m) { count[m[1]]++ } END { for (test in count) { print test,count[test]; } }'
pageserver walredo::tests::short_v14_redo 14
wal_craft v16::xlog_utils_test::test_find_end_of_wal_crossing_segment_followed_by_small_one 90
wal_craft v15::xlog_utils_test::test_find_end_of_wal_simple 78
wal_craft v14::xlog_utils_test::test_find_end_of_wal_last_crossing_segment 14
wal_craft v14::xlog_utils_test::test_find_end_of_wal_simple 8
wal_craft v15::xlog_utils_test::test_find_end_of_wal_crossing_segment_followed_by_small_one 76
pageserver walredo::tests::short_v14_fails_for_wrong_key_but_returns_zero_page 14
wal_craft v14::xlog_utils_test::test_find_end_of_wal_crossing_segment_followed_by_small_one 12
wal_craft v16::xlog_utils_test::test_find_end_of_wal_simple 90
wal_craft v16::xlog_utils_test::test_find_end_of_wal_last_crossing_segment 90
wal_craft v15::xlog_utils_test::test_find_end_of_wal_last_crossing_segment 72

Weird, it's the same set of tests that's failing.


koivunej commented Jan 22, 2024

Logging further discussion regarding the remaining test failures: using ipcs, I noticed I have dangling shmem segments on my workstation after repeated test runs. We should probably attempt ipcrm -a in the test loop of 8a16fcc.

Since we have only a few wal_craft tests, it seems not all of them leak, but at least these leak (cargo nextest -p wal_craft ...), one 56-byte segment each:

  • v14::xlog_utils_test::test_update_next_xid (cannot reproduce, maybe not always?)
  • v14::xlog_utils_test::test_find_end_of_wal_simple
  • v14::xlog_utils_test::test_find_end_of_wal_last_crossing_segment
  • v14::xlog_utils_test::test_find_end_of_wal_crossing_segment_followed_by_small_one

v15 and v16 do not seem to leak every time, but repeated runs show they leak as well.

The walredo tests do not leak (even though they also use v14).


problame commented Jan 23, 2024

I tried to repro the mlock OOM on my local Debian bullseye system using the following test case, without success.

#[tokio::test]
async fn hitting_memlock_limit_does_not_panic() {
    let (soft, hard) =
        nix::sys::resource::getrlimit(nix::sys::resource::Resource::RLIMIT_MEMLOCK).unwrap();
    let low_enough_soft_limit_so_the_test_is_fast_bytes = 64 * 1024;
    nix::sys::resource::setrlimit(
        nix::sys::resource::Resource::RLIMIT_MEMLOCK,
        low_enough_soft_limit_so_the_test_is_fast_bytes,
        hard,
    )
    .unwrap();
    scopeguard::defer!({
        nix::sys::resource::setrlimit(nix::sys::resource::Resource::RLIMIT_MEMLOCK, soft, hard)
            .unwrap();
    });

    let mut systems = Vec::new();
    loop {
        match System::launch().await {
            Ok(system) => {
                // use the uring in case the memory is allocated lazily
                let ((), res) = system.nop().await;
                res.unwrap();
                systems.push(system); // keep alive until end of test
            }
            Err(e) => match e {
                crate::system::lifecycle::LaunchResult::IoUringBuild(e) => {
                    assert_eq!(e.kind(), std::io::ErrorKind::OutOfMemory);
                    // run this test with --test-threads=1 or nextest to get predictable results for systems.len() under a given ulimit
                    println!("hit limit after {} iterations", systems.len());
                    return;
                }
            },
        }
    }
}

That sent me down a rabbit hole, spelunking in the kernel git history and liburing to see whether anything changed about mlock.


https://github.com/axboe/liburing/blob/63342497bec87b72d4b415c8bed930c15ff83b0d/README#L37-L39

ulimit settings
---------------

io_uring accounts memory it needs under the rlimit memlocked option, which
can be quite low on some setups (64K). The default is usually enough for
most use cases, but bigger rings or things like registered buffers deplete
it quickly. root isn't under this restriction, but regular users are. Going
into detail on how to bump the limit on various systems is beyond the scope
of this little blurb, but check /etc/security/limits.conf for user specific
settings, or /etc/systemd/user.conf and /etc/systemd/system.conf for systemd
setups. This affects 5.11 and earlier, new kernels are less dependent
on RLIMIT_MEMLOCK as it is only used for registering buffers.

OK, interesting: from 5.12 on, the io_uring CQ and SQ rings themselves aren't accounted as locked memory anymore.

After some spelunking, I found the relevant commit:

commit 26bfa89e25f42d2b26fe951bbcf04bb13937fbba
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Feb 9 20:14:12 2021 -0700

    io_uring: place ring SQ/CQ arrays under memcg memory limits

    Instead of imposing rlimit memlock limits for the rings themselves,
    ensure that we account them properly under memcg with __GFP_ACCOUNT.
    We retain rlimit memlock for registered buffers, this is just for the
    ring arrays themselves.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Now, on bullseye, we have kernel 5.10.
Except that we actually have some 5.10.X LTS kernel plus some Debian patches on top.

(For posterity, the spelunking went as follows: read the io_uring_setup code, spot the symbol ACCT_PINNED, then search git log -p v5.10..v5.12 -- fs/io_uring.c for patches that contain ACCT_PINNED.)


On my system, where I can't repro the mlock OOM, I'm running

Linux devvm-mbp 5.10.0-26-arm64 #1 SMP Debian 5.10.197-1 (2023-09-29) aarch64 GNU/Linux

The system where we saw the OOM is runner large-03 (which hopefully hasn't had a kernel update since that run), at kernel version

cs@devvm-mbp:[~]: tsh ssh developer@large-03
$ uname -a
Linux large-03 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux

So these systems are running different kernels (the architecture difference doesn't matter; the relevant parts of io_uring are architecture-independent code).

The version numbers work as follows: 5.10.0-26-arm64 is the Debian package linux-image-... name, and 5.10.197 is the upstream LTS kernel version.

Let's ignore that Debian might apply additional patches, and just compare the upstream LTS versions.

git log v5.10.149 doesn't contain the commit.

git log v5.10.197 also doesn't contain the commit.

WTF.


Let's look at the io_uring changes in the Debian package changelog:

cs@devvm-mbp:[~/ext]: git clone https://salsa.debian.org/kernel-team/linux.git debian-linux-git
cs@devvm-mbp:[~/ext]: cd debian-linux-git/debian
cs@devvm-mbp:[~/ext/debian-linux-git/debian]: git log HEAD -n 1 | cat
commit b5b4209e6c477ab53bb662a061f9aa09d26b8259
Author: Salvatore Bonaccorso <carnil@debian.org>
Date:   Wed Jan 3 07:56:13 2024 +0100

    Add ABI reference for 5.10.0-27

    Gbp-Dch: Ignore
cs@devvm-mbp:[~/ext/debian-linux-git/debian]: cat changelog | grep -E '^(\S|.*io_uring)'
...
our patch isn't listed there either
...

Ok, we know that LTS kernels sometimes don't just cherry-pick patches but deviate significantly from the original X.Y release.

Let's go through all the commits on the 5.10 LTS branch and search for ACCT_LOCKED in the less output:

cs@devvm-mbp:[~/ext/linux]: git log -p v5.10.1..v5.10.208 --grep 'io_uring'

The only commit I found is:

commit 788d0824269bef539fe31a785b1517882eafed93
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Dec 22 14:30:11 2022 -0700

    io_uring: import 5.15-stable io_uring

    No upstream commit exists.

    This imports the io_uring codebase from 5.15.85, wholesale. Changes
    from that code base:

    - Drop IOCB_ALLOC_CACHE, we don't have that in 5.10.
    - Drop MKDIRAT/SYMLINKAT/LINKAT. Would require further VFS backports,
      and we don't support these in 5.10 to begin with.
    - sock_from_file() old style calling convention.
    - Use compat_get_bitmap() only for CONFIG_COMPAT=y

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

That's interesting.

This imports the io_uring codebase from 5.15.85, wholesale.

Let's see which LTS releases contain that commit:

cs@devvm-mbp:[~/ext/linux]: git tag -l --contains 788d0824269bef539fe31a785b1517882eafed93
v5.10.162
v5.10.163
v5.10.164
v5.10.165
v5.10.166
v5.10.167
v5.10.168
v5.10.169
v5.10.170
v5.10.171
v5.10.172
v5.10.173
v5.10.174
v5.10.175
v5.10.176
v5.10.177
v5.10.178
v5.10.179
v5.10.180
v5.10.181
v5.10.182
v5.10.183
v5.10.184
v5.10.185
v5.10.186
v5.10.187
v5.10.188
v5.10.189
v5.10.190
v5.10.191
v5.10.192
v5.10.193
v5.10.194
v5.10.195
v5.10.196
v5.10.197
v5.10.198
v5.10.199
v5.10.200
v5.10.201
v5.10.202
v5.10.203
v5.10.204
v5.10.205
v5.10.206
v5.10.207
v5.10.208

NICE!

v5.10.149 (where we saw the mlock situation) is not on that list, so it doesn't have the patch.

v5.10.197 is on the above list => contains the patch => can't repro the issue there.


Will draw conclusions after lunch.

problame added a commit to neondatabase/tokio-epoll-uring that referenced this issue Jan 23, 2024
…eatures (#34)

I looked into `dontfork()` because of
neondatabase/neon#6373 ;

I don't think it matters for that particular issue, but, it's still
unsafe to share the memory mappings with another process because the
synchronization primitives that tokio-epoll-uring wraps around them
wouldn't be shared across processes.

Also take the opportunity to review io_uring feature flags and make sure
they're there or produce some papertrail explaining why we don't care.
problame added a commit that referenced this issue Jan 23, 2024
- control via env var PAGESERVER_VIRTUAL_FILE_IO_ENGINE
- if an io engine other than std-fs is used, it shows up in the test
  name; this is so that we can continue to use the flaky tests database
- raise memlock limit & while at it also raise shmem limit
  for the Rust tests. It's needed on our older runners that use
  an older 5.10.X LTS kernel, where io_uring SQ and CQ still
  counted towards the rlimit, see
  #6373 (comment)
  for details.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>

problame commented Jan 24, 2024

Conclusions:

problame added a commit that referenced this issue Mar 15, 2024
refs #7136

Problem
-------

Before this PR, we were using `tokio_epoll_uring::thread_local_system()`,
which panics on tokio_epoll_uring::System::launch() failure.

As we've learned in the past (#6373 (comment)),
some older Linux kernels account io_uring instances as locked memory.

And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d)
which did away with the last mass-creation of tokio-epoll-uring
instances as per

    commit 3da410c
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Tue Mar 5 10:03:54 2024 +0100

        tokio-epoll-uring: use it on the layer-creating code paths (#6378)

Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken state.

Further, due to low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.

Solution
--------

This PR implements a custom thread_local_system() that is pageserver-aware
and will do the following on failure:
- dump relevant stats to `tracing!`, hopefully they will be useful to
  understand the circumstances better
- if it's the locked memory failure (or any other ENOMEM): abort() the
  process
- if it's ENOMEM, retry with exponential back-off, capped at 3s.
- add metric counters so we can create an alert

This makes sense in the production environment where we know that
_usually_, there's ample locked memory allowance available, and we know
the failure rate is rare.