Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panics when using tokio_taskdump #6051

Closed
DmitrySamoylov opened this issue Oct 5, 2023 · 5 comments · Fixed by #6194
Closed

Panics when using tokio_taskdump #6051

DmitrySamoylov opened this issue Oct 5, 2023 · 5 comments · Fixed by #6194
Assignees
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-taskdump --cfg tokio_taskdump

Comments

@DmitrySamoylov
Copy link

Version
Tested in:

Platform
Linux komp 5.4.0-163-generic #180-Ubuntu SMP Tue Sep 5 13:21:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Description
Program panics when using tokio_taskdump and tokio::time::sleep() here:

https://github.com/tokio-rs/tokio/blob/tokio-1.32.0/tokio/src/runtime/task/state.rs#L118

Details

I used the examples/dump.rs file as a template:

  • added another task that sleeps in a loop
  • removed waiting for Ctrl+C to make it dump the tasks in a loop
  • removed barrier tasks because the issue can be reproduced without those
use std::time::Duration;

#[cfg(all(
    tokio_unstable,
    tokio_taskdump,
    target_os = "linux",
    any(target_arch = "aarch64", target_arch = "x86", target_arch = "x86_64")
))]
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    async fn dump_or_quit() {
        use tokio::time::{timeout, Duration};
        let handle = tokio::runtime::Handle::current();

        let mut cnt = 1;

        loop {
            if timeout(Duration::from_secs(2), handle.dump()).await.is_ok() {
                println!("Got tasks {cnt}");
                cnt += 1;
            } else {
                println!("Task dumping timed out. Use a native debugger (like gdb) to debug the deadlock.");
            }
        }
    }

    let task_3 = tokio::spawn(async {
        let mut cnt = 1;
        loop {
            println!("Sleep {cnt} ...");
            tokio::time::sleep(Duration::from_millis(1)).await;
            println!("Wake up");
            cnt += 1;
        }
    });

    tokio::select!(
        _ = dump_or_quit() => {},
        _ = task_3 => {},
    );

    Ok(())
}

Command to run the example:

RUST_BACKTRACE=full RUSTFLAGS="--cfg tokio_unstable --cfg tokio_taskdump" cargo run --example dump

It panics after a few iterations:

...
Sleep 48 ...
Got tasks 262
Got tasks 263
Got tasks 264
Got tasks 265
Got tasks 266
Got tasks 267
Wake up
Sleep 49 ...
Got tasks 268
Got tasks 269
Got tasks 270
Got tasks 271
Got tasks 272
Got tasks 273
Got tasks 274
Wake up
Sleep 50 ...
thread 'tokio-runtime-worker' panicked at 'assertion failed: next.is_notified()', tokio/src/runtime/task/state.rs:118:13
stack backtrace:
   0:     0x55d10bf39501 - std::backtrace_rs::backtrace::libunwind::trace::he648b5c8dd376705
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55d10bf39501 - std::backtrace_rs::backtrace::trace_unsynchronized::h5da3e203eef39e9f
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55d10bf39501 - std::sys_common::backtrace::_print_fmt::h8d28d3f20588ae4c
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x55d10bf39501 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd9a5b0c9c6b058c0
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55d10bf5ad2f - core::fmt::rt::Argument::fmt::h0afc04119f252b53
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/fmt/rt.rs:138:9
   5:     0x55d10bf5ad2f - core::fmt::write::h50b1b3e73851a6fe
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/fmt/mod.rs:1094:21
   6:     0x55d10bf367d7 - std::io::Write::write_fmt::h184eaf275e4484f0
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/io/mod.rs:1714:15
   7:     0x55d10bf39315 - std::sys_common::backtrace::_print::hf58c3a5a25090e71
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x55d10bf39315 - std::sys_common::backtrace::print::hb9cf0a7c7f077819
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x55d10bf3a843 - std::panicking::default_hook::{{closure}}::h066adb2e3f3e2c07
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:269:22
  10:     0x55d10bf3a5d4 - std::panicking::default_hook::h277fa2776900ff14
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:288:9
  11:     0x55d10bf3adc9 - std::panicking::rust_panic_with_hook::hceaf38da6d9db792
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:705:13
  12:     0x55d10bf3ac81 - std::panicking::begin_panic_handler::{{closure}}::h2bce3ed2516af7df
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:595:13
  13:     0x55d10bf39966 - std::sys_common::backtrace::__rust_end_short_backtrace::h090f3faf8f98a395
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:151:18
  14:     0x55d10bf3aa12 - rust_begin_unwind
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
  15:     0x55d10bdc1293 - core::panicking::panic_fmt::h4ec8274704d163a3
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
  16:     0x55d10bdc1323 - core::panicking::panic::hee69a8315e4031d6
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:117:5
  17:     0x55d10be45a55 - tokio::runtime::task::state::State::transition_to_running::{{closure}}::h95e630a37ee895e2
                               at /home/user/tokio/src/runtime/task/state.rs:118:13
  18:     0x55d10be46701 - tokio::runtime::task::state::State::fetch_update_action::h96feb951cebb7868
                               at /home/user/tokio/src/runtime/task/state.rs:474:34
  19:     0x55d10be1948a - tokio::runtime::task::state::State::transition_to_running::h8a2b2d6a1455fb43
                               at /home/user/tokio/src/runtime/task/state.rs:116:9
  20:     0x55d10bdd663f - tokio::runtime::task::harness::Harness<T,S>::poll_inner::h248818a42e4148bb
                               at /home/user/tokio/src/runtime/task/harness.rs:194:15
  21:     0x55d10bdd7ce3 - tokio::runtime::task::harness::Harness<T,S>::poll::h4ad6f9774e18a22d
                               at /home/user/tokio/src/runtime/task/harness.rs:153:15
  22:     0x55d10bdd092b - tokio::runtime::task::raw::poll::h4e10ccff17a6f4cf
                               at /home/user/tokio/src/runtime/task/raw.rs:276:5
  23:     0x55d10be06a37 - tokio::runtime::task::raw::RawTask::poll::he4ed16c81c86cc78
                               at /home/user/tokio/src/runtime/task/raw.rs:200:18
  24:     0x55d10be490a2 - tokio::runtime::task::LocalNotified<S>::run::hcd5581e1b71cbefd
                               at /home/user/tokio/src/runtime/task/mod.rs:400:9
  25:     0x55d10be0349d - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}::hed642f77845dcd91
                               at /home/user/tokio/src/runtime/scheduler/multi_thread/worker.rs:576:13
  26:     0x55d10be756a4 - tokio::runtime::coop::with_budget::hb2baf3ffff0929af
                               at /home/user/tokio/src/runtime/coop.rs:107:5
  27:     0x55d10be756a4 - tokio::runtime::coop::budget::h6d5e2007b832a436
                               at /home/user/tokio/src/runtime/coop.rs:73:5
  28:     0x55d10be756a4 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h4fe289d4d8c1ae43
                               at /home/user/tokio/src/runtime/scheduler/multi_thread/worker.rs:575:9
  29:     0x55d10be74c85 - tokio::runtime::scheduler::multi_thread::worker::Context::run::hb6816f625bf8d886
                               at /home/user/tokio/src/runtime/scheduler/multi_thread/worker.rs:526:24
  30:     0x55d10be03389 - tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}::h28ada17212863590
                               at /home/user/tokio/src/runtime/scheduler/multi_thread/worker.rs:491:21
  31:     0x55d10be9f2b0 - tokio::runtime::context::scoped::Scoped<T>::set::h6d1895efaf3eea8a
                               at /home/user/tokio/src/runtime/context/scoped.rs:40:9
  32:     0x55d10be1a2db - tokio::runtime::context::set_scheduler::{{closure}}::h1957ce6833a8f950
                               at /home/user/tokio/src/runtime/context.rs:176:26
  33:     0x55d10be66bce - std::thread::local::LocalKey<T>::try_with::h82e367c7cac01a03
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/thread/local.rs:270:16
  34:     0x55d10be649bb - std::thread::local::LocalKey<T>::with::hd0d93a80ea7d416f
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/thread/local.rs:246:9
  35:     0x55d10be1a1d4 - tokio::runtime::context::set_scheduler::h0d13417f5a845ec8
                               at /home/user/tokio/src/runtime/context.rs:176:9
  36:     0x55d10be03285 - tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::h949bdeefb0f6c46e
                               at /home/user/tokio/src/runtime/scheduler/multi_thread/worker.rs:486:9
  37:     0x55d10be1aa78 - tokio::runtime::context::runtime::enter_runtime::ha60044a4cbaa90a3
                               at /home/user/tokio/src/runtime/context/runtime.rs:65:16
  38:     0x55d10be7470c - tokio::runtime::scheduler::multi_thread::worker::run::h4dafbe68e68f83b5
                               at /home/user/tokio/src/runtime/scheduler/multi_thread/worker.rs:478:5
  39:     0x55d10be030db - tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}::h50bdc1b70101e192
                               at /home/user/tokio/src/runtime/scheduler/multi_thread/worker.rs:447:45
  40:     0x55d10bebac2e - <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll::h334944e66063c476
                               at /home/user/tokio/src/runtime/blocking/task.rs:42:21
  41:     0x55d10be6ca6d - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::he105471981f23ffb
                               at /home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  42:     0x55d10be1e034 - tokio::runtime::task::core::Core<T,S>::poll::{{closure}}::h8df4c45233b0603c
                               at /home/user/tokio/src/runtime/task/core.rs:328:17
  43:     0x55d10be1d6cb - tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut::h184352df67f52f2b
                               at /home/user/tokio/src/loom/std/unsafe_cell.rs:16:9
  44:     0x55d10be1d6cb - tokio::runtime::task::core::Core<T,S>::poll::h0abd0ec39b5c36a7
                               at /home/user/tokio/src/runtime/task/core.rs:317:13
  45:     0x55d10be970d5 - tokio::runtime::task::harness::poll_future::{{closure}}::h40c1a93b4e43f704
                               at /home/user/tokio/src/runtime/task/harness.rs:485:19
  46:     0x55d10beb81b3 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h608218ca4544e505
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panic/unwind_safe.rs:271:9
  47:     0x55d10be3d07c - std::panicking::try::do_call::hb800cf12f223711d
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:500:40
  48:     0x55d10be3e1db - __rust_try
  49:     0x55d10be3c708 - std::panicking::try::h99e6562138863caa
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:464:19
  50:     0x55d10be72e3b - std::panic::catch_unwind::hd8d54039d682b2c4
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panic.rs:142:14
  51:     0x55d10be96baf - tokio::runtime::task::harness::poll_future::h8cf1c3d01f2a1b03
                               at /home/user/tokio/src/runtime/task/harness.rs:473:18
  52:     0x55d10be94e99 - tokio::runtime::task::harness::Harness<T,S>::poll_inner::hbc159f4ab6ea18d8
                               at /home/user/tokio/src/runtime/task/harness.rs:208:27
  53:     0x55d10be948e7 - tokio::runtime::task::harness::Harness<T,S>::poll::h2e1a1a5f6baa8ce4
                               at /home/user/tokio/src/runtime/task/harness.rs:153:15
  54:     0x55d10be0562d - tokio::runtime::task::raw::poll::h4a595bf91c465f42
                               at /home/user/tokio/src/runtime/task/raw.rs:276:5
  55:     0x55d10be06a37 - tokio::runtime::task::raw::RawTask::poll::he4ed16c81c86cc78
                               at /home/user/tokio/src/runtime/task/raw.rs:200:18
  56:     0x55d10be49197 - tokio::runtime::task::UnownedTask<S>::run::h938ccfd06f9b14b3
                               at /home/user/tokio/src/runtime/task/mod.rs:437:9
  57:     0x55d10be6d727 - tokio::runtime::blocking::pool::Task::run::h5d1f965c15a0b6a7
                               at /home/user/tokio/src/runtime/blocking/pool.rs:159:9
  58:     0x55d10be6f259 - tokio::runtime::blocking::pool::Inner::run::ha64a01cb1f0779c9
                               at /home/user/tokio/src/runtime/blocking/pool.rs:513:17
  59:     0x55d10be3bb64 - tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}::he2ba3ec07c196481
                               at /home/user/tokio/src/runtime/blocking/pool.rs:471:13
  60:     0x55d10be47af9 - std::sys_common::backtrace::__rust_begin_short_backtrace::h9786565606a7a88c
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:135:18
  61:     0x55d10be9b7b2 - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::h056680a07d499a88
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/thread/mod.rs:529:17
  62:     0x55d10beb83e2 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hed939398bf026d3a
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panic/unwind_safe.rs:271:9
  63:     0x55d10be3d03b - std::panicking::try::do_call::ha579ad7387e57215
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:500:40
  64:     0x55d10be3e1db - __rust_try
  65:     0x55d10be3c2b1 - std::panicking::try::h46c143efe7d14943
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:464:19
  66:     0x55d10be9b5c2 - std::panic::catch_unwind::h788ee0795e21e252
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panic.rs:142:14
  67:     0x55d10be9b5c2 - std::thread::Builder::spawn_unchecked_::{{closure}}::hf11b55b64b8ec3de
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/thread/mod.rs:528:30
  68:     0x55d10be080df - core::ops::function::FnOnce::call_once{{vtable.shim}}::h312502b7bb1c5fff
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/ops/function.rs:250:5
  69:     0x55d10bf3d295 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hc0b1022758ecac73
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/boxed.rs:1993:9
  70:     0x55d10bf3d295 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h0c9654ebe7ad657e
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/boxed.rs:1993:9
  71:     0x55d10bf3d295 - std::sys::unix::thread::Thread::new::thread_start::h04c8e9c7d83d3bd5
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys/unix/thread.rs:108:17
  72:     0x7fc2ca93a609 - start_thread
                               at /build/glibc-BHL3KM/glibc-2.31/nptl/pthread_create.c:477:8
  73:     0x7fc2ca70a133 - clone
                               at /build/glibc-BHL3KM/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
  74:                0x0 - <unknown>
worker thread panicking; aborting process
Aborted (core dumped)
@DmitrySamoylov DmitrySamoylov added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Oct 5, 2023
@Darksonn Darksonn added the M-taskdump --cfg tokio_taskdump label Oct 5, 2023
@Darksonn
Copy link
Contributor

Thanks for reporting this. I think I know what causes it.

@jswrenn jswrenn self-assigned this Dec 4, 2023
jswrenn added a commit to jswrenn/tokio that referenced this issue Dec 5, 2023
jswrenn added a commit to jswrenn/tokio that referenced this issue Dec 5, 2023
jswrenn added a commit to jswrenn/tokio that referenced this issue Dec 6, 2023
jswrenn added a commit to jswrenn/tokio that referenced this issue Dec 6, 2023
jswrenn added a commit to jswrenn/tokio that referenced this issue Dec 6, 2023
jswrenn added a commit to jswrenn/tokio that referenced this issue Dec 6, 2023
jswrenn added a commit to jswrenn/tokio that referenced this issue Dec 6, 2023
Previously, we attempted to trace tasks that were already notified,
leading to panics.

This PR modifies `transition_to_notified_for_tracing` (and its
callers) to be fallible, thus avoiding this issue.

Fixes tokio-rs#6051
jswrenn added a commit to jswrenn/tokio that referenced this issue Dec 6, 2023
Previously, we attempted to trace tasks that were already notified,
leading to panics.

This PR modifies `transition_to_notified_for_tracing` (and its
callers) to be fallible, thus avoiding this issue.

Fixes tokio-rs#6051
@jswrenn
Copy link
Contributor

jswrenn commented Dec 7, 2023

Tremendous apologies for the delay in fixing this. We should have a fix merged soon, but if you'd like to test it out in the meantime, check out #6194!

@scottlamb
Copy link

I thought this was fixed with your change above, but I saw a similar error today with tokio 1.35.0 in a proprietary app.

thread 'tokio-runtime-worker' panicked at 'assertion failed: next.is_notified()', /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.35.0/src/runtime/task/state.rs:118:13

Looks like the task dump did complete; I guess it was worker threads that panicked rather than the caller of the dump code.

Unfortunately this was a rare production thing, and I don't know how to reproduce it. Any suggestions for what info might be useful in understanding the problem, either retroactively (anything I can dig out of those dumps that might be helpful?) or something we can log for next time?

This may not be relevant at all, but the dump in question fired because we had some sort of yet-undiagnosed memory leak, then started to bog down with paging in the program binary, then failed container health checks, and have it configured to fire the dump on SIGTERM. I'm planning to set up heap profiling to figure out the memory leak. So far I know nothing—the memory leak could be something totally unrelated in my app or a third-party C++ library or could also be a tokio bug. Who knows.

@Darksonn
Copy link
Contributor

The full backtrace would be useful if you have it. Could you file a new bug saying that this assert can still be triggered on 1.35.0?

And you are sure that this happened in relation to a taskdump? It's not some unrelated thing going on?

@scottlamb
Copy link

Could you file a new bug saying that this assert can still be triggered on 1.35.0?

#6343, with a full stack trace in the attachment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-taskdump --cfg tokio_taskdump
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants