-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement task dumps for current-thread runtime. #5608
Conversation
b9caa5f
to
5ff26f4
Compare
frames.push(frame.to_owned().into()); | ||
} | ||
|
||
if ptr::eq(frame.symbol_address(), Self::leaf as *const _) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are failing on MacOS because the unwinder never produces a symbol address exactly equalling Self::leaf
— it's off by twelve bytes. I don't yet know why this occurs, or if the 12 byte offset is reasonably consistent or an artifact of codegen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened an issue on backtrace-rs seeking further clarification: rust-lang/backtrace-rs#520.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue occurs because:
On macOS the function to get the address of the enclosing function of an ip address (
_Unwind_FindEnclosingFunction
) is unreliable due to compact unwind info collapsing multiple functions with identical unwind info together
...and is thus not used. Consequently, unwinding on macOS with backtrace-rs
currently always emits the same value for symbol address and instruction pointer.
There are a few ways we might work around this (or, even, fix the issue in backtrace-rs
), but in the short term I think we should probably just drop macOS support for taskdumping from this initial PR.
// set the notified bit | ||
let _ = task.as_raw().state().transition_to_notified_for_tracing(); | ||
// store the raw tasks into a vec | ||
tasks.push(task.as_raw()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably encapsulate this, because I don't want ref-count logic to happen outside of src/runtime/task
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved by 88ef47e?
/// A [`Frame`] in an intrusive, doubly-linked tree of [`Frame`]s. | ||
struct Frame { | ||
/// The location associated with this frame. | ||
inner_addr: *const c_void, | ||
|
||
/// The parent frame, if any. | ||
parent: Option<NonNull<Frame>>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this linked tree? Is there any reason we can't just use a Vec<*const c_void>
in the context to store this information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The frame stack is a very hot path — the linked Frame
approach aids with memory locality and eliminates allocations. Allocations are not particularly an issue in this PR, where the stack depth never exceeds 1 (for the root frame), but in a follow-up PRs I'll be using this Frame
stack to associate information (like time durations, request URLs, etc) with actual stack frames.
Task dumps are snapshots of runtime state. Taskdumps are collected by instrumenting Tokio's leaves to conditionally collect backtraces, which are then coalesced per-task into execution tree traces. This initial implementation only supports collecting taskdumps from within the context of a current-thread runtime, and only `yield_now()` is instrumented.
Taskdumps depend on backtrace-rs, whose unwinding behavior is platform-dependent. Taskdumps should not be enabled on a platform without first assessing and mitigating platform-dependent behavior. ref: https://github.com/tokio-rs/tokio/pull/5608/files#r1162746834
This version gives us `From<Frame> for BacktraceFrame`.
Taskdumps depend on backtrace-rs, whose unwinding behavior is platform-dependent. Taskdumps should not be enabled on a platform without first assessing and mitigating platform-dependent behavior. ref: https://github.com/tokio-rs/tokio/pull/5608/files#r1162746834
02a0c4c
to
c824389
Compare
Happy to say I've resolved your comments, @Darksonn, and got this passing CI. |
Namely, when it is used without `--cfg tokio_unstable`, or when it is used on an unsupported target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. I left several comments that are not required to apply before merging the PR, but we should create a "stabilization" tracking issue where we can keep track of remaining open questions.
@@ -252,6 +252,14 @@ impl Handle { | |||
/// [`tokio::time`]: crate::time | |||
#[track_caller] | |||
pub fn block_on<F: Future>(&self, future: F) -> F::Output { | |||
#[cfg(all( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't require this change for the PR to land, but generally, these inline cfg flags are hard to maintain. I tend to prefer using cfg to define APIs and then using the API regardless of the cfg in code like this. For example, Trace::root
would always be defined, but when the task dump capability is not enabled, it would be a no-op and return the argument without modifying it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, many of these cfgs could be reduced to a single #[cfg(tokio_taskdump)]
since lib.rs
contains a compile_error!
if you have #[cfg(tokio_taskdump)]
without the other options also set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that reducing these to just #[cfg(tokio_taskdump)]
wouldn't prevent errors in these locations under odd configurations, and thus would drown out the lone explicit compile_error
in lib.rs
. I think Carl's suggestion is probably the way to go.
@@ -233,6 +233,11 @@ cfg_rt! { | |||
mod defer; | |||
pub(crate) use defer::Defer; | |||
|
|||
cfg_taskdump! { | |||
pub mod dump; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not critical for this PR, but we probably want to reconsider the API details before stabilization.
Today, tokio::runtime
does not have any public submodules. Instead, it is flat. That would apply to task dumps as well. We would need to spend a bit of time bikeshedding type names, though.
tokio/src/runtime/task/mod.rs
Outdated
target_os = "linux", | ||
any(target_arch = "aarch64", target_arch = "i686", target_arch = "x86_64") | ||
))] | ||
pub(crate) fn as_raw(&self) -> RawTask { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What safety considerations, if any, are there for the caller when invoking this method? (Not a rhetorical question, I don't actually remember 😆 )
|
||
let mut traces = vec![]; | ||
|
||
// todo: how to make this work outside of a runtime context? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will most likely have to send a signal somehow.
let _ = injection.drain(..); | ||
|
||
// notify each task | ||
let mut tasks = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an initial implementation, this is fine. Before stabilizing, we should consider how this will work when many tasks spawned onto the runtime.
I believe I've addressed all outstanding issues blocking this PR. I've also added a "Progress" section to the feature request to track progress on the task dump feature. |
tokio/src/lib.rs
Outdated
tokio_taskdump, | ||
feature = "rt", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would trigger this compile error even without rt
. You aren't going to be enabling --cfg tokio_taskdump
without knowing about it. Possibly also compile-fail on --cfg tokio_taskdump
without rt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly also compile-fail on
--cfg tokio_taskdump
without rt.
Hm, unfortunately, this causes both a CI failure an a usability issue: You might just keep --cfg tokio_unstable --cfg tokio_taskdump
in your environment (.cargo/config makes this easy to do), but not always want to actually use the taskdumping feature. In such scenarios, your build will fail because of the explicit compile_error!
.
Consequently, I've reverted my attempt at adding this compile_error: 8318d93
@@ -252,6 +252,14 @@ impl Handle { | |||
/// [`tokio::time`]: crate::time | |||
#[track_caller] | |||
pub fn block_on<F: Future>(&self, future: F) -> F::Output { | |||
#[cfg(all( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, many of these cfgs could be reduced to a single #[cfg(tokio_taskdump)]
since lib.rs
contains a compile_error!
if you have #[cfg(tokio_taskdump)]
without the other options also set.
} | ||
|
||
/// Trace and poll all tasks of the current_thread runtime. | ||
pub(in crate::runtime) fn trace_current_thread( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make the trace_current_thread function more readable and easier to understand? I think we should encapsulate some of the logic into separate functions to improve the function's clarity and maintainability.
Something like this (no tested)
/// Notify all tasks and return a vector of their raw pointers.
fn notify_all_tasks(owned: &OwnedTasks<Arc<current_thread::Handle>>) -> Vec<NonNull<task::RawTask>> {
let mut tasks = vec![];
owned.for_each(|task| {
task.as_raw().state().transition_to_notified_for_tracing();
tasks.push(task.as_raw());
});
tasks
}
/// Collect the traces for all tasks and return a vector of `Trace` objects.
fn collect_task_traces(tasks: Vec<NonNull<task::RawTask>>) -> Vec<Trace> {
tasks
.into_iter()
.map(|task| {
let ((), trace) = Trace::capture(|| task.poll());
trace
})
.collect()
}
/// Trace and poll all tasks of the current_thread runtime.
pub(in crate::runtime) fn trace_current_thread(
owned: &OwnedTasks<Arc<current_thread::Handle>>,
local: &mut VecDeque<Notified<Arc<current_thread::Handle>>>,
injection: &mut VecDeque<Notified<Arc<current_thread::Handle>>>,
) -> Vec<Trace> {
local.clear();
injection.clear();
let tasks = notify_all_tasks(owned);
collect_task_traces(tasks)
}
|
||
impl Tasks { | ||
/// Iterate over tasks. | ||
pub fn iter(&self) -> impl Iterator<Item = &Task> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible improvement here is to use impl Trait to simplify the definition of methods that return iterators. For example:
pub fn iter(&self) -> impl Iterator<Item = &Task> { | |
pub fn iter(&self) -> impl Iterator<Item = &Task> + '_ { |
'_ means that the lifetime of the returned iterator is bound to the lifetime of the Tasks object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GTM (Looks good to me), just some comments and minor suggestions.
.github/workflows/ci.yml
Outdated
- name: check --feature-powerset --unstable | ||
run: cargo hack check --all --feature-powerset --depth 2 -Z avoid-dev-deps --keep-going | ||
env: | ||
RUSTFLAGS: --cfg tokio_unstable -Dwarnings | ||
RUSTFLAGS: --cfg tokio_unstable --cfg tokio_taskdump -Dwarnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is the test designed to test various combinations of flags, I suggest we also try with just --cfg tokio_unstable
.
# Try with unstable feature flags
- name: check --feature-powerset --unstable
run: cargo hack check --all --feature-powerset --depth 2 -Z avoid-dev-deps --keep-going
env:
RUSTFLAGS: --cfg tokio_unstable -Dwarnings
# Try with unstable and taskdump feature flags
- name: check --feature-powerset --unstable --taskdump
run: cargo hack check --all --feature-powerset --depth 2 -Z avoid-dev-deps --keep-going
env:
RUSTFLAGS: --cfg tokio_unstable --cfg tokio_taskdump -Dwarnings
target_os = "linux", | ||
any(target_arch = "aarch64", target_arch = "x86", target_arch = "x86_64") | ||
))] | ||
crate::runtime::task::trace::Trace::leaf(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR currently only marks yield_now
as a leaf. That's fine, but I would like more futures to be included in the first release with taskdump support.
Since we just made a Tokio release, I think we can merge this now, since we will have plenty of time to add additional functions as leaves.
Resolves a semantic conflict between tokio-rs#5608 and tokio-rs#5628. ref: https://discord.com/channels/500028886025895936/500336346770964480/1100809662208675982
This PR implements task dumps on the multi-thread runtime. It complements tokio-rs#5608, which implemented task dumps on the current-thread runtime.
This PR implements task dumps on the multi-thread runtime. It complements tokio-rs#5608, which implemented task dumps on the current-thread runtime.
This PR implements task dumps on the multi-thread runtime. It complements tokio-rs#5608, which implemented task dumps on the current-thread runtime.
This PR implements task dumps on the multi-thread runtime. It complements tokio-rs#5608, which implemented task dumps on the current-thread runtime.
This PR implements task dumps on the multi-thread runtime. It complements tokio-rs#5608, which implemented task dumps on the current-thread runtime.
This patch implements task dumps on the multi-thread runtime. It complements #5608, which implemented task dumps on the current-thread runtime.
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [tokio](https://tokio.rs) ([source](https://github.com/tokio-rs/tokio)) | dependencies | minor | `1.28.2` -> `1.29.1` | | [tokio](https://tokio.rs) ([source](https://github.com/tokio-rs/tokio)) | dev-dependencies | minor | `1.28.2` -> `1.29.1` | --- ### Release Notes <details> <summary>tokio-rs/tokio (tokio)</summary> ### [`v1.29.1`](https://github.com/tokio-rs/tokio/releases/tag/tokio-1.29.1): Tokio v1.29.1 [Compare Source](tokio-rs/tokio@tokio-1.29.0...tokio-1.29.1) ##### Fixed - rt: fix nesting two `block_in_place` with a `block_on` between (#​5837]) #​5837]: tokio-rs/tokio#5837 ### [`v1.29.0`](https://github.com/tokio-rs/tokio/releases/tag/tokio-1.29.0): Tokio v1.29.0 [Compare Source](tokio-rs/tokio@tokio-1.28.2...tokio-1.29.0) Technically a breaking change, the `Send` implementation is removed from `runtime::EnterGuard`. This change fixes a bug and should not impact most users. ##### Breaking - rt: `EnterGuard` should not be `Send` (#​5766]) ##### Fixed - fs: reduce blocking ops in `fs::read_dir` (#​5653]) - rt: fix possible starvation (#​5686], #​5712]) - rt: fix stacked borrows issue in `JoinSet` (#​5693]) - rt: panic if `EnterGuard` dropped incorrect order (#​5772]) - time: do not overflow to signal value (#​5710]) - fs: wait for in-flight ops before cloning `File` (#​5803]) ##### Changed - rt: reduce time to poll tasks scheduled from outside the runtime (#​5705], #​5720]) ##### Added - net: add uds doc alias for unix sockets (#​5659]) - rt: add metric for number of tasks (#​5628]) - sync: implement more traits for channel errors (#​5666]) - net: add nodelay methods on TcpSocket (#​5672]) - sync: add `broadcast::Receiver::blocking_recv` (#​5690]) - process: add `raw_arg` method to `Command` (#​5704]) - io: support PRIORITY epoll events (#​5566]) - task: add `JoinSet::poll_join_next` (#​5721]) - net: add support for Redox OS (#​5790]) ##### Unstable - rt: add the ability to dump task backtraces (#​5608], #​5676], #​5708], #​5717]) - rt: instrument task poll times with a histogram (#​5685]) #​5766]: tokio-rs/tokio#5766 #​5653]: tokio-rs/tokio#5653 #​5686]: tokio-rs/tokio#5686 #​5712]: tokio-rs/tokio#5712 #​5693]: tokio-rs/tokio#5693 #​5772]: tokio-rs/tokio#5772 #​5710]: tokio-rs/tokio#5710 #​5803]: tokio-rs/tokio#5803 #​5705]: tokio-rs/tokio#5705 #​5720]: tokio-rs/tokio#5720 #​5659]: tokio-rs/tokio#5659 #​5628]: tokio-rs/tokio#5628 #​5666]: tokio-rs/tokio#5666 #​5672]: tokio-rs/tokio#5672 #​5690]: tokio-rs/tokio#5690 #​5704]: tokio-rs/tokio#5704 #​5566]: tokio-rs/tokio#5566 #​5721]: tokio-rs/tokio#5721 #​5790]: tokio-rs/tokio#5790 #​5608]: tokio-rs/tokio#5608 #​5676]: tokio-rs/tokio#5676 #​5708]: tokio-rs/tokio#5708 #​5717]: tokio-rs/tokio#5717 #​5685]: tokio-rs/tokio#5685 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about these updates again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4wLjAiLCJ1cGRhdGVkSW5WZXIiOiIzNi4wLjAiLCJ0YXJnZXRCcmFuY2giOiJkZXZlbG9wIn0=--> Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1958 Reviewed-by: crapStone <crapstone01@gmail.com> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Motivation
In Java, thread dumps are an invaluable tool for debugging a stuck application in production. When a user requests a threaddump from a JVM, they receive text including the following information about each thread managed by the JVM:
RUNNABLE
,WAITING
, orBLOCKED
.This information is useful for debugging deadlocks.
Thread dumps provide a starting point for debugging deadlocks. A programmer can thread dump a deadlocked program, identify the
BLOCKED
threads, and inspect their call stacks to identify the resources they are blocked on.This PR is a first step towards bringing an analogous functionality to tokio: "task dumps". A task dump is a comprehensive snapshot of the state of a Tokio application. This proposal will allow tokio services to briefly pause their work, reflect upon their internal runtime state, and comprehensively report that state to their operator.
Solution
This PR implements a task dump mechanism for the current-thread runtime, that provides execution traces of each task. Execution traces are produced by instrumenting tokio's leaves to conditionally capture backtraces, which are then coalesced into a tree for each task.
For example, executing this program:
Will produce an output like:
Limitations
unimplemented!()
LocalSet
s orFuturesUnordered
.These limitations will be addressed in future PRs.
NOTE: The initial state of this PR only instruments
yield_now
. I will similarly instrument other leaf futures after this PR passes initial review.Related Reading
This PR addresses this feature proposal.
This feature is tracked by this stabilization tracking issue.
This document contrasts several potential approaches to tracing future state. This PR implements "📦 Backtrace the Leaves".