-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokio Util LongRunningTaskDetector. (https://github.com/tokio-rs/console/issues/150) #6256
Conversation
Before you implement a significant large new feature, it is really important that you talk to us first. This is crucial to ensure that you don't waste a lot of time on something that can't be merged. The proposal here appears to have significant overlap with the taskdump proposal. However, it is special-cased to also include a detection mechanism for blocking tasks, and to trigger the task dump only when that detection triggers. I think that for something like this, it makes more sense to separate the task dump functionality from the detection of blocking tasks, so that the task dumps can be used in other cases as well. The specific way that this is implemented will not work for the reasons that @sfackler points out. However, when @jswrenn investigated the different ways that task dumps can be implemented, he outlined one approach that is similar to this. However, for the reasons outlined in this document, we used a different implementation in the end. That said, I think it makes sense to add something similar to this in the future to handle "very stuck tasks", as is necessary for getting stack traces into tasks that are blocking the thread like you are doing here. Though, I'm not sure if we are ready to do that right now. At least, there are other improvements in the taskdump area that would be higher priority to me. You may want to open a thread on our discord channel, or a new issue, for discussion of the best way forward here. |
Thanks for your feedback, and for the extra info. This is a "RFC PR" of some fun I had on a flight today. Even if this will not be merged, I would not consider it a waste of time :-). I agree with separating the thread stack dump from the detection mechanism. I was not able to quickly find a good way to dump stack traces in rust for a bunch of threads, but if anyone can point me to a solution I would be happy to update the implementation. (or merge a PR against this PR) |
This is the proposal that @jswrenn suggested for that:
This was taken from the latest edit on Feb 15 of the top-level post in #5457. (It has since been edited.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to move forward with the taskdump component of this, then I think that the best way to do that would be to modify the taskdumps in the main Tokio crate to have special handling for the case where there is a very stuck task that isn't returning from poll
. In that case, it would use signals or ptrace to get a stacktrace into that task, instead of the usual mechanism for tasks that are well-behaved.
It looks like you moved the long-running task detector into the Tokio crate, and removed the code for capturing taskdumps of blocking tasks. My previous comment was mostly referring to taskdumps of blocking tasks. I don't think the long-running task detector is a good fit for the main Tokio crate. We generally try to keep debugging features outside of Tokio when that is possible, and in this case, that is possible. Perhaps it would be a better fit for tokio-metrics, or an entirely new crate instead? |
I am still experimenting with the task dump code, but based on what I am seeing I am not sure it is practical for LRTD due to the fact that LRTD does detect "very stuck tasks" (and it is important that it does) and from what I see at that is not something supported? I think a "traditional stack trace" obtained via a productionized version of what I use in one of the unit tests is the way to go. (This can happen at a later stage)
I agree, lrtd is an independent component, I can easily move it to a separate crate, does not really fit into a "metrics" crate. |
Right, we do not support detection of very stuck tasks right now, and as the article points out, it isn't as critical as you can debug them with gdb. But I do think that it would be an interesting feature to have (with one of the implementations outlined in #6256 (comment)). If nothing else, we probably need to do something more intelligent than just hanging the task dump forever like we do today.
If it's not in the main Tokio crate, then I don't object to the use of the rand crate. Or, of course, you can just copy its implementation. |
@Darksonn I am thinking of best location for now is tokio-util as a lrtd feature. Not sure it is worth creating a separate crate for this... What do you think? The implementation allows a custom handler to be invoked on detection, so any future utility/approach can be easily used. |
I think that the reality is that none of our crates have anything similar to your lrtd utility, and there's no great place to put it. I don't think tokio-util is a good fit. It's a crate with utilities for writing applications, not debugging tools. Your original motivation was to fix tokio-rs/console#150. Perhaps it makes more sense to pursue a lint in tokio-console? A lint in that tool would be much less "ad hoc" and more easily discoverable, since anyone using tokio-console would get it automatically. Furthermore, tokio-console is already my standard recommendation for debugging blocking tasks. Alternatively, the tokio-metrics crate provides something called task metrics, and it would be useful to have a task metric for whether a specific task is blocking. Of course, I acknowledge that this would require an entirely different implementation, and that you have already spent time on the current one ... but something using a different implementation based on task metrics would be a good fit for tokio-metrics. As it stands right now, I don't think I can accept the current implementation strategy for detecting long running tasks anywhere in Tokio. If you want to publish it, you will need to make your own crate for it. Therefore, I will close this PR. If you want to pursue an implementation of capturing taskdumps for blocking tasks, then please create a new PR for that. An implementation of that is welcome and I would be happy to review it. |
I will provide some extra context here on how we have been using LRTD (for python and c++) for the past few years. We are using it continuously in production, and some of the decisions (on top of emitting debug detail) one might take on blocking detection is to report the node unhealthy so that the container can be recycled sooner. The emitted stack trace details enable owners to quickly identify the issue and push out a fix, further enhancing the scalability and responsiveness of their services. The blocking event counts are used to test the health of the canary tier, etc... (I see here your point about tokio-metrics) Lrtd is designed to have very small(configurable) overhead so that it can be safely run in production, it is a little bit more than a debug tool. @Darksonn Right now I am thinking tokio-metrics, let me know what you think... |
Will continue this in: tokio-rs/tokio-metrics#57 |
Thanks for the proof of concept. I do think solving the "detect blocking tasks" feature would be nice to get. I looked through this, and if I understand correctly, it will detect if all runtime threads are 100% blocked, but if a single thread is blocked, it won't. It also doesn't detect tasks that "block" for short periods (say 100ms). Ideally, we will develop a strategy that handles both of these cases, and I believe it will require work in the Tokio runtime itself. At a high level, an epoch-based strategy will work best. Each worker thread has a monotonically increasing number that represents its epoch. The worker thread will increment its epoch every so often 1. There will then be a "monitoring" thread that periodically checks that each worker's epoch increases regularly. This strategy would at least detect that there are tasks with slow poll times with minimal overhead. Identifying the exact ones that are slow would require some additional work. However, your utility to detect a blocked runtime is generally useful. It would be better to release it as your crate instead of pulling it in tokio-util or tokio-metrics, as we have limited maintenance bandwidth and want a more comprehensive solution in the long term. If you'd like to work on the runtime to detect slow polls, please try pinging me on Discord, and we can talk more about that. Footnotes
|
Motivation
The motivation for this pull request is to provide a solution for the problem described at: tokio-rs/console#150
Blocking calls in the async workers lead to scalability issues, being able to detect them so that code can be fixed is very important for the scalability of an async-io service.
Solution
LongRunningTaskDetector uses a sampling black box approach to detect when processing in the tokio runtime in blocked for more than a configurable amount.
When blocking is detected the stack traces of the runtime worker threads can be dumped (action is plugable), to allow us to identify what is happening in the worker threads.
On detection, details can contain precise info about the blocking culprit and will look like:
This is the output from the unit tests (run_blocking_stuff).
Not sure I put this in the right place, and there is likely some polishing left to do, but this should be good enough for some initial feedback.