ThreadPool Blocking Mitigation #47366
Conversation
Review comment on src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs (outdated, resolved)
Test ASP.NET Core endpoints:

```csharp
public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    app.UseRouting();
    app.UseEndpoints(endpoints =>
    {
        endpoints.MapGet("/blocktp", context =>
        {
            var random = new Random();
            Task.Delay(random.Next(50, 500)).Wait();
            context.Response.WriteAsync("Hello World!").Wait();
            return Task.CompletedTask;
        });
        endpoints.MapGet("/async", async context =>
        {
            var random = new Random();
            await Task.Delay(random.Next(50, 500));
            await context.Response.WriteAsync("Hello World!");
        });
    });
}
```

Without the mitigation: [results chart]

With the mitigation: [results chart]
I'll take a look at the code on Monday, but just from your description, lots of warning bells going off in my head ;-)
Not denying that 😅 It mostly reduces the chance of becoming irrecoverable, rather than making blocking on the ThreadPool a performant option. E.g. currently if 3 blocking tests are run in sequence they all fail, as does an async one after:

Whereas after the change, the blocking tests still get timeouts and fail at the start (as all threads start blocked), but it then recovers and the async test still runs.
Would the tracking of call sites interfere with AssemblyLoadContext.Unload?
This looks really promising. I think the reflection part needs to be solved, but the idea that we would attempt to educate the thread pool about blocking via calls that pass through the BCL itself would handle a large set of problems. We may even have something like this behind an app context switch if others are uncomfortable with it being on by default.
Review comment on src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs (outdated, resolved)
Solved... Forgot it was the runtime and could just add
```csharp
private static object? t_currentWorkItem;

private static ThreadPoolBlockingQueue? s_processor;
private static HashSet<object>? s_blockingItems;
```
Is there any concern about how big this can get? I guess it's a function of the code since we're looking at types not work item instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only recording the "methods" (which includes state machine `Type`s for async methods) that initiate the blocking call stack, so it should hopefully stay small; if not, then the app has bigger issues...
Can OS scheduler hiccups or little bursts of activity end up adding innocent victims to this list? Could this lead to close to everything that the app runs being in this list eventually, once the app runs for a while?
The code will at some point need to explicitly make a call to `Task.Wait()`, `Task.Result`, `Task.WaitAll(...)`, or `.GetAwaiter().GetResult()` (or the `ValueTask` variants) on a task that doesn't yet have a result; so not sure they are innocent 😉

Regular OS blocking/sync calls that don't block on a Task shouldn't affect it (e.g. `.Read()`); but that's not so problematic, as those will eventually unblock themselves. Whereas "sync-over-async", which this aims to address, needs a second thread to unblock; and if there's no spare thread on the ThreadPool and threads are being blocked as fast as they are being injected, they will never unblock.
The main downside is it marks the entry point from the last yield (as it doesn't break the stack), so anything branching off that same yield point will be affected, until it itself yields, at which point it goes back to being a regular ThreadPool item.

However, it does mean mitigation patterns like:

```csharp
await Task.Run(() => BlockingCall());
```

will actually do something, scoping the ThreadPool mitigation at the lambda level rather than further back; and adding a yield:

```csharp
await Task.Yield();
BlockingCall();
```

will again scope the mitigation to the async method rather than further back in the stack.

This could be, for example, calling Azure KeyVault async methods via the `System.Security.Cryptography` classes, which have no async APIs, so it would need to be `.GetAwaiter().GetResult()` behind the scenes.
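To make that scoping concrete, here is a minimal, self-contained sketch; the `FetchSecret`/`FetchSecretAsync` names are invented for illustration, and only the `Task.Run` wrapping pattern comes from the comment above:

```csharp
using System;
using System.Threading.Tasks;

class ScopingExample
{
    // A typical "sync-over-async" call: a synchronous method that
    // blocks on an async operation via GetAwaiter().GetResult().
    static string FetchSecret()
        => FetchSecretAsync().GetAwaiter().GetResult();

    static async Task<string> FetchSecretAsync()
    {
        await Task.Delay(10); // stands in for real async I/O
        return "secret";
    }

    static async Task Main()
    {
        // Wrapping the blocking call in Task.Run moves the block onto a
        // fresh work item, so a mitigation like the one described above
        // would attribute the blocking to the lambda rather than to the
        // caller's entire async flow.
        string value = await Task.Run(() => FetchSecret());
        Console.WriteLine(value); // prints "secret"
    }
}
```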
I think we should think a bit about what we can do to make the fingerprinting a little better (can we do anything at JIT time or runtime?) and try to figure out if we need a heuristic for removing these blocking call sites from the list. That can be improved over time (I can think of some crazy ideas...) but we just need to make sure that the idea is sound (I think it is).
It's already currently scoped to sync-over-async, which I think is the big thing we need to tackle.
I wonder if we can self heal these blocking call sites by keeping some metrics of blocking call site execution and removing it from the list if multiple executions complete within some threshold (spitballing here) or don't make the same blocking call on re-execution.
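That spitballed self-healing heuristic might look something like the following sketch. The class name, thresholds, and methods are all invented here; a real version would live inside the ThreadPool's dispatch path rather than be called by user code:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch of the "self heal" idea: keep per-call-site
// timing, and drop a site from the blocking list once several
// consecutive executions finish under a fast threshold.
class BlockingSiteTracker
{
    private const int HealAfter = 3; // consecutive fast runs needed to heal
    private static readonly TimeSpan FastThreshold = TimeSpan.FromMilliseconds(10);

    private readonly ConcurrentDictionary<object, int> _fastRuns = new();
    private readonly ConcurrentDictionary<object, bool> _blockingSites = new();

    // Called when blocking is first observed for a call site.
    public void MarkBlocking(object site)
    {
        _blockingSites[site] = true;
        _fastRuns[site] = 0;
    }

    public bool IsBlocking(object site) => _blockingSites.ContainsKey(site);

    // Called after each execution of a tracked work item.
    public void RecordExecution(object site, TimeSpan elapsed)
    {
        if (!_blockingSites.ContainsKey(site)) return;

        // Count consecutive fast executions; a slow one resets the count.
        int fast = elapsed < FastThreshold
            ? _fastRuns.AddOrUpdate(site, 1, (_, n) => n + 1)
            : (_fastRuns[site] = 0);

        if (fast >= HealAfter)
            _blockingSites.TryRemove(site, out _); // "self heal"
    }
}
```

The same shape could also cover the "doesn't make the same blocking call on re-execution" variant by recording whether a blocking wait was seen during the run instead of (or alongside) elapsed time.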
Review comment on src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs (outdated, resolved)
One problematic area is work items that batch other items, basically hiding this knowledge from the thread pool (like the IOQueue in Kestrel and the Linux sockets implementation). The one-to-many relationship means that one bad inner work item causes all of them to be put into the blocking list. We'd need to detect these types of work items, or give them a way to opt out or to participate themselves in blocking detection.
Added test (from https://gist.github.com/benaadams/9290257de1c907b3ecafc750779e2621)
Review comment on src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolBlockingQueue.AnyOS.cs (outdated, resolved)
Review comment on src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolBlockingQueue.AnyOS.cs (outdated, resolved)
Thanks for suggesting an approach here. I skimmed the solution. On the one hand, it’s clever, and it has a nice impact on the hand-crafted problematic cases cited in the PR. On the other hand, I have a variety of initial concerns:
See (#47366 (comment)). If the identification of the workitem is too coarse grained then it can be problematic. I see that as a problem but maybe one we can come up with creative solutions to (or maybe we need to opt out in those cases)
This is problematic and concerns me as well. Ben and I spoke about this a little but haven't come up with anything as yet. I have some crazy ideas that would require @jkotas approval 😄.
Temporary deadlocks because the thread pool would grow anyways? This commit requests a new thread (f5702e8), which should signal to the thread pool that more threads are needed.
I feel like if you can solve the hashset problem, you can potentially solve this as well. We need a fast way to identify if a workitem is potentially blocking. Are there any free bits in the object header 😄?
Right but it aims at dramatically improving a problem that our customers have that isn't going away. Lots of code we have like this has these types of arbitrary thresholds (sometimes without knobs...)
👍🏾
Could drop QUWIs from it, as they are explicit calls so not necessarily in the common path, and it would still pick up
The current issue is one of education. Solving that by introducing a new problem that requires education isn't great.
Why would it grow? The starvation detection mechanism isn't paying attention to this, is it? As far as it's concerned, the work items that were queued were picked up successfully. In fact it seems like this could actually decrease the number of threads that get injected, if hill climbing sees the work items that check IsRequired and immediately exit as successful completions.
I don't know this part of the code base well, but it seems like it could be changed to know, right?
I don't know how widespread this would need to be, but it would require a public API. This novel batching work item thing we do basically hides information from the thread pool in our own queue, because there's more overhead queueing directly to the thread pool. It's logical that something to help detect blocking at the work item level is inadequate because we're lying to the thread pool 😢
Yes; and currently if the blocking took more than 500 ms, making 4 rps would DoS the server, as that's faster than the thread injection rate (increase the rate if the blocking takes less time). If the blocking can't be avoided (hidden behind an API) then adding
or to tag the method itself in metadata rather than use a HashTable?
They can be made more aware; periodically scanning
Removed that as the common issue (from ASP.NET anyway) is probably on a Task/ValueTask root rather than QUWI.
Moved to method so both now use the same.
Only one is
Definitely does! Which is a common approach in many runtime solutions (Rust, JVM, Scala ZIO etc): 3 thread pools, 1 I/O, 1 CPU, 1 Blocking. .NET already effectively has an I/O thread pool; and hill climbing mostly solves the issue with regular blocking OS calls being on the same thread pool as CPU items, as it will eventually inject another thread and those items unblock themselves.

The issue in .NET is with "sync-over-async", as those items do not unblock themselves and need another thread, so they must be rate limited below the rate of thread pool injection (assuming we don't get fancy and have heap thread stacks, so they can be removed off the threads and reattached when ready via stack bending #4654, like golang, Java's Project Loom or Windows' fibres).

The issue with a blocking thread pool elsewhere, whether spawn_blocking in Rust Tokio's scheduler or Scala ZIO's effectBlocking, is you need to know upfront, before calling, that it's going to block, and then use that construct to queue it to the blocking pool (much like you'd do in .NET with a concurrency-limited TaskScheduler). However, it's hard to know ahead of time that there is a

The approach here is to automatically detect those problematic paths and then try to mitigate them, by moving them to effectively a side pool. Of course the implementation can probably be refined in various ways...
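The concurrency-limited TaskScheduler mentioned above can be built today from the BCL's `ConcurrentExclusiveSchedulerPair`. This sketch (the demo names are mine) caps known-blocking work at 2 concurrent tasks so the default pool keeps threads free for non-blocking work:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class BlockingPoolDemo
{
    // The pair's ConcurrentScheduler enforces the max concurrency level,
    // acting as a crude "blocking pool" layered on the default pool.
    private static readonly TaskScheduler BlockingScheduler =
        new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, maxConcurrencyLevel: 2)
            .ConcurrentScheduler;

    static Task RunBlocking(Action blockingWork) =>
        Task.Factory.StartNew(blockingWork, CancellationToken.None,
            TaskCreationOptions.DenyChildAttach, BlockingScheduler);

    static async Task Main()
    {
        int peak = 0, current = 0;
        var tasks = new Task[8];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = RunBlocking(() =>
            {
                int now = Interlocked.Increment(ref current);
                InterlockedMax(ref peak, now);
                Thread.Sleep(50);            // stands in for a blocking call
                Interlocked.Decrement(ref current);
            });
        }
        await Task.WhenAll(tasks);
        Console.WriteLine(peak); // at most 2, given the scheduler cap
    }

    // Lock-free max: raise 'location' to 'value' if it is lower.
    static void InterlockedMax(ref int location, int value)
    {
        int snapshot;
        while ((snapshot = Volatile.Read(ref location)) < value &&
               Interlocked.CompareExchange(ref location, value, snapshot) != snapshot) { }
    }
}
```

The caller still has to know upfront which work is blocking, which is exactly the limitation described above; the PR's approach tries to detect that automatically instead.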
I think that detecting sync-over-async waiting and doing something when it is detected has merit as an idea. Blacklisting the currently executing threadpool workitem sounds problematic, as @stephentoub suggested. Can we do something less problematic instead? For example, can we tell the threadpool to take the thread stuck on sync-over-async out of circulation temporarily and inject new threads more aggressively?
Effectively increase
Does this reduce the degree of parallelism for workloads that make use of blocking waits? That seems very undesirable. Blocking waits do occur in real-world apps and are, in some cases, a good solution. For example, to implement an interface it might be required to block. Or, blocking happens in an operation that does not consume many threads and you want to avoid infecting the entire call chain, so you just block on a task. It would be unfortunate if such workloads were to suffer.
Workloads like that, which block thread pool threads, already suffer. This technique tries to blame the right stack, which is hard to do, but I think it's extremely promising as it does allow you to manually break up the stack to work around things (Task.Run(...).Wait()).
This is generally true for scalable request/response workloads like you find in a typical ASP.NET app. There are app types where it's not true (single user, where blocking is used for simplicity, etc.), and the proposed solution will throttle such workloads artificially.
OK, so a single user application might want to opt out of this so that it can block without being throttled by the limited concurrency. Just so I have a better picture of this, what does the code look like in a single user app that uses blocking waits on thread pool threads for parallelism but has enough work to observe problems with limited concurrency?
I don't have a comprehensive answer for that, but we do have a desktop app that occasionally hits it. It's an email client that can have multiple accounts connected to different servers. For servers that communicate over the IMAP protocol we rely on pretty complex logic that involves System.IO.Pipelines, and occasionally, due to API constraints, we resort to some sync-over-async patterns (with cancellation tokens that would interrupt the operation if necessary). The number of such tasks scales with the number of accounts that are set up for synchronization in the application. Not all servers behave the same, so even though some of these operations can be performed really quickly on one server, they may block on a different one.
Thanks for that. I don't think this approach is bulletproof by any means, but I think it gives some insight into what we might be able to accomplish, and explores scenarios that end up being punished unnecessarily. I wonder if there's a staged approach here where there are thresholds for being put into the limited concurrency queue (let's call it the sunken place) and another threshold to get out of said place. I could see that mitigating some of the problems described here (though there's a slew of other problems), though it might lead to thrashing if your code ends up in and out of this list too frequently.
Notifying the ThreadPool of executing work (it didn't do this originally) looks to allow ... Also changed to disengage when the blocking pressure has subsided. So the mechanism changed to:
*Concurrency is limited to ThreadCount - 1 so there is always a ThreadPool thread available for non-blocking work.
Many workloads for which tasks were originally introduced, actually. Consider, for example, parallelized divide and conquer. A worker partitions its work into multiple pieces, queues identical workers to process N - 1 of those pieces, processes its own, and then waits for those other workers to complete. Also think about algorithms that require coordination between those pieces, such as various operations used by PLINQ (e.g. OrderBy) that needs to block each of the workers until they reach a common point so that they can exchange data and all continue on, and what would happen if one of the workers was throttled indefinitely.
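For reference, the divide-and-conquer shape being described could be sketched as follows (the `Sum` example is invented for illustration): each worker queues a sibling for half of its range, processes the other half itself, then blocks on the sibling with `Wait()`, so throttling those work items could stall the whole computation.

```csharp
using System;
using System.Threading.Tasks;

class DivideAndConquer
{
    // Recursively sum a range: split, queue a sibling worker for the
    // left half, process the right half, then block waiting for the
    // sibling; the blocking Wait() happens on a thread pool thread.
    static long Sum(int[] data, int lo, int hi)
    {
        if (hi - lo <= 1000)
        {
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) / 2;
        Task<long> left = Task.Run(() => Sum(data, lo, mid));
        long right = Sum(data, mid, hi);
        left.Wait(); // blocking wait on a sibling work item
        return left.Result + right;
    }

    static void Main()
    {
        int[] data = new int[100_000];
        for (int i = 0; i < data.Length; i++) data[i] = 1;
        Console.WriteLine(Sum(data, 0, data.Length)); // prints 100000
    }
}
```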
I think that's the wrong direction, at least for the foreseeable future. If we decided to go with an approach like the one proposed in this draft, I have trouble seeing it as anything other than opt-in, given all of the previously cited concerns: the potential for introducing deadlocks where they weren't previously, the potential for casting way too wide a net, the potential for negatively impacting the throughput of unrelated workloads, etc. If such a mitigation was added as opt-in, and if enough testing demonstrated it was a good decision for certain workloads, then it could potentially be automatically opted-in for those, e.g. if you felt ASP.NET should always turn this on. Even there, though, I'd be surprised if in the current configuration that was a wise choice.
Today, do the implementations of those end up doing Task.Wait?
Yes, I have similar fears, but I'm looking for solutions. Maybe this isn't the one, but there are aspects of this that seem interesting. Say we added a public API to enable this and it was opt-in; I could see the potential for adding a piece of middleware or filters that enables this on the current work item before the blocking code executes. Something like this:

```csharp
[HttpGet("/blocking")]
[EnableBlockingMitigation] // This name is terrible
public string BlockingLegacyCodez()
{
    return new WebClient().DownloadString(...); // I know we're never supposed to use this API
}
```

That filter could yield to the thread pool to get off the current work item and enable mitigation on a unique work item that's closer to this code. The equivalent of doing this:

```csharp
[HttpGet("/blocking")]
public async Task<string> BlockingLegacyCodez()
{
    await Task.Yield(); // Make a new work item
    ThreadPool.EnableBlockingWorkitem();
    return new WebClient().DownloadString(...); // I know we're never supposed to use this API
}
```

Thoughts?
There are a variety of patterns folks employ. Sometimes it's:

```csharp
Task t = Task.Run(() => ...);
... // do work
t.Wait();
```

Sometimes it's:

```csharp
Task[] tasks = new Task[...];
for (int i = 0; i < tasks.Length; i++)
{
    tasks[i] = ...;
}
Task.WaitAll(tasks);
```

Sometimes it's:

```csharp
Parallel.For(...); // or Invoke or ForEach
```

Sometimes it's:

```csharp
var ce = new CountdownEvent(count);
for (int i = 0; i < count; i++)
{
    ThreadPool.QueueUserWorkItem(_ =>
    {
        try { ... } finally { ce.Signal(); }
    });
}
ce.Wait();
```

Etc. Both Parallel.* and PLINQ use and wait on Tasks internally.

FWIW, that will use the sync methods on HttpClient now.
From a "make sure we don't negatively affect unexpected things" perspective, I'd be happier with that, assuming it wouldn't require anything in the runtime to do reflection on everything to look for things that might have such an attribute. It of course still requires someone to know there's a problem, what the problem is, and what solutions there are for addressing it: at that point, we're limiting ourselves to the subset of problems where the developer is prevented from actually doing the "right thing" (e.g. they're forced to implement a sync interface and have only async tools at their disposal to do so) rather than the subset where they don't realize they're doing the "wrong thing".

From a complexity and impact perspective, while I do understand the appeal of a detect-bad-actors-and-throttle-them approach, I still think we should first be looking at and testing out simpler approaches, like tweaks to the thread injection strategy we've talked about previously (e.g. ramp up beyond ProcessorCount faster than we do today).
I'm not against this approach, and it is much simpler, but I'm worried we'll have a different problem than we have today at the global thread pool scale if that's all we do. But maybe we can make it more clever by combining this with that strategy: record the blocking operation and decommission a thread pool thread from the accounting. I do worry about increased contention on the global queue with this strategy, which will negatively impact performance as well (we already see this today on machines with lots of cores).
I don't think that's true. I think we need some knobs that workloads can use to change behavior; right now what we have works well when everyone plays ball but is quite catastrophic once there's a bad actor somewhere in the system. In that way it might not be the developer that needs to make the choice, but potentially framework authors can have more granular control over things.
I'm thinking about how this could be used to only dispatch blocking ASP.NET route delegates to the blocking queue without ever needing to dispatch non-blocking route delegates. It's too bad we have to schedule a new work item to see if a given continuation is blocking in isolation. Adding a

What if we had an API that allows you to tell the ThreadPool to track if there's any blocking after you call it for a given async continuation?[1] The API would take a key (maybe derived from the route pattern in ASP.NET's case). The next time you get to the same place (matched the same route or whatever), you could explicitly query the ThreadPool using the same key. If it blocked before, only then would you need to yield to the blocking queue, and you could do so explicitly.[2] This way, you don't need to yield after matching the route on every request.

The other upside of a more explicit API that allows framework code to query if a given continuation is blocking is that it would give the framework more options than just scheduling the blocking work item to the blocking queue. Maybe, depending on the configuration, ASP.NET would skip the work item and respond with a 503 if a blocking-concurrent-request limit is hit. This is similar to what Microsoft.AspNetCore.ConcurrencyLimiter does. Maybe we could enable it by default for blocking routes.

1: This would require blocking detection for more than just the current work item, but through the entire async flow until the blocking detection is explicitly stopped. In the case of a request delegate, ASP.NET could stop the detection once the request delegate's Task completes. It probably makes sense for this API to use
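No such API exists; purely to make the proposal above concrete, its shape might look something like this sketch (the interface, method names, and the in-memory fake are all invented):

```csharp
using System;
using System.Collections.Concurrent;

// Rough shape of the hypothetical key-based blocking-tracking API.
// A real version would live on ThreadPool, with the blocking detection
// done by the runtime across the whole async flow, not by callers.
interface IBlockingTracker
{
    void StartTracking(string key);              // begin watching an async flow
    void StopTracking(string key, bool blocked); // flow completed; did it block?
    bool BlockedBefore(string key);              // query before dispatching
}

// Minimal in-memory fake, just enough to show the intended call pattern:
// a framework keys by route, queries before dispatch, and only yields to
// the blocking queue (or returns 503) for routes that blocked previously.
class FakeBlockingTracker : IBlockingTracker
{
    private readonly ConcurrentDictionary<string, bool> _history = new();

    public void StartTracking(string key) { /* runtime would hook detection here */ }

    public void StopTracking(string key, bool blocked) => _history[key] = blocked;

    public bool BlockedBefore(string key) =>
        _history.TryGetValue(key, out bool blocked) && blocked;
}
```

In ASP.NET's case the key could be the route pattern: if `BlockedBefore("/blocktp")` returns true, dispatch that delegate to the blocking queue; `"/async"` never pays the extra yield.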
Co-authored-by: David Pine <david.pine@microsoft.com>
Draft Pull Request was automatically closed for inactivity. It can be manually reopened in the next 30 days if the work resumes.
Implementation of a technique, developed with @davidfowl, to prevent total ThreadPool starvation due to blocking Tasks on ThreadPool threads.

Main aim is to reduce the chance of the process becoming irrecoverably unresponsive from "sync-over-async" in the ThreadPool, rather than making blocking on the ThreadPool a performant option.

Mechanism:

IThreadPoolWorkItem

*Concurrency is limited to ThreadCount - 1 so there is always the availability of a ThreadPool thread for non-blocking work. This allows non-blocking items to continue executing on the ThreadPool rather than also being blocked waiting for the blocking items to complete.

Test gist; also added as a ThreadPool test.

/cc @kouvel @stephentoub