Fix a race condition in the thread pool #68171
Tagging subscribers to this area: @mangod9
…to be fixed. (dotnet#2371)" - Depends on dotnet/runtime#68171 - This reverts commit d8f1c47 from PR dotnet#2371
Is a unit test feasible here? If nothing else, something derived from the perf test, even if it would only catch this bug sporadically.
LGTM, thank you!
There is a case where on a work-stealing queue, both `LocalPop()` and `TrySteal()` may fail when running concurrently, and lead to a case where there is a work item but no threads are released to process it. Fixed to always ensure that there's a thread request when there was a missed steal. Also when `LocalPop()` fails, the thread does not attempt to pop anymore and that can be an issue if that thread is the last thread to look for work items. Fixed to always check the local queue. Fixes dotnet#67545
The test appears to occasionally time out on Linux_musl x64. I'd like to get this fix into preview 4 if possible. I think I'll disable the test for this PR with an active issue and investigate that separately; if it's a product issue, it would likely be a different one.
You can add
It didn't fail in the first set of runs and failed in the second set. I'm not sure where else it may fail occasionally.
It might just be a few too many iterations on some machines, especially with a checked runtime. I'll try reducing the iterations.
Combined with the previous CI runs and the latest harmless commit, I'll go ahead and merge to make it into preview 4.
static string CreateFileWithRandomContent(int fileSize)
{
    string filePath = Path.GetTempFileName();
As an aside, it's better to use the TempFile class, which allows the using pattern instead of writing out try/finally. It also avoids the problematic Path.GetTempFileName(), which fails if someone else has leaked temp files.
It has a ctor that takes an array, too.
public TempFile(string path, byte[] data)
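To illustrate the suggestion, here is a minimal sketch of the using pattern with a disposable temp-file wrapper. `ScratchFile`, `FilePath`, and `ScratchFileDemo` are hypothetical names made up for this example; the class only mimics the `(string path, byte[] data)` constructor shape quoted above and is not the actual `TempFile` test utility.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a TempFile-like helper: writes the data on
// construction and deletes the file on Dispose.
public sealed class ScratchFile : IDisposable
{
    public string FilePath { get; }

    public ScratchFile(string path, byte[] data)
    {
        FilePath = path;
        File.WriteAllBytes(path, data);
    }

    public void Dispose()
    {
        // Best-effort cleanup; File.Delete does not throw if the file is already gone.
        try { File.Delete(FilePath); } catch (IOException) { }
    }
}

public static class ScratchFileDemo
{
    // Builds a file with random content without calling Path.GetTempFileName().
    public static ScratchFile CreateWithRandomContent(int fileSize)
    {
        byte[] data = new byte[fileSize];
        new Random().NextBytes(data);
        string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        return new ScratchFile(path, data);
    }
}
```

A caller would then write `using (ScratchFile file = ScratchFileDemo.CreateWithRandomContent(1024)) { /* read file.FilePath */ }`, and the file is deleted when the block exits, with no explicit try/finally.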
Oh ok, I'll fix it up in a follow-up PR
…to be fixed. (#2371)" (#2381) - Depends on dotnet/runtime#68171 - This reverts commit d8f1c47 from PR #2371
@kouvel there seem to be some persistent regressions in ping-pong tests that may be a result of this change: dotnet/perf-autofiling-issues#4817 Is this expected?
There is a case where, on a work-stealing queue, both `LocalPop()` and `TrySteal()` may fail when running concurrently, leaving a work item queued with no thread released to process it. Fixed by always ensuring that there is a thread request after a missed steal. Also, when `LocalPop()` fails, the thread does not attempt to pop anymore, which can be an issue if that thread is the last one looking for work items. Fixed by always checking the local queue. The issues were introduced by #64834.

Fixes #67545
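The two fixes described above can be sketched with a toy model. This is not the runtime's implementation: `ToyWorkStealingQueue`, `ToyDispatcher`, `TryEnter`, `Exit`, and `IsThreadRequested` are hypothetical names, work items are plain integers, and contention is modeled with a simple busy flag so that a missed steal can be reproduced deterministically.

```csharp
using System.Collections.Generic;
using System.Threading;

// Toy work-stealing queue (hypothetical; not the runtime's WorkStealingQueue).
public sealed class ToyWorkStealingQueue
{
    private readonly List<int> _items = new List<int>();
    private int _busy; // 1 while some thread is operating on the queue

    // Exposed so a demo can simulate "the owner is in the middle of an operation".
    public bool TryEnter() => Interlocked.CompareExchange(ref _busy, 1, 0) == 0;
    public void Exit() => Volatile.Write(ref _busy, 0);

    public void LocalPush(int item)
    {
        SpinWait spinner = new SpinWait();
        while (!TryEnter()) spinner.SpinOnce();
        try { _items.Add(item); } finally { Exit(); }
    }

    public bool LocalPop(out int item)
    {
        item = 0;
        SpinWait spinner = new SpinWait();
        while (!TryEnter()) spinner.SpinOnce();
        try
        {
            if (_items.Count == 0) return false;
            item = _items[_items.Count - 1];
            _items.RemoveAt(_items.Count - 1);
            return true;
        }
        finally { Exit(); }
    }

    // Non-blocking: if the queue is busy, report a missed steal instead of waiting.
    public bool TrySteal(out int item, ref bool missedSteal)
    {
        item = 0;
        if (!TryEnter()) { missedSteal = true; return false; }
        try
        {
            if (_items.Count == 0) return false;
            item = _items[0];
            _items.RemoveAt(0);
            return true;
        }
        finally { Exit(); }
    }
}

public sealed class ToyDispatcher
{
    private int _threadRequested; // 0 or 1

    public bool IsThreadRequested => Volatile.Read(ref _threadRequested) == 1;
    public void EnsureThreadRequested() => Interlocked.Exchange(ref _threadRequested, 1);

    public bool TryDequeue(ToyWorkStealingQueue local,
                           IReadOnlyList<ToyWorkStealingQueue> all, out int item)
    {
        // Always check the local queue on every call.
        if (local.LocalPop(out item)) return true;

        bool missedSteal = false;
        foreach (ToyWorkStealingQueue victim in all)
        {
            if (ReferenceEquals(victim, local)) continue;
            if (victim.TrySteal(out item, ref missedSteal)) return true;
        }

        // A missed steal means work may still be queued: ensure a thread request.
        if (missedSteal) EnsureThreadRequested();
        item = 0;
        return false;
    }
}
```

In this model, a dequeue that comes back empty only because a victim queue was busy still leaves a thread request behind, so some worker will look again later rather than leaving the item stranded.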