[arm64] Perf_FileStream.FlushAsync benchmark hangs on Debian 11 #67545
Tagging subscribers to this area: @dotnet/area-system-io

Issue Details

Reported by @carlossanlop offline. Carlos has hit this issue on Debian 11 arm64 with WSL2.

Repro:

```
git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py --architecture arm64 -f net7.0 --filter '*Perf_FileStream.FlushAsync*'
```

In theory it could be an IO issue, but we have not touched FileStream for a few months, and I suspect that it's a runtime bug similar to #64980. @janvorli @jkotas, what would be the best way to determine the reason of the hang on Linux?
Can you get a dump at the point of hang?

Interestingly, my attempt to run the benchmarks on Ubuntu 18.04 arm64 on my Odroid N2 device (with AOT in my case) has hung in the …

I am hitting the same hang on my Surface Laptop 2 when using WSL2/Ubuntu 20.04. This is x64. I can repro it consistently while running only this benchmark. I'll connect with @adamsitnik to collect a dump.
I uploaded the dump file to the location where we're posting our manual run results. And I observed a peculiar behavior: …

@janvorli I assume you were not using WSL?
ReadAsync caused a hang on my WSL/Ubuntu 20.04 config as well. I uploaded dumps from that benchmark too.

Right, it was a physical device in my case.

I just noticed that I have a hang in ReadAsync in another benchmark run that I started yesterday on the same device, this time in an Alpine Docker container.

FYI - I hit this hang in: …

@janvorli is there any chance you could take a look at the dump provided by Jeff? I am quite sure it's not a …
@adamsitnik sure, I'll take a look. |
@jeffhandley how did you take the dumps? The readasync one is mostly readable, but it doesn't contain enough information to view managed method names. The other two are somehow broken - the stack traces are incomplete and SOS is unable to resolve any managed methods.
I used …

Can you please try to take a full dump (with the …)?

I hit this hang on x64. I'll share a link to dumps @janvorli as soon as they're uploaded. I'm not sure how helpful it will be -- there is really only one active thread, …

I was able to reproduce this 1 out of 5 tries with …
We are also seeing this issue in the performance runs. We don't have any dumps, but this was occurring with the FlushAsync, WriteAsync, and ReadAsync Perf_FileStream benchmarks in our Linux x64 tests on Ubuntu.1804.
I have investigated the issue, and the problem is the thread pool not processing a work item. As you can see from the SOS output below, there are no worker threads and the work request queue has 0 items, yet there is a queued work item. At this point, the process is waiting for that work item to complete and nothing happens. …
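The broken state described above (a queued work item, no worker threads, an empty work request queue) violates the invariant the dispatcher relies on: whenever the queue is non-empty, at least one thread request should be outstanding so some worker will come looking. A minimal sketch of that invariant in Python (not the runtime's C# implementation; all names here are hypothetical):

```python
from collections import deque

class ToyPool:
    """Toy model of the enqueue/dispatch handshake: every enqueue must be
    paired with an outstanding thread request, or a queued item can sit
    forever with no worker coming to look for it."""
    def __init__(self):
        self.queue = deque()
        self.outstanding_thread_requests = 0

    def enqueue(self, work):
        self.queue.append(work)
        self.ensure_thread_requested()

    def ensure_thread_requested(self):
        # In the real pool this is a CAS-guarded counter; a plain int here.
        if self.outstanding_thread_requests == 0:
            self.outstanding_thread_requests += 1

    def healthy(self):
        # The invariant the dump lets you check: queued work implies an
        # outstanding thread request (active workers omitted in this toy).
        return not self.queue or self.outstanding_thread_requests > 0

pool = ToyPool()
pool.enqueue("item")
assert pool.healthy()

# The hang state from the dump: an item in the queue, zero requests.
pool.outstanding_thread_requests = 0   # simulated lost wakeup
assert not pool.healthy()              # the state the SOS output showed
```

In the dump, this is exactly the contradiction visible at a glance: a non-empty work item queue alongside a zero-length work request queue.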
Tagging subscribers to this area: @mangod9

Issue Details

Reported by @carlossanlop offline. Carlos has hit this issue on Debian 11 arm64 with WSL2.

Repro:

```
git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py --architecture arm64 -f net7.0 --filter '*Perf_FileStream.FlushAsync*'
```

In theory it could be an IO issue, but we have not touched FileStream for a few months, and I suspect that it's a runtime bug similar to #64980. @janvorli @jkotas, what would be the best way to determine the reason of the hang on Linux?
…ixed. Should mitigate issue dotnet#2366
…to be fixed. (dotnet#2371)" - Depends on dotnet/runtime#68171 - This reverts commit d8f1c47 from PR dotnet#2371
There is a case where, on a work-stealing queue, `LocalPop()` and `TrySteal()` may both fail when running concurrently, leading to a state where there is a work item but no threads are released to process it. Fixed to always ensure that there is a thread request when there was a missed steal. Also, when `LocalPop()` fails, the thread does not attempt to pop anymore, which can be a problem if that thread is the last one looking for work items. Fixed to always check the local queue. Fixes dotnet#67545
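A rough sketch of this failure mode and the fix, in Python rather than the runtime's actual work-stealing implementation: a try-lock stands in for the CAS contention between the owner and a thief, and all names are hypothetical.

```python
import threading

class WorkStealingQueue:
    """Toy deque: the owner pops from the tail, thieves steal from the head.
    Both sides give up instead of waiting when they hit contention, which is
    what makes the lost wakeup possible."""
    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def local_push(self, item):
        with self._lock:
            self._items.append(item)

    def local_pop(self):
        # Owner uses a try-lock: under contention it fails rather than waits.
        if not self._lock.acquire(blocking=False):
            return None                       # contended -> give up
        try:
            return self._items.pop() if self._items else None
        finally:
            self._lock.release()

    def try_steal(self):
        # Returns (item, missed_steal). A missed steal means a thief backed
        # off due to contention while an item may still be queued.
        if not self._lock.acquire(blocking=False):
            return None, True
        try:
            return (self._items.pop(0) if self._items else None), False
        finally:
            self._lock.release()

q = WorkStealingQueue()
q.local_push("work")

# Simulate a concurrent owner/thief holding the lock: both paths now fail.
q._lock.acquire()
assert q.local_pop() is None                  # owner gives up
item, missed = q.try_steal()
assert item is None and missed                # thief misses the steal
q._lock.release()
# Buggy behavior: nobody retries, so "work" is stranded with no one coming.

# The fix, in two parts: a missed steal issues a thread request, and a
# thread whose LocalPop() failed re-checks the local queue.
thread_requests = 0
if missed:
    thread_requests += 1                      # ensure a worker will look again
assert q.local_pop() == "work"                # the re-check finds the item
```

The key point the sketch illustrates: each path failing is individually harmless, but if both fail and neither compensates (request a thread, or retry the pop), the queued item is orphaned until some unrelated work arrives.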