[arm64] Perf_FileStream.FlushAsync benchmark hangs on Debian 11 #67545

Closed
Tracked by #67976
adamsitnik opened this issue Apr 4, 2022 · 21 comments · Fixed by #68171
Labels: arch-arm64, area-System.Threading, os-linux (Linux OS, any supported distro)

@adamsitnik
Member

adamsitnik commented Apr 4, 2022

Reported by @carlossanlop offline. Carlos has hit this issue on Debian 11 arm64 with WSL2.

Repro:

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py --architecture arm64 -f net7.0 --filter '*Perf_FileStream.FlushAsync*'

Source:

using (FileStream fileStream = new FileStream("repro.txt", FileMode.Create, FileAccess.Write, FileShare.Read, 4096, FileOptions.None))
{
    for (int i = 0; i < 1024; i++)
    {
        fileStream.WriteByte(default);

        await fileStream.FlushAsync();
    }
}

In theory it could be an IO issue, but we have not touched FileStream for a few months and I suspect that it's a runtime bug similar to #64980.

@janvorli @jkotas what would be the best way to determine the cause of the hang on Linux?

@adamsitnik adamsitnik added arch-arm64 os-linux Linux OS (any supported distro) labels Apr 4, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.IO untriaged New issue has not been triaged by the area owner labels Apr 4, 2022
@ghost

ghost commented Apr 4, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.


@adamsitnik
Member Author

[screenshot attachment: MicrosoftTeams-image (1)]

@danmoseley
Member

Can you get a dump at the point of hang?
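(For reference, a hedged sketch of one way to capture such a dump with the dotnet-dump global tool; the tool and its ps/collect commands are standard, but the PID and local setup are placeholders, not details taken from this thread.)

# Install the diagnostic tool once (assumes the .NET SDK is on PATH)
dotnet tool install --global dotnet-dump

# List running .NET processes, then capture a dump of the hung benchmark process
dotnet-dump ps
dotnet-dump collect -p <pid-of-hung-benchmark-process>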

@janvorli
Member

janvorli commented Apr 4, 2022

Interestingly, my attempt to run the benchmarks on Ubuntu 18.04 arm64 on my Odroid N2 device (with AOT in my case) hung in the Perf_FileStream.ReadAsync benchmark over the weekend. Unfortunately, I didn't capture a dump and just killed the run. I am running it again without AOT this time, so I'll watch for the same issue.

@jeffhandley
Member

I am hitting the same hang on my Surface Laptop 2 when using WSL2/Ubuntu 20.04. This is x64. I can repro it consistently while running only this benchmark. I'll connect with @adamsitnik to collect a dump.

@jeffhandley
Member

I uploaded the dump file to the location where we're posting our manual run results. I also observed some peculiar behavior:

  • While plugged in, on the Windows host Performance power plan, the benchmark hangs frequently (but not every time)
  • When on battery power, on the Performance power plan, I've been unable to repro the hang
  • When plugged in or on battery power, on the Balanced power plan, I've been unable to repro the hang

@adamsitnik
Member Author

@janvorli I assume you were not using WSL?

@jeffhandley
Member

ReadAsync caused a hang on my WSL/Ubuntu 20.04 config as well. I uploaded dumps from that benchmark too.

@janvorli
Member

janvorli commented Apr 7, 2022

@janvorli I assume you were not using WSL?

Right, it was a physical device in my case.

@janvorli
Member

janvorli commented Apr 7, 2022

I just noticed that I have a hang in ReadAsync in another benchmark run that I started yesterday on the same device, this time in an Alpine docker container.

@carlossanlop
Member

FYI - I hit this hang in:

  • Ubuntu 20.04 x64 (Physical laptop)
  • Ubuntu 20.04 arm64 (WSL)
  • Debian 11 (WSL)

@adamsitnik
Member Author

@janvorli is there any chance you could take a look at the dump provided by Jeff? I am quite sure it's not a FileStream issue (we have not touched that code since 6.0), but I have no idea what is causing it or what the right area is.

@janvorli
Member

janvorli commented Apr 7, 2022

@adamsitnik sure, I'll take a look.

@janvorli
Member

janvorli commented Apr 7, 2022

@jeffhandley how did you take the dumps? The ReadAsync one is mostly readable, but it doesn't contain enough information to view managed method names. The other two are somehow broken: the stack traces are incomplete and SOS is unable to resolve any managed methods.

@jeffhandley
Member

I used createdump from the ~/.dotnet/shared/Microsoft.NETCore.App/6.0.3/ folder and took a normal minidump. Should I use a full dump or do something different?

@janvorli
Member

janvorli commented Apr 7, 2022

Can you please try to take a full dump (with the -u option) instead?
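(A hedged sketch of what that might look like with the createdump binary that ships next to the runtime; the 6.0.3 path matches the folder Jeff mentioned above, and the PID is a placeholder.)

# -u (--full) requests a full dump rather than a minidump
~/.dotnet/shared/Microsoft.NETCore.App/6.0.3/createdump -u <pid-of-hung-benchmark-process>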

@danmoseley
Member

I hit this hang on x64. I'll share a link to the dumps with @janvorli as soon as they're uploaded.

I'm not sure how helpful it will be -- there is really only one active thread:

> clrstack
OS Thread Id: 0x699f (0)
        Child SP               IP Call Site
00007FFDFA583D80 00007f54589ed08c [InlinedCallFrame: 00007ffdfa583d80]
00007FFDFA583D80 00007f53e0bb03ae [InlinedCallFrame: 00007ffdfa583d80]
00007FFDFA583D70 00007F53E0BB03AE Interop+Sys.<Read>g____PInvoke__|58_0(IntPtr, Byte*, Int32)
00007FFDFA583E20 00007F53E244D35B Interop+Sys.Read(System.Runtime.InteropServices.SafeHandle, Byte*, Int32) [/_/src/libraries/System.Net.Sockets/src/Microsoft.Interop.LibraryImportGenerator/Microsoft.Interop.LibraryImportGenerator/LibraryImports.g.cs @ 1486]
00007FFDFA583E60 00007F53E244D26C System.Net.Sockets.SocketPal.SysRead(System.Net.Sockets.SafeSocketHandle, System.Span`1<Byte>, Error ByRef) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs @ 103]
00007FFDFA583EA0 00007F53E244D1BC System.Net.Sockets.SocketPal.TryCompleteReceiveFrom(System.Net.Sockets.SafeSocketHandle, System.Span`1<Byte>, System.Collections.Generic.IList`1<System.ArraySegment`1<Byte>>, System.Net.Sockets.SocketFlags, Byte[], Int32 ByRef, Int32 ByRef, System.Net.Sockets.SocketFlags ByRef, System.Net.Sockets.SocketError ByRef) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs @ 810]
00007FFDFA583EE0 00007F53E24638D8 System.Net.Sockets.SocketAsyncContext.ReceiveFrom(System.Span`1<Byte>, System.Net.Sockets.SocketFlags ByRef, Byte[], Int32 ByRef, Int32, Int32 ByRef) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs @ 1569]
00007FFDFA583F70 00007F53E246362F System.Net.Sockets.Socket.Receive(System.Span`1<Byte>, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 1465]
00007FFDFA583FE0 00007F53E24634A2 System.IO.Pipes.PipeStream.ReadCore(System.Span`1<Byte>) [/_/src/libraries/System.IO.Pipes/src/System/IO/Pipes/PipeStream.Unix.cs @ 271]
00007FFDFA584020 00007F53E2430CAF System.IO.StreamReader.ReadBuffer()
00007FFDFA584070 00007F53E2463B1D System.IO.StreamReader.get_EndOfStream()
00007FFDFA584080 00007F53E245CD78 BenchmarkDotNet.Loggers.SynchronousProcessOutputLoggerWithDiagnoser.ProcessInput()
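(For anyone following along, a hedged sketch of how such a core dump can be opened and inspected with lldb plus SOS; the dump and host paths below are placeholders, not the actual files shared in this thread.)

# One-time setup: install the SOS plugin for lldb
dotnet tool install --global dotnet-sos
dotnet-sos install

# Open the dump against the host binary that produced it, then inspect managed state
lldb --core <path-to-dump> <path-to-dotnet-host>
(lldb) clrstack
(lldb) threadpool -wi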

@danmoseley
Member

I was able to reproduce this in 1 out of 5 tries with python3 ./scripts/benchmarks_monthly.py net7.0-preview3 --filter '*FlushAsync*' (which takes a couple of minutes), so looping that should hit it in a reasonable period of time. Or perhaps loop just the innermost command it runs, after running it once to set everything up.
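A hedged sketch of such a loop, using the benchmarks_monthly.py invocation from the comment above; the assumption is that a healthy run exits and the loop retries, while a hung run simply stops making progress and can then be dumped:

# Re-run the benchmark until a run stops making progress (the hang) or fails outright
while python3 ./scripts/benchmarks_monthly.py net7.0-preview3 --filter '*FlushAsync*'; do
    echo "run completed without hanging, retrying"
done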

@LoopedBard3
Member

We are also seeing this issue in the performance runs. We don't have any dumps, but this was occurring with the FlushAsync, WriteAsync, and ReadAsync Perf_FileStream benchmarks in our Linux x64 tests on Ubuntu.1804.

@janvorli
Member

I have investigated the issue and the problem is in the thread pool not processing a work item. As you can see from the SOS output below, there are no worker threads and the work request queue has 0 items, yet there is a queued work item. At this point, the process is waiting for that work item to complete and nothing happens.

(lldb) threadpool -wi
logStart: 0
logSize: 17
CPU utilization: 0 %
Worker Thread: Total: 0 Running: 0 Idle: 0 MaxLimit: 32767 MinLimit: 8
Work Request in Queue: 0

Queued work items:
Queue Address Work Item
00007efb4283a718 00007efb42300258 Microsoft.Win32.SafeHandles.SafeFileHandle+ThreadPoolValueTaskSource
Statistics:
MT Count TotalSize Class Name
00007f3b59025970 1 152 Microsoft.Win32.SafeHandles.SafeFileHandle+ThreadPoolValueTaskSource
Total 1 objects

--------------------------------------
Number of Timers: 0
--------------------------------------
Completion Port Thread:Total: 0 Free: 0 MaxFree: 0 CurrentLimit: 0 MaxLimit: 1000 MinLimit: 0

@ghost

ghost commented Apr 13, 2022

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.


LoopedBard3 added a commit to LoopedBard3/performance that referenced this issue Apr 13, 2022
LoopedBard3 added a commit to LoopedBard3/performance that referenced this issue Apr 13, 2022
DrewScoggins pushed a commit to dotnet/performance that referenced this issue Apr 13, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 18, 2022
kouvel added a commit to kouvel/performance that referenced this issue Apr 18, 2022
kouvel added a commit to kouvel/runtime that referenced this issue Apr 19, 2022
kouvel added a commit that referenced this issue Apr 19, 2022
* Fix a race condition in the thread pool

There is a case where on a work-stealing queue, both `LocalPop()` and `TrySteal()` may fail when running concurrently, and lead to a case where there is a work item but no threads are released to process it. Fixed to always ensure that there's a thread request when there was a missed steal. Also when `LocalPop()` fails, the thread does not attempt to pop anymore and that can be an issue if that thread is the last thread to look for work items. Fixed to always check the local queue.

Fixes #67545
@ghost ghost removed untriaged New issue has not been triaged by the area owner in-pr There is an active PR which will close this issue when it is merged labels Apr 19, 2022
adamsitnik pushed a commit to dotnet/performance that referenced this issue Apr 20, 2022
directhex pushed a commit to directhex/runtime that referenced this issue Apr 21, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 20, 2022