Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'dotnet restore' performance regression using SocketsHttpHandler #27215

Closed
davidsh opened this issue Aug 23, 2018 · 11 comments
Closed

'dotnet restore' performance regression using SocketsHttpHandler #27215

davidsh opened this issue Aug 23, 2018 · 11 comments
Labels
bug tenet-performance Performance related issue
Milestone

Comments

@davidsh
Copy link
Contributor

davidsh commented Aug 23, 2018

Original problem report

From: Matt Mitchell
Sent: Thursday, August 2, 2018 10:03 PM
To: Geoffrey Kizer; David Shulman
Cc: Stephen Toub
Subject: 2-5x perf regression (and probably correctness bug) with new http sockets in .NET Core 2.1

Hey,

I’ve been running the 2.2 builds for a while now, and was noticing a number of restore failures when running CLI testing. The restore failures manifest as a number of endpoint timeouts (endpoint did not respond in 100 seconds) and SSL connection termination errors. This did not happen in 2.1, so I started poking around. After much debugging and some assistance from David, it was discovered that the failures go away when DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER is set to 0. In fact the restore gets much much faster. I poked around and discovered this is also the case with 2.1.3, 2.1.2, and presumably 2.1.* since not many networking changes have gone in. Here are some high level points of what I know.

• A restore of the default mvc app using a simple nuget.config (only nuget.org) takes about 1:30 with the new sockets on.
• A restore of the default mvc app using a simple nuget.config (only nuget.org) takes about :36 with the new sockets off.
• A restore of the default mvc app using CLI’s nuget.config for the 2.2. build (which contains ~10 sources) takes about 5:00 with the new sockets on.
• A restore of the default mvc app using CLI’s nuget.config for the 2.2. build (which contains ~10 sources) takes about :50 with the new sockets off.


Update

On my dev machine, I was able to repro this performance regression. Using WinHttpHandler, the scenario runs in about 23 seconds. But using SocketsHttpHandler, it runs between 3- 5 minutes.

@davidsh davidsh self-assigned this Aug 23, 2018
@davidsh
Copy link
Contributor Author

davidsh commented Aug 23, 2018

We’ve discovered what seems to be an IO thread starvation problem with SocketsHttpHandler in the NuGet client libraries (NuGet.Protocol.dll).

Background: In 2.1, the default HttpClientHandler changed from WinHttpHandler to the new SocketsHttpHandler. SocketsHttpHandler is built on managed sockets and thus uses our IOCP and IO threads to process completions. WinHttpHandler does not use the .NET threadpool. It is built on top of native WinHTTP and so uses the native threadpool. Therefore, it doesn’t hit this problem.

The Nuget client is issuing many (100s) of async HTTP requests in a short period of time (a couple seconds). As these HTTP requests complete, the Nuget app’s async continuation (running on an IO thread) seems to be doing FileStream operations that cause the IO thread to block waiting on an async file completion. In particular, it’s blocking here:

https://github.com/dotnet/corefx/blob/b384b309061c050a31dcf2b8f377f5da244fcf7b/src/Common/src/CoreLib/System/IO/FileStream.Windows.cs#L329

(There may be other places where IO threads are getting blocked, but this seems to be by far the most common.)

In this particular scenario, the Nuget code is calling Seek on the FileStream, and that’s apparently forcing a buffer flush. And since Seek is a synchronous call, the flush needs to be synchronous as well which also causes a blocking of the IO thread.

https://github.com/NuGet/NuGet.Client/blob/a3d835019b3cc8446638e308cdd794760189ccb5/src/NuGet.Core/NuGet.Protocol/HttpSource/HttpCacheUtility.cs#L84

// Validate the content before putting it into the cache.
fileStream.Seek(0, SeekOrigin.Begin);
ensureValidContents?.Invoke(fileStream);

As the IO threads block, we seem to (sometimes) get into a thread starvation death spiral. The file completions they are waiting for are queued behind a bunch of HTTP request completions. As the request completions are processed, they block more threads. Eventually, if we’re unlucky, all IO thread block. We will make progress slowly as the threadpool injects a new IO thread every half second, but these soon block for the same reason, and our progress is slowed to a crawl until the completions finally get drained (or we hit timeouts that cause a bunch of cancellations).

Setting min IO threads to something large like 128 (on an 8 proc machine) fixes this issue.

Increasing min IO threads is a workaround, and may be a viable short-term fix. But it seems like there’s a deeper issue here.

@davidsh
Copy link
Contributor Author

davidsh commented Aug 23, 2018

We also discovered another place in FileStream where async operations actually end up being blocked by sync operations:

https://github.com/dotnet/corefx/blob/b384b309061c050a31dcf2b8f377f5da244fcf7b/src/Common/src/CoreLib/System/IO/FileStream.Windows.cs#L1547-L1561

@davidsh
Copy link
Contributor Author

davidsh commented Aug 23, 2018

In order to workaround this behavior, we were able to determine a small fix to the NuGet client:

Change this line:
https://github.com/NuGet/NuGet.Client/blob/a3d835019b3cc8446638e308cdd794760189ccb5/src/NuGet.Core/NuGet.Protocol/HttpSource/HttpCacheUtility.cs#L75

to pass in 1 for the Buffer size.

Old

using (var fileStream = new FileStream(
    result.NewFile,
    FileMode.Create,
    FileAccess.ReadWrite,
    FileShare.None,
    BufferSize,
    useAsync: true))

New

using (var fileStream = new FileStream(
    result.NewFile,
    FileMode.Create,
    FileAccess.ReadWrite,
    FileShare.None,
    1,
    useAsync: true))

That'll effectively disable the FileStream's buffering and prevent sync blocking of the IO threadpool threads.

Using the fix, building a new NuGet.Protocol.dll binary, and dropping it into the repro results in performance back to baseline. On my dev machine, I'm getting 20-25 second performance (same as WinHttpHandler) instead of taking 3-5 minutes.

@karelz
Copy link
Member

karelz commented Aug 23, 2018

Given that it has fairly nasty impact on certain customer(s), we need to find an "easy" way to detect that we're hitting this problem - either from dump, or via some counters/events. Changing code and seeing if it helps should be the last resort detection mechanism.

@kouvel we may need your help to look into IO threadpool to see what we can tweak to avoid these kind of problems.

cc @geoffkizer @stephentoub

@davidsh davidsh removed their assignment Aug 23, 2018
@davidfowl
Copy link
Member

As these HTTP requests complete, the Nuget app’s async continuation (running on an IO thread)

@davidsh how did that code end up running on an IO thread?

@davidsh
Copy link
Contributor Author

davidsh commented Aug 23, 2018

@davidsh how did that code end up running on an IO thread?

I will forward you the longer email thread from which I summarized the above.

@davidsh
Copy link
Contributor Author

davidsh commented Aug 23, 2018

@davidsh how did that code end up running on an IO thread? Don't we always dispatch continuations to worker threads?

From: @stephentoub

Nope. On Linux we get off of the single dedicated thread that listens to the epoll by queuing to the worker pool, but on Windows we just run where we get called back, which is on whatever io thread serviced the completion port.

@danmoseley
Copy link
Member

@davidsh should this move out of 2.1.x milestone?

@davidsh
Copy link
Contributor Author

davidsh commented Sep 25, 2018

@davidsh should this move out of 2.1.x milestone?

Probably since it is not being actively championed.

cc: @geoffkizer @karelz @stephentoub

@karelz
Copy link
Member

karelz commented Oct 4, 2018

The conclusion was to fix nuget client - so that part is external.
There are separate discussions about improving .NET Core thread pool in general, but that is separate from this issue.

Closing.

@karelz karelz closed this as completed Oct 4, 2018
@stephentoub
Copy link
Member

There are separate discussions about improving .NET Core thread pool in general, but that is separate from this issue.

The right fix, which is what the "fix nuget client" is working around, is to address FileStream's inappropriate blocking.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

6 participants