Remove spin lock from SocketAsyncEventArgs on Windows #64770

stephentoub · 2022-02-03T22:03:57Z

The Windows implementation of SocketAsyncEventArgs has a spin lock to help coordinate between the thread initiating the Winsock operation and the eventual overlapped completion callback. There are some operations we delay (e.g. registering for cancellation) until after the operation has already been initiated and shown to complete asynchronously rather than synchronously, and as these are being set up after the Winsock call to perform the socket operation, it's possible for the overlapped completion to happen before or while we do that additional cleanup. This condition was expected to be rare, and a spin lock is used to ensure that if the race condition does occur, the callback waits for the state set up to be complete before continuing.

However, it turns out for certain operations it's actually not that rare. In particular, it appears that accepts when there's already a pending connection end up frequently completing asynchronously but immediately, which causes this race condition to manifest, and we've seen the spin lock spin so many times that it falls back to an actual sleep that causes unexpected delays.

We can instead just maintain a simple gate that describes whether the launching thread or the callback thread own completion. The launcher sets up all the state and then tries to transition to set the gate. Similarly, the first thing the callback does is set the gate (to a packed result in case the launcher needs it). Whoever gets there second is responsible for handling completion. If the launching thread is the one that gets there second, it essentially turns the asynchronous operation into a synchronous one, from the perspective of the caller, just as if the operation had completed synchronously.

Fixes #61233
cc: @geoffkizer, @CarnaViire

@CarnaViire, can you help validate that this does indeed fix the cited issue? Thanks!

ghost · 2022-02-03T22:04:05Z

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

The Windows implementation of SocketAsyncEventArgs has a spin lock to help coordinate between the thread initiating the Winsock operation and the eventual overlapped completion callback. There are some operations we delay (e.g. registering for cancellation) until after the operation has already been initiated and shown to complete asynchronously rather than synchronously, and as these are being set up after the Winsock call to perform the socket operation, it's possible for the overlapped completion to happen before or while we do that additional cleanup. This condition was expected to be rare, and a spin lock is used to ensure that if the race condition does occur, the callback waits for the state set up to be complete before continuing.

However, it turns out for certain operations it's actually not that rare. In particular, it appears that accepts when there's already a pending connection end up frequently completing asynchronously but immediately, which causes this race condition to manifest, and we've seen the spin lock spin so many times that it falls back to an actual sleep that causes unexpected delays.

We can instead just maintain a simple gate that describes whether the launching thread or the callback thread own completion. The launcher sets up all the state and then tries to transition to set the gate. Similarly, the first thing the callback does is set the gate (to a packed result in case the launcher needs it). Whoever gets there second is responsible for handling completion. If the launching thread is the one that gets there second, it essentially turns the asynchronous operation into a synchronous one, from the perspective of the caller, just as if the operation had completed synchronously.

Fixes #61233
cc: @geoffkizer, @CarnaViire

@CarnaViire, can you help validate that this does indeed fix the cited issue? Thanks!

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Net.Sockets`
Milestone:	-

src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs

wfurt

nice!

src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs

stephentoub · 2022-02-04T22:10:37Z

I think the (few) test failures here are valid, though I'm not seeing them locally, likely due to timing. I expect the problem is that in the case where the operation does complete asynchronously so quickly that it transfers ownership back to the launching thread, it's not doing the whole translation dance on the error code whereby it requeries the overlapped for the error. I'll refactor it accordingly on Monday to see if that addresses the issue.

stephentoub · 2022-02-06T02:19:32Z

I fixed the issue.

The Windows implementation of SocketAsyncEventArgs has a spin lock to help coordinate between the thread initiating the Winsock operation and the eventual overlapped completion callback. There are some operations we delay (e.g. registering for cancellation) until after the operation has already been initiated and shown to complete asynchronously rather than synchronously, and as these are being set up after the Winsock call to perform the socket operation, it's possible for the overlapped completion to happen before or while we do that additional cleanup. This condition was expected to be rare, and a spin lock is used to ensure that if the race condition does occur, the callback waits for the state set up to be complete before continuing. However, it turns out for certain operations it's actually not that rare. In particular, it appears that accepts when there's already a pending connection end up frequently completing asynchronously but immediately, which causes this race condition to manifest, and we've seen the spin lock spin so many times that it falls back to an actual sleep that causes unexpected delays. We can instead just maintain a simple gate that describes whether the launching thread or the callback thread own completion. The launcher sets up all the state and then tries to transition to set the gate. Similarly, the first thing the callback does is set the gate (to a packed result in case the launcher needs it). Whoever gets there second is responsible for handling completion. If the launching thread is the one that gets there second, it essentially turns the asynchronous operation into a synchronous one, from the perspective of the caller, just as if the operation had completed synchronously.

antonfirsov · 2022-02-07T14:40:35Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2022-02-07T14:40:49Z

Azure Pipelines successfully started running 1 pipeline(s).

antonfirsov · 2022-02-07T14:51:04Z

/azp run runtime-libraries stress-http

azure-pipelines · 2022-02-07T14:51:15Z

Azure Pipelines successfully started running 1 pipeline(s).

antonfirsov

LGTM, wish we had stress tests exercising Accept.

src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs

stephentoub · 2022-02-07T19:28:26Z

@antonfirsov, you added a bunch of additional CI legs. Are any of the failures relevant?

CarnaViire

LGTM. I've also confirmed that the change fixes the regression (on both pure socket and ASP.NET Core repro)

stephentoub · 2022-02-07T22:38:17Z

LGTM. I've also confirmed that the change fixes the regression (on both pure socket and ASP.NET Core repro)

Thanks for confirming.

antonfirsov · 2022-02-07T23:29:52Z

@stephentoub all CI failures are unrelated, and all Windows HttpStress runs were fully successful.

stephentoub · 2022-02-08T00:16:01Z

Excellent. Thanks.

sebastienros · 2022-02-09T15:10:59Z

Beautiful RPS and latency results

INTEL/Windows 12 cores.

| application           | net6.0  | net7.0                    |          |
| --------------------- | ------- | ------------------------- | -------- |
| CPU Usage (%)         |       8 |                        33 | +312.50% |
| Cores usage (%)       |      97 |                       399 | +311.34% |
| Working Set (MB)      |     132 |                       131 |   -0.76% |
| Private Memory (MB)   |     156 |                       154 |   -1.28% |
| Build Time (ms)       |   1,422 |                     1,601 |  +12.59% |
| Start Time (ms)       |     505 |                       506 |   +0.20% |
| Published Size (KB)   |  91,392 |                    91,372 |   -0.02% |
| .NET Core SDK Version | 6.0.102 | 7.0.100-preview.2.22108.9 |          |


| load                   | net6.0 | net7.0  |          |
| ---------------------- | ------ | ------- | -------- |
| CPU Usage (%)          |     16 |      23 |  +43.75% |
| Cores usage (%)        |    190 |     271 |  +42.63% |
| Working Set (MB)       |     38 |      38 |    0.00% |
| Private Memory (MB)    |    357 |     357 |    0.00% |
| Start Time (ms)        |      0 |       1 |      +∞% |
| First Request (ms)     |     85 |      87 |   +2.35% |
| Requests/sec           | 11,753 |  27,213 | +131.54% |
| Requests               | 94,712 | 219,293 | +131.54% |
| Mean latency (ms)      |   4.27 |    1.09 |  -74.47% |
| Max latency (ms)       |  40.51 |   15.77 |  -61.07% |
| Bad responses          |      0 |       0 |          |
| Socket errors          |      0 |       0 |          |
| Read throughput (MB/s) |   1.59 |    3.69 | +132.08% |
| Latency 50th (ms)      |   1.00 |    1.05 |   +5.00% |
| Latency 75th (ms)      |   6.88 |    1.09 |  -84.16% |
| Latency 90th (ms)      |  11.90 |    1.13 |  -90.50% |
| Latency 99th (ms)      |  22.58 |    2.26 |  -89.99% |

stephentoub · 2022-02-09T15:13:42Z

Beautiful RPS and latency results

Thanks, @sebastienros. Which test are those the results from? Did it positively or negatively impact any others we pay attention to?

sebastienros · 2022-02-09T15:19:38Z

These were the tests I mentioned in the issue, connectionclose, which is forcing the load generator to create a new connection with each request by sending a connection: close header. We track it specifically (and also the https version) on the charts. Now next step is to continue our investigation on https handshakes perf (compared to HttpSys) which led us to this issue.

stephentoub · 2022-02-09T15:20:49Z

Thanks, but my question was more about ensuring this didn't regress anything else while also fixing the previous regression.

sebastienros · 2022-02-09T15:30:29Z

I just checked and everything is the same on the most important benchmarks. They are monitored by a regression bot so if something wrong is happening we should know in the next 48h.

stephentoub · 2022-02-09T15:31:15Z

Great, thanks.

dotnet-issue-labeler bot added the area-System.Net.Sockets label Feb 3, 2022

ghost assigned stephentoub Feb 3, 2022

stephentoub added the tenet-performance Performance related issue label Feb 3, 2022

jasper-d reviewed Feb 3, 2022

View reviewed changes

src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs Show resolved Hide resolved

This was referenced Feb 4, 2022

wasmtestonbrowser-system.text.regularexpressions.tests work item #64781

Closed

[wasm] sgen assertion with in System.Text.RegularExpressions.Tests #64769

Closed

wfurt approved these changes Feb 4, 2022

View reviewed changes

CarnaViire reviewed Feb 4, 2022

View reviewed changes

src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs Outdated Show resolved Hide resolved

src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs Outdated Show resolved Hide resolved

runfoapp bot mentioned this pull request Feb 4, 2022

sendpacketselement_filestreamlargeoffset_throws fails with timeout in net6.0 builds #63522

Closed

stephentoub force-pushed the removespinning branch from c05f7ba to c755905 Compare February 4, 2022 16:04

stephentoub force-pushed the removespinning branch from c755905 to 5092abd Compare February 5, 2022 18:51

stephentoub added 2 commits February 6, 2022 22:37

Avoid an Interlocked.Exchange in the common case

dbb82b9

stephentoub force-pushed the removespinning branch from 5092abd to dbb82b9 Compare February 7, 2022 03:49

antonfirsov approved these changes Feb 7, 2022

View reviewed changes

src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs Show resolved Hide resolved

CarnaViire approved these changes Feb 7, 2022

View reviewed changes

stephentoub merged commit 1b34163 into dotnet:main Feb 8, 2022

stephentoub deleted the removespinning branch February 8, 2022 00:16

ghost locked as resolved and limited conversation to collaborators Mar 11, 2022

karelz added this to the 7.0.0 milestone Apr 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove spin lock from SocketAsyncEventArgs on Windows #64770

Remove spin lock from SocketAsyncEventArgs on Windows #64770

stephentoub commented Feb 3, 2022

ghost commented Feb 3, 2022

wfurt left a comment

stephentoub commented Feb 4, 2022

stephentoub commented Feb 6, 2022

antonfirsov commented Feb 7, 2022

azure-pipelines bot commented Feb 7, 2022

antonfirsov commented Feb 7, 2022

azure-pipelines bot commented Feb 7, 2022

antonfirsov left a comment

stephentoub commented Feb 7, 2022

CarnaViire left a comment

stephentoub commented Feb 7, 2022

antonfirsov commented Feb 7, 2022

stephentoub commented Feb 8, 2022

sebastienros commented Feb 9, 2022

stephentoub commented Feb 9, 2022

sebastienros commented Feb 9, 2022

stephentoub commented Feb 9, 2022

sebastienros commented Feb 9, 2022

stephentoub commented Feb 9, 2022

Remove spin lock from SocketAsyncEventArgs on Windows #64770

Remove spin lock from SocketAsyncEventArgs on Windows #64770

Conversation

stephentoub commented Feb 3, 2022

ghost commented Feb 3, 2022

wfurt left a comment

Choose a reason for hiding this comment

stephentoub commented Feb 4, 2022

stephentoub commented Feb 6, 2022

antonfirsov commented Feb 7, 2022

azure-pipelines bot commented Feb 7, 2022

antonfirsov commented Feb 7, 2022

azure-pipelines bot commented Feb 7, 2022

antonfirsov left a comment

Choose a reason for hiding this comment

stephentoub commented Feb 7, 2022

CarnaViire left a comment

Choose a reason for hiding this comment

stephentoub commented Feb 7, 2022

antonfirsov commented Feb 7, 2022

stephentoub commented Feb 8, 2022

sebastienros commented Feb 9, 2022

stephentoub commented Feb 9, 2022

sebastienros commented Feb 9, 2022

stephentoub commented Feb 9, 2022

sebastienros commented Feb 9, 2022

stephentoub commented Feb 9, 2022