
Remove spin lock from SocketAsyncEventArgs on Windows #64770

Merged
merged 2 commits into dotnet:main on Feb 8, 2022

Conversation

stephentoub
Member

The Windows implementation of SocketAsyncEventArgs has a spin lock to help coordinate between the thread initiating the Winsock operation and the eventual overlapped completion callback. There are some operations we delay (e.g. registering for cancellation) until after the operation has been initiated and shown to complete asynchronously rather than synchronously, and because these are set up after the Winsock call that performs the socket operation, it's possible for the overlapped completion to happen before or while we do that additional setup. This condition was expected to be rare, and the spin lock is used to ensure that if the race does occur, the callback waits for that setup to complete before continuing.

However, it turns out that for certain operations it's actually not that rare. In particular, accepts issued when there's already a pending connection appear to frequently complete asynchronously but immediately, which causes this race to manifest, and we've seen the spin lock spin so many times that it falls back to an actual sleep, causing unexpected delays.

We can instead just maintain a simple gate that tracks whether the launching thread or the callback thread owns completion. The launcher sets up all the state and then tries to transition the gate. Similarly, the first thing the callback does is set the gate (to a packed result, in case the launcher needs it). Whoever gets there second is responsible for handling completion. If the launching thread gets there second, it essentially turns the asynchronous operation into a synchronous one from the perspective of the caller, just as if the operation had completed synchronously.
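
For illustration, here's a minimal sketch of the gate idea. The class, field names, and packing scheme below are hypothetical simplifications, not the actual SocketAsyncEventArgs code:

```csharp
using System.Threading;

// Hypothetical sketch of the ownership gate; not the runtime's implementation.
internal sealed class CompletionGate
{
    // 0 == neither side has reached the gate yet; any other value was stored
    // by whichever side got there first.
    private ulong _gate;

    private const ulong LauncherArrived = 1;
    private const ulong CallbackFlag = 1UL << 63;

    // Called by the launching thread after the Winsock call and the deferred
    // setup (e.g. cancellation registration). Returns true if the callback
    // already fired, in which case the launcher owns completion and receives
    // the callback's packed result.
    public bool LauncherMustHandleCompletion(out uint errorCode, out uint bytesTransferred)
    {
        ulong observed = Interlocked.CompareExchange(ref _gate, LauncherArrived, 0);
        errorCode = (uint)((observed & ~CallbackFlag) >> 32);
        bytesTransferred = (uint)observed;
        return observed != 0; // non-zero: the callback got there first
    }

    // Called first thing by the overlapped completion callback. Returns true
    // if the launcher already finished its setup, in which case the callback
    // owns completion; otherwise the launcher will pick up the packed result.
    public bool CallbackMustHandleCompletion(uint errorCode, uint bytesTransferred)
    {
        ulong packed = CallbackFlag | ((ulong)errorCode << 32) | bytesTransferred;
        return Interlocked.Exchange(ref _gate, packed) == LauncherArrived;
    }
}
```

Whichever side observes that the other has already arrived runs the completion logic, so in the launcher-second case the operation behaves for the caller as if it had completed synchronously.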

Fixes #61233
cc: @geoffkizer, @CarnaViire

@CarnaViire, can you help validate that this does indeed fix the cited issue? Thanks!

@ghost

ghost commented Feb 3, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: stephentoub
Assignees: -
Labels: area-System.Net.Sockets

Milestone: -

@stephentoub stephentoub added the tenet-performance (Performance related issue) label on Feb 3, 2022
@wfurt
Member

nice!

@stephentoub
Member Author

I think the (few) test failures here are valid, though I'm not seeing them locally, likely due to timing. I expect the problem is that when the operation completes asynchronously but so quickly that ownership transfers back to the launching thread, that path isn't doing the whole error-code translation dance, where the overlapped is re-queried for the actual error. I'll refactor it accordingly on Monday to see whether that addresses the issue.
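
For context, the "translation dance" refers to asking Winsock for the real error associated with the overlapped operation rather than trusting the status handed to the callback. A rough, hypothetical sketch of that lookup (simplified P/Invoke, not the runtime's actual interop helper):

```csharp
using System;
using System.Net.Sockets;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical helper; the real code lives in the runtime's interop layer.
internal static unsafe class OverlappedErrorLookup
{
    [DllImport("ws2_32.dll", SetLastError = true)]
    private static extern bool WSAGetOverlappedResult(
        IntPtr socketHandle,
        NativeOverlapped* overlapped,
        out int bytesTransferred,
        bool wait,
        out SocketFlags flags);

    // Re-queries the overlapped to translate a completion into a SocketError.
    public static SocketError GetSocketError(IntPtr socketHandle, NativeOverlapped* overlapped, out int bytesTransferred)
    {
        if (WSAGetOverlappedResult(socketHandle, overlapped, out bytesTransferred, wait: false, out _))
        {
            return SocketError.Success;
        }

        // On failure, the specific Winsock error is retrieved via the last error.
        return (SocketError)Marshal.GetLastWin32Error();
    }
}
```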

@stephentoub
Member Author

I fixed the issue.

@antonfirsov
Member

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@antonfirsov
Member

/azp run runtime-libraries stress-http

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@antonfirsov
Member

LGTM, wish we had stress tests exercising Accept.

@stephentoub
Member Author

@antonfirsov, you added a bunch of additional CI legs. Are any of the failures relevant?

@CarnaViire
Member

LGTM. I've also confirmed that the change fixes the regression (on both the pure-socket and the ASP.NET Core repros).

@stephentoub
Member Author

> LGTM. I've also confirmed that the change fixes the regression (on both the pure-socket and the ASP.NET Core repros).

Thanks for confirming.

@antonfirsov
Member

@stephentoub all CI failures are unrelated, and all Windows HttpStress runs were fully successful.

@stephentoub
Member Author

Excellent. Thanks.

@stephentoub stephentoub merged commit 1b34163 into dotnet:main Feb 8, 2022
@stephentoub stephentoub deleted the removespinning branch February 8, 2022 00:16
@sebastienros
Member

Beautiful RPS and latency results

INTEL/Windows 12 cores.

| application           | net6.0  | net7.0                    | delta    |
| --------------------- | ------- | ------------------------- | -------- |
| CPU Usage (%)         |       8 |                        33 | +312.50% |
| Cores usage (%)       |      97 |                       399 | +311.34% |
| Working Set (MB)      |     132 |                       131 |   -0.76% |
| Private Memory (MB)   |     156 |                       154 |   -1.28% |
| Build Time (ms)       |   1,422 |                     1,601 |  +12.59% |
| Start Time (ms)       |     505 |                       506 |   +0.20% |
| Published Size (KB)   |  91,392 |                    91,372 |   -0.02% |
| .NET Core SDK Version | 6.0.102 | 7.0.100-preview.2.22108.9 |          |


| load                   | net6.0 | net7.0  | delta    |
| ---------------------- | ------ | ------- | -------- |
| CPU Usage (%)          |     16 |      23 |  +43.75% |
| Cores usage (%)        |    190 |     271 |  +42.63% |
| Working Set (MB)       |     38 |      38 |    0.00% |
| Private Memory (MB)    |    357 |     357 |    0.00% |
| Start Time (ms)        |      0 |       1 |      +∞% |
| First Request (ms)     |     85 |      87 |   +2.35% |
| Requests/sec           | 11,753 |  27,213 | +131.54% |
| Requests               | 94,712 | 219,293 | +131.54% |
| Mean latency (ms)      |   4.27 |    1.09 |  -74.47% |
| Max latency (ms)       |  40.51 |   15.77 |  -61.07% |
| Bad responses          |      0 |       0 |          |
| Socket errors          |      0 |       0 |          |
| Read throughput (MB/s) |   1.59 |    3.69 | +132.08% |
| Latency 50th (ms)      |   1.00 |    1.05 |   +5.00% |
| Latency 75th (ms)      |   6.88 |    1.09 |  -84.16% |
| Latency 90th (ms)      |  11.90 |    1.13 |  -90.50% |
| Latency 99th (ms)      |  22.58 |    2.26 |  -89.99% |

@stephentoub
Member Author

> Beautiful RPS and latency results

Thanks, @sebastienros. Which test are these results from? Did the change positively or negatively impact any of the other benchmarks we pay attention to?

@sebastienros
Member

These are the tests I mentioned in the issue: connectionclose, which forces the load generator to create a new connection for each request by sending a `connection: close` header. We track it specifically (and also the https variant) on the charts. The next step is to continue our investigation into https handshake performance (compared to HttpSys), which is what led us to this issue.
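
As a hypothetical illustration of what that workload does on the client side (the URL, loop count, and harness are made up; the real benchmark uses its own load generator):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Each request carries "Connection: close", so the server tears down the
// connection after responding and the client must open a new one next time.
internal static class ConnectionCloseLoop
{
    public static async Task Main()
    {
        using var client = new HttpClient();

        for (int i = 0; i < 100; i++) // loop count is arbitrary for illustration
        {
            using var request = new HttpRequestMessage(HttpMethod.Get, "http://localhost:5000/"); // hypothetical endpoint
            request.Headers.ConnectionClose = true; // adds the "Connection: close" header

            using HttpResponseMessage response = await client.SendAsync(request);
            Console.WriteLine($"{i}: {(int)response.StatusCode}");
        }
    }
}
```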

@stephentoub
Member Author

Thanks, but my question was more about ensuring this didn't regress anything else while also fixing the previous regression.

@sebastienros
Member

I just checked, and everything is the same on the most important benchmarks. They are monitored by a regression bot, so if something goes wrong we should know within the next 48 hours.

@stephentoub
Member Author

Great, thanks.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 11, 2022
@karelz karelz added this to the 7.0.0 milestone Apr 8, 2022

Successfully merging this pull request may close these issues.

Regression in connections creation performance
7 participants