Improve HTTP/2 scaling performance #35694

Merged: 11 commits merged into dotnet:master on May 7, 2020

Conversation

stephentoub (Member)

Contributes to #35184
cc: @scalablecory, @wfurt, @JamesNK

I used James' repro from the linked issue, but focused on the "r" mode, which is the one that uses HttpClient directly rather than going through higher layers. I also made some tweaks to the code, in JamesNK/Http2Perf#1, to remove some of the overhead from that layer.

Net net, at 100 concurrent requests in this repro this PR improves RPS by ~2x.

These numbers were gathered on top of #35575.


Before, "r 1 false", Kestrel, client GC

Successfully processed 13424; RPS 7723; Errors 0; Last elapsed 0.1111ms
Successfully processed 21233; RPS 7807; Errors 0; Last elapsed 0.1194ms
Successfully processed 29050; RPS 7815; Errors 0; Last elapsed 0.1277ms
Successfully processed 36905; RPS 7853; Errors 0; Last elapsed 0.1207ms
Successfully processed 44841; RPS 7935; Errors 0; Last elapsed 0.1261ms

After, "r 1 false", Kestrel, client GC

Successfully processed 13891; RPS 8071; Errors 0; Last elapsed 0.1182ms
Successfully processed 21925; RPS 8032; Errors 0; Last elapsed 0.122ms
Successfully processed 29912; RPS 7985; Errors 0; Last elapsed 0.1411ms
Successfully processed 37975; RPS 8062; Errors 0; Last elapsed 0.1402ms
Successfully processed 46108; RPS 8131; Errors 0; Last elapsed 0.1257ms


Before, "r 100 false", Kestrel, client GC

Successfully processed 79559; RPS 47748; Errors 0; Last elapsed 2.2332ms
Successfully processed 127500; RPS 47927; Errors 0; Last elapsed 2.0807ms
Successfully processed 175485; RPS 47975; Errors 0; Last elapsed 1.9676ms
Successfully processed 223111; RPS 47607; Errors 0; Last elapsed 2.0324ms
Successfully processed 271552; RPS 48428; Errors 0; Last elapsed 1.7378ms

After, "r 100 false", Kestrel, client GC

Successfully processed 157536; RPS 93753; Errors 0; Last elapsed 1.0717ms
Successfully processed 249660; RPS 92102; Errors 0; Last elapsed 0.8804ms
Successfully processed 339840; RPS 90051; Errors 0; Last elapsed 0.5867ms
Successfully processed 432333; RPS 92486; Errors 0; Last elapsed 0.3346ms
Successfully processed 525563; RPS 93227; Errors 0; Last elapsed 1.3646ms


Before, "r 100 false", Kestrel, server GC

Successfully processed 110621; RPS 42381; Errors 0; Last elapsed 2.6166ms
Successfully processed 152748; RPS 42110; Errors 0; Last elapsed 2.2373ms
Successfully processed 195372; RPS 42606; Errors 0; Last elapsed 2.2102ms
Successfully processed 237832; RPS 42451; Errors 0; Last elapsed 2.1775ms
Successfully processed 280273; RPS 42424; Errors 0; Last elapsed 2.2575ms

After, "r 100 false", Kestrel, server GC

Successfully processed 232393; RPS 101467; Errors 0; Last elapsed 1.4772ms
Successfully processed 333288; RPS 100866; Errors 0; Last elapsed 0.4939ms
Successfully processed 434999; RPS 101684; Errors 0; Last elapsed 0.325ms
Successfully processed 537804; RPS 102797; Errors 0; Last elapsed 0.4198ms
Successfully processed 638216; RPS 100403; Errors 0; Last elapsed 1.3467ms


There's still a gap with the C-based gRPC client (the "c" mode below), but it's much smaller now. On this benchmark HttpClient is ~18% faster with one concurrent request:

After "r 1 false", Kestrel, server GC

Successfully processed 54093; RPS 8102; Errors 0; Last elapsed 0.116ms
Successfully processed 62136; RPS 8041; Errors 0; Last elapsed 0.1253ms
Successfully processed 70204; RPS 8066; Errors 0; Last elapsed 0.124ms
Successfully processed 78212; RPS 8007; Errors 0; Last elapsed 0.1398ms
Successfully processed 86269; RPS 8055; Errors 0; Last elapsed 0.1541ms

"c 1 false", Kestrel

Successfully processed 12240; RPS 6480; Errors 0; Last elapsed 0.1659ms
Successfully processed 18691; RPS 6449; Errors 0; Last elapsed 0.149ms
Successfully processed 25202; RPS 6510; Errors 0; Last elapsed 0.1602ms
Successfully processed 31615; RPS 6411; Errors 0; Last elapsed 0.148ms
Successfully processed 38051; RPS 6435; Errors 0; Last elapsed 0.1745ms

whereas it's ~18% slower with 100 concurrent requests:

After "r 100 false", Kestrel, server GC

Successfully processed 333318; RPS 103495; Errors 0; Last elapsed 0.4614ms
Successfully processed 438239; RPS 104921; Errors 0; Last elapsed 1.3381ms
Successfully processed 541034; RPS 102781; Errors 0; Last elapsed 0.7129ms
Successfully processed 645850; RPS 104676; Errors 0; Last elapsed 0.5231ms
Successfully processed 749534; RPS 103589; Errors 0; Last elapsed 0.3373ms

"c 100 false", Kestrel

Successfully processed 234346; RPS 126174; Errors 0; Last elapsed 0.8313ms
Successfully processed 361244; RPS 126825; Errors 0; Last elapsed 0.6312ms
Successfully processed 487486; RPS 126207; Errors 0; Last elapsed 0.993ms
Successfully processed 612871; RPS 125318; Errors 0; Last elapsed 0.6118ms
Successfully processed 740618; RPS 127681; Errors 0; Last elapsed 0.5951ms


I expect there's still room for improvements; I only fixed the most egregious issues showing up in traces, and a few smaller things I noticed along the way.

The changes are broken out in individual commits. Quick summaries:

  • A "header serialization lock" was being employed and was a significant source of contention. This lock protected a shared header buffer. Rather than maintaining such a buffer on the Http2Connection, we instead just rent a pool buffer to write the headers into, which means we no longer need to serialize access to the shared buffer. The lock has been deleted.
  • Use a custom async mutex that significantly reduces contention based on the access patterns employed in Http2Connection, while also enabling us to avoid allocation per operation.
  • Changes how we handle fire-and-forget operations to employ a cheaper way of longing failures if they occur.
  • Removed several unnecessary async methods, e.g. removed a CopyToAsync wrapping task in SendRequestBodyAsync, combined AcquireWriteLockAsync into StartWriteAsync, removed the GetCancelableWaiterTask method
  • Stopped clearing an ArrayPool buffer when it was being returned; the zero'ing was showing up in traces, and everywhere else we were returning we weren't clearing.
  • Added commonly employed gRPC response headers to known headers, and added a static lookup for interned content-type strings
  • Avoided an unnecessary dictionary lookup
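
For illustration, the rent-a-buffer pattern from the first bullet might look roughly like the sketch below. This is not the actual Http2Connection code: the class, the serializeHeaders delegate, and the SemaphoreSlim standing in for the PR's custom async mutex are all illustrative.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal sealed class HeaderWriteSketch
{
    // Stand-in for the PR's custom async mutex guarding writes to the connection.
    private readonly SemaphoreSlim _writeLock = new SemaphoreSlim(1, 1);

    public async ValueTask WriteHeadersAsync(Func<byte[], int> serializeHeaders, Stream connection)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            // Each request serializes into its own rented buffer, outside any lock,
            // so no connection-wide "header serialization lock" is needed.
            int length = serializeHeaders(buffer);

            // The shared write lock is held only while the bytes are flushed.
            await _writeLock.WaitAsync().ConfigureAwait(false);
            try
            {
                await connection.WriteAsync(buffer.AsMemory(0, length)).ConfigureAwait(false);
            }
            finally
            {
                _writeLock.Release();
            }
        }
        finally
        {
            // Returned without clearing, consistent with the rest of the library.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```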

stephentoub added this to the 5.0 milestone May 1, 2020
ghost commented May 1, 2020

Tagging subscribers to this area: @dotnet/ncl
Notify danmosemsft if you want to be subscribed.

davidni (Contributor) commented May 1, 2020

Stopped clearing an ArrayPool buffer when it was being returned; the zeroing was showing up in traces, and everywhere else we were returning, we weren't clearing.

This seems worrisome from a security standpoint. Can we consider at a minimum using a different ArrayPool than the shared one to mitigate the possibility that code elsewhere (mis-)using ArrayPool might end up leaking sensitive data?
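
For reference, "a different ArrayPool than the shared one" is straightforward to create; a minimal sketch of what that could look like (the type name and size limits below are made up, not a recommendation):

```csharp
using System.Buffers;

internal static class Http2HeaderBufferPool
{
    // A private pool keeps these buffers out of ArrayPool<byte>.Shared,
    // so other code (mis-)using the shared pool never sees them.
    public static readonly ArrayPool<byte> Instance =
        ArrayPool<byte>.Create(maxArrayLength: 64 * 1024, maxArraysPerBucket: 32);
}
```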

stephentoub (Member, Author) commented May 1, 2020

This seems worrisome from a security standpoint. Can we consider at a minimum using a different ArrayPool than the shared one to mitigate the possibility that code elsewhere (mis-)using ArrayPool might end up leaking sensitive data?

This is one of many places ArrayPool<byte>.Shared.Return is used without clearing, especially since not clearing is the default:
https://source.dot.net/#System.Private.CoreLib/ArrayPool.cs,34fc80a1ec3f75d7,references
That includes other places in this class:
https://source.dot.net/#System.Net.Http/ArrayBuffer.cs,129
In other words, this particular call site was an anomaly.
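
To make the default concrete, Return only zeroes the array when the caller opts in via clearArray; a small standalone illustration (not code from this PR):

```csharp
using System.Buffers;

byte[] buffer = ArrayPool<byte>.Shared.Rent(16 * 1024);
try
{
    // ... fill and use the buffer ...
}
finally
{
    // Default: clearArray is false, so the contents are left as-is on return.
    ArrayPool<byte>.Shared.Return(buffer);

    // Opt-in clearing, which is what the one anomalous call site was doing:
    // ArrayPool<byte>.Shared.Return(buffer, clearArray: true);
}
```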

If we want to revisit such clearing decisions, we can, but that should be a separate issue / PR. This change just brings this particular call site in line with the rest.

Also, to my knowledge, most of pipelines and more generally Kestrel isn't clearing for its pooling, its slab memory blocks, etc. @davidfowl can correct me if I'm wrong.

davidfowl (Member)

Also, to my knowledge, most of pipelines and more generally Kestrel isn't clearing for its pooling, its slab memory blocks, etc. @davidfowl can correct me if I'm wrong.

That's right. It's more than just pipelines: most of ASP.NET isn't clearing memory on return in general. That goes for pipelines and other uses of the array pool.

JamesNK (Member) commented May 1, 2020

Nice! That's a great improvement.

Would it be useful to you to have benchmarks from different gRPC frameworks? e.g. golang and/or Java. I haven't looked closely at gRPC client performance until now, and I'm not sure whether the Grpc.Core client is a good target to aim towards.

Grpc.Core performance might be very good compared to the other gRPC clients because it is closely tied to its internal HTTP/2 client. Meanwhile, the Java gRPC client is built on a general-purpose HTTP/2 client: OkHttp.

On the other hand, Grpc.Core performance might be bad compared to other gRPC clients because it has the overhead of native execution plus a .NET wrapper on top.

ghost commented May 1, 2020

Seconding @davidni's comment re memory clearing, but concurring with @stephentoub -- this is a high-level security design decision that should be made at the project level, not in individual PRs. It'll be most effective when coordinated consistently across the product.

Re prior art, I don't know of any pipeline that clears its buffers, but I do know of security incidents whose impact was worsened by not clearing buffers. The most recent that comes to mind was one where a particular DDoS protection gateway leaked traffic between connections via stale buffers. Clearing sensitive buffers before reuse is one way to mitigate such incidents before they happen in the first place.

stephentoub (Member, Author)

Would it be useful to you to have benchmarks from different gRPC frameworks? e.g. golang and/or Java.

Yeah, @JamesNK, that'd be great.

stephentoub (Member, Author)

/azp run runtime-libraries stress-http

azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

stephentoub (Member, Author) commented May 5, 2020

I ran HttpStress on this for about a half-hour locally, and while there were 11 failed requests out of about a million (mostly the server closing the connection), that's fewer failures than when I ran it for a similar period of time without my changes.

scalablecory (Contributor) left a comment

lgtm. @alnikola, can we get another quick perf review once this is merged?

Many concurrent streams are currently bottlenecking in SendHeadersAsync.  They:
- take the headers serialization lock
- request a stream credit
- serialize all their headers to a buffer stored on the connection
- take the write lock
- write out the headers
- release the write lock
- release the headers serialization lock

Instead of using a header buffer pooled on the connection, we can temporarily grab ArrayPool buffers into which we serialize the headers, which allows us to eliminate the headers serialization lock. With that, we get:
- request a stream credit
- serialize the headers to a pooled buffer
- take the write lock
- write out the headers
- release the write lock

This has a significant impact on the ability for many concurrent streams to scale, as all of the header serialization work happens outside of the lock.

The current implementation will always invoke the continuation, even in the majority case where there's no failure (and if there is a failure, it's cheaper not to throw it again). We can avoid that and decrease both overhead and allocation in the common case.
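
As a generic illustration of that idea (not the PR's actual helper; the method name and ValueTask-based signature are assumptions), a fire-and-forget logger can skip the continuation entirely when the operation has already completed successfully:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

internal static class FireAndForgetSketch
{
    public static void LogExceptions(ValueTask task, Action<Exception> log)
    {
        if (task.IsCompletedSuccessfully)
        {
            // Common case: nothing failed, so no continuation (and no allocation) is needed.
            task.GetAwaiter().GetResult();
            return;
        }

        // Rare case: attach a continuation that only logs the failure rather than rethrowing it.
        task.AsTask().ContinueWith(
            t => log(t.Exception!.InnerException!),
            CancellationToken.None,
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
    }
}
```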

We already weren't clearing as the array grew, nor are we clearing in most other places in the library. It's not clear why we were clearing in this one spot, but the zeroing is showing up meaningfully in profiles.

We're seeing a lot of contention trying to acquire the monitors inside SemaphoreSlim instances, even when the contender is simply trying to release the semaphore, which should entail minimal contention (and delaying a release only causes more contention). We can use a small interlocked gate to add a fast path.

There's not a good reason to keep them separate, and StartWriteAsync is AcquireWriteLockAsync's only caller.

We can achieve the same thing without the extra async method by putting the cleanup logic into the awaiter's GetResult.
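
The "small interlocked gate" mentioned above can be illustrated with a generic sketch. This is not the PR's actual mutex; the counter-plus-semaphore shape below is just one common way to keep the uncontended path monitor-free:

```csharp
using System.Threading;
using System.Threading.Tasks;

internal sealed class FastPathAsyncMutexSketch
{
    private int _count;                                        // holders + waiters
    private readonly SemaphoreSlim _waiters = new SemaphoreSlim(0);

    public ValueTask EnterAsync()
    {
        // Fast path: the first arrival takes the lock with one interlocked increment
        // and never touches a monitor.
        if (Interlocked.Increment(ref _count) == 1)
            return default;

        // Slow path: park on the semaphore until Exit hands the lock to us.
        return new ValueTask(_waiters.WaitAsync());
    }

    public void Exit()
    {
        // If anyone queued behind us, wake exactly one; otherwise the lock is free again.
        if (Interlocked.Decrement(ref _count) != 0)
            _waiters.Release();
    }
}
```

Callers would pair `await mutex.EnterAsync()` with `mutex.Exit()` in a finally block; note the contended path here still allocates via SemaphoreSlim.WaitAsync, which a real allocation-free implementation would avoid.
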
stephentoub (Member, Author)

Thanks for reviewing, @scalablecory.

stephentoub merged commit cd8355b into dotnet:master May 7, 2020
stephentoub deleted the http2perf branch May 7, 2020 10:08
geoffkizer (Contributor)

A "header serialization lock" was being employed and was a significant source of contention. This lock protected a shared header buffer. Rather than maintaining such a buffer on the Http2Connection, we instead just rent a pool buffer to write the headers into, which means we no longer need to serialize access to the shared buffer. The lock has been deleted.

Note that part of the reason this existed in the first place was to serialize state changes within the header encoder, so that we could potentially add dynamic header encoding in the future.

I strongly doubt we will add dynamic header encoding here in the future, so it's probably a non-issue.

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020