Improve HTTP/2 scaling performance #35694

Merged: 11 commits merged into dotnet:master on May 7, 2020

Conversation

stephentoub (Member)

Contributes to #35184
cc: @scalablecory, @wfurt, @JamesNK

I used James' repro from the linked issue, but focused on the "r" mode, which is the one that uses HttpClient directly rather than going through higher layers. I also made some tweaks to the code, in JamesNK/Http2Perf#1, to remove some of the overhead from that layer.

Net net, at 100 concurrent requests in this repro this PR improves RPS by ~2x.

These numbers were gathered on top of #35575.


Before, "r 1 false", Kestrel, client GC

Successfully processed 13424; RPS 7723; Errors 0; Last elapsed 0.1111ms
Successfully processed 21233; RPS 7807; Errors 0; Last elapsed 0.1194ms
Successfully processed 29050; RPS 7815; Errors 0; Last elapsed 0.1277ms
Successfully processed 36905; RPS 7853; Errors 0; Last elapsed 0.1207ms
Successfully processed 44841; RPS 7935; Errors 0; Last elapsed 0.1261ms

After, "r 1 false", Kestrel, client GC

Successfully processed 13891; RPS 8071; Errors 0; Last elapsed 0.1182ms
Successfully processed 21925; RPS 8032; Errors 0; Last elapsed 0.122ms
Successfully processed 29912; RPS 7985; Errors 0; Last elapsed 0.1411ms
Successfully processed 37975; RPS 8062; Errors 0; Last elapsed 0.1402ms
Successfully processed 46108; RPS 8131; Errors 0; Last elapsed 0.1257ms


Before, "r 100 false", Kestrel, client GC

Successfully processed 79559; RPS 47748; Errors 0; Last elapsed 2.2332ms
Successfully processed 127500; RPS 47927; Errors 0; Last elapsed 2.0807ms
Successfully processed 175485; RPS 47975; Errors 0; Last elapsed 1.9676ms
Successfully processed 223111; RPS 47607; Errors 0; Last elapsed 2.0324ms
Successfully processed 271552; RPS 48428; Errors 0; Last elapsed 1.7378ms

After, "r 100 false", Kestrel, client GC

Successfully processed 157536; RPS 93753; Errors 0; Last elapsed 1.0717ms
Successfully processed 249660; RPS 92102; Errors 0; Last elapsed 0.8804ms
Successfully processed 339840; RPS 90051; Errors 0; Last elapsed 0.5867ms
Successfully processed 432333; RPS 92486; Errors 0; Last elapsed 0.3346ms
Successfully processed 525563; RPS 93227; Errors 0; Last elapsed 1.3646ms


Before, "r 100 false", Kestrel, server GC

Successfully processed 110621; RPS 42381; Errors 0; Last elapsed 2.6166ms
Successfully processed 152748; RPS 42110; Errors 0; Last elapsed 2.2373ms
Successfully processed 195372; RPS 42606; Errors 0; Last elapsed 2.2102ms
Successfully processed 237832; RPS 42451; Errors 0; Last elapsed 2.1775ms
Successfully processed 280273; RPS 42424; Errors 0; Last elapsed 2.2575ms

After, "r 100 false", Kestrel, server GC

Successfully processed 232393; RPS 101467; Errors 0; Last elapsed 1.4772ms
Successfully processed 333288; RPS 100866; Errors 0; Last elapsed 0.4939ms
Successfully processed 434999; RPS 101684; Errors 0; Last elapsed 0.325ms
Successfully processed 537804; RPS 102797; Errors 0; Last elapsed 0.4198ms
Successfully processed 638216; RPS 100403; Errors 0; Last elapsed 1.3467ms


There's still a gap with the C-based gRPC client (the "c" mode below), but it's much smaller now. On this benchmark HttpClient is ~18% faster with one concurrent request:

After "r 1 false", Kestrel, server GC

Successfully processed 54093; RPS 8102; Errors 0; Last elapsed 0.116ms
Successfully processed 62136; RPS 8041; Errors 0; Last elapsed 0.1253ms
Successfully processed 70204; RPS 8066; Errors 0; Last elapsed 0.124ms
Successfully processed 78212; RPS 8007; Errors 0; Last elapsed 0.1398ms
Successfully processed 86269; RPS 8055; Errors 0; Last elapsed 0.1541ms

"c 1 false", Kestrel

Successfully processed 12240; RPS 6480; Errors 0; Last elapsed 0.1659ms
Successfully processed 18691; RPS 6449; Errors 0; Last elapsed 0.149ms
Successfully processed 25202; RPS 6510; Errors 0; Last elapsed 0.1602ms
Successfully processed 31615; RPS 6411; Errors 0; Last elapsed 0.148ms
Successfully processed 38051; RPS 6435; Errors 0; Last elapsed 0.1745ms

whereas it's ~18% slower with 100 concurrent requests:

After "r 100 false", Kestrel, server GC

Successfully processed 333318; RPS 103495; Errors 0; Last elapsed 0.4614ms
Successfully processed 438239; RPS 104921; Errors 0; Last elapsed 1.3381ms
Successfully processed 541034; RPS 102781; Errors 0; Last elapsed 0.7129ms
Successfully processed 645850; RPS 104676; Errors 0; Last elapsed 0.5231ms
Successfully processed 749534; RPS 103589; Errors 0; Last elapsed 0.3373ms

"c 100 false", Kestrel

Successfully processed 234346; RPS 126174; Errors 0; Last elapsed 0.8313ms
Successfully processed 361244; RPS 126825; Errors 0; Last elapsed 0.6312ms
Successfully processed 487486; RPS 126207; Errors 0; Last elapsed 0.993ms
Successfully processed 612871; RPS 125318; Errors 0; Last elapsed 0.6118ms
Successfully processed 740618; RPS 127681; Errors 0; Last elapsed 0.5951ms


I expect there's still room for improvements; I only fixed the most egregious issues showing up in traces, and a few smaller things I noticed along the way.

The changes are broken out in individual commits. Quick summaries:

  • A "header serialization lock" was being employed and was a significant source of contention. This lock protected a shared header buffer. Rather than maintaining such a buffer on the Http2Connection, we instead just rent a pool buffer to write the headers into, which means we no longer need to serialize access to the shared buffer. The lock has been deleted.
  • Use a custom async mutex that significantly reduces contention based on the access patterns employed in Http2Connection, while also enabling us to avoid allocation per operation.
  • Changes how we handle fire-and-forget operations to employ a cheaper way of longing failures if they occur.
  • Removed several unnecessary async methods, e.g. removed a CopyToAsync wrapping task in SendRequestBodyAsync, combined AcquireWriteLockAsync into StartWriteAsync, removed the GetCancelableWaiterTask method
  • Stopped clearing an ArrayPool buffer when it was being returned; the zero'ing was showing up in traces, and everywhere else we were returning we weren't clearing.
  • Added commonly employed gRPC response headers to known headers, and added a static lookup for interned content-type strings
  • Avoided an unnecessary dictionary lookup
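
For illustration, the rent-a-buffer pattern from the first bullet might look roughly like the sketch below. This is not the actual Http2Connection code: the class, the serializeHeaders delegate, and the SemaphoreSlim standing in for the PR's custom async mutex are all illustrative.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal sealed class HeaderWriteSketch
{
    // Stand-in for the PR's custom async mutex guarding writes to the connection.
    private readonly SemaphoreSlim _writeLock = new SemaphoreSlim(1, 1);

    public async ValueTask WriteHeadersAsync(Func<byte[], int> serializeHeaders, Stream connection)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            // Each request serializes into its own rented buffer, outside any lock,
            // so no connection-wide "header serialization lock" is needed.
            int length = serializeHeaders(buffer);

            // The shared write lock is held only while the bytes are flushed.
            await _writeLock.WaitAsync().ConfigureAwait(false);
            try
            {
                await connection.WriteAsync(buffer.AsMemory(0, length)).ConfigureAwait(false);
            }
            finally
            {
                _writeLock.Release();
            }
        }
        finally
        {
            // Returned without clearing, consistent with the rest of the library.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```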

stephentoub added this to the 5.0 milestone May 1, 2020
ghost commented May 1, 2020

Tagging subscribers to this area: @dotnet/ncl
Notify danmosemsft if you want to be subscribed.

davidni (Contributor) commented May 1, 2020

Stopped clearing an ArrayPool buffer when it was being returned; the zeroing was showing up in traces, and everywhere else we were returning, we weren't clearing.

This seems worrisome from a security standpoint. Can we consider at a minimum using a different ArrayPool than the shared one to mitigate the possibility that code elsewhere (mis-)using ArrayPool might end up leaking sensitive data?
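
For reference, "a different ArrayPool than the shared one" is straightforward to create; a minimal sketch of what that could look like (the type name and size limits below are made up, not a recommendation):

```csharp
using System.Buffers;

internal static class Http2HeaderBufferPool
{
    // A private pool keeps these buffers out of ArrayPool<byte>.Shared,
    // so other code (mis-)using the shared pool never sees them.
    public static readonly ArrayPool<byte> Instance =
        ArrayPool<byte>.Create(maxArrayLength: 64 * 1024, maxArraysPerBucket: 32);
}
```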

stephentoub (Member, Author) commented May 1, 2020

This seems worrisome from a security standpoint. Can we consider at a minimum using a different ArrayPool than the shared one to mitigate the possibility that code elsewhere (mis-)using ArrayPool might end up leaking sensitive data?

This is one of many places ArrayPool<byte>.Shared.Return is used without clearing, especially since not clearing is the default:
https://source.dot.net/#System.Private.CoreLib/ArrayPool.cs,34fc80a1ec3f75d7,references
That includes other places in this class:
https://source.dot.net/#System.Net.Http/ArrayBuffer.cs,129
In other words, this particular call site was an anomaly.
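
To make the default concrete, Return only zeroes the array when the caller opts in via clearArray; a small standalone illustration (not code from this PR):

```csharp
using System.Buffers;

byte[] buffer = ArrayPool<byte>.Shared.Rent(16 * 1024);
try
{
    // ... fill and use the buffer ...
}
finally
{
    // Default: clearArray is false, so the contents are left as-is on return.
    ArrayPool<byte>.Shared.Return(buffer);

    // Opt-in clearing, which is what the one anomalous call site was doing:
    // ArrayPool<byte>.Shared.Return(buffer, clearArray: true);
}
```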

If we want to revisit such clearing decisions, we can, but that should be a separate issue / PR. This change just brings this particular call site in line with the rest.

Also, to my knowledge, most of pipelines and more generally Kestrel isn't clearing for its pooling, its slab memory blocks, etc. @davidfowl can correct me if I'm wrong.

davidfowl (Member)

Also, to my knowledge, most of pipelines and more generally Kestrel isn't clearing for its pooling, its slab memory blocks, etc. @davidfowl can correct me if I'm wrong.

That's right. It's more than just pipelines: most of ASP.NET isn't clearing memory on return in general. That goes for pipelines and other uses of the array pool.

JamesNK (Member) commented May 1, 2020

Nice! That's a great improvement.

Would it be useful to you to have benchmarks from different gRPC frameworks? e.g. golang and/or Java. I haven't looked closely at gRPC client performance until now, and I'm not sure whether the Grpc.Core client is a good target to aim towards.

Grpc.Core performance might be very good compared to the other gRPC clients because it is closely tied to its internal HTTP/2 client. Meanwhile, the Java gRPC client is built on a general-purpose HTTP/2 client: OkHttp.

On the other hand, Grpc.Core performance might be bad compared to other gRPC clients because it has the overhead of native execution plus a .NET wrapper on top.

ghost commented May 1, 2020

Seconding @davidni's comment re memory clearing, but concurring with @stephentoub -- this is a high-level security design decision that should be made at the project level, not in individual PRs. It'll be most effective when coordinated consistently across the product.

Re prior art, I don't know of any pipeline that clears its buffers, but I do know of security incidents whose impact was worsened by not clearing buffers. The most recent that comes to mind was one where a particular DDoS protection gateway leaked traffic between connections via stale buffers. Clearing sensitive buffers before reuse is one way to mitigate such incidents before they happen in the first place.

stephentoub (Member, Author)

Would it be useful to you to have benchmarks from different gRPC frameworks? e.g. golang and/or Java.

Yeah, @JamesNK, that'd be great.

stephentoub (Member, Author)

/azp run runtime-libraries stress-http

azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

stephentoub (Member, Author) commented May 5, 2020

I ran HttpStress on this for about a half-hour locally, and while there were 11 failed requests out of about a million (mostly the server closing the connection), that's fewer failures than when I ran it for a similar period of time without my changes.

scalablecory (Contributor) left a comment

lgtm. @alnikola, can we get another quick perf review once this is merged?

Many concurrent streams are currently bottlenecking in SendHeadersAsync.  They:
- take the headers serialization lock
- request a stream credit
- serialize all their headers to a buffer stored on the connection
- take the write lock
- write out the headers
- release the write lock
- release the headers serialization lock

Instead of using a header buffer pooled on the connection, we can temporarily grab ArrayPool buffers into which we serialize the headers, which allows us to eliminate the headers serialization lock. With that, we get:
- request a stream credit
- serialize the headers to a pooled buffer
- take the write lock
- write out the headers
- release the write lock

This has a significant impact on the ability for many concurrent streams to scale, as all of the header serialization work happens outside of the lock.

The current implementation will always invoke the continuation, even in the majority case where there's no failure (and if there is a failure, it's cheaper not to throw it again). We can avoid that and decrease both overhead and allocation in the common case.
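
As a generic illustration of that idea (not the PR's actual helper; the method name and ValueTask-based signature are assumptions), a fire-and-forget logger can skip the continuation entirely when the operation has already completed successfully:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

internal static class FireAndForgetSketch
{
    public static void LogExceptions(ValueTask task, Action<Exception> log)
    {
        if (task.IsCompletedSuccessfully)
        {
            // Common case: nothing failed, so no continuation (and no allocation) is needed.
            task.GetAwaiter().GetResult();
            return;
        }

        // Rare case: attach a continuation that only logs the failure rather than rethrowing it.
        task.AsTask().ContinueWith(
            t => log(t.Exception!.InnerException!),
            CancellationToken.None,
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
    }
}
```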

We already weren't clearing as the array grew, nor are we clearing in most other places in the library. It's not clear why we were clearing in this one spot, but the zeroing is showing up meaningfully in profiles.

We're seeing a lot of contention trying to acquire the monitors inside SemaphoreSlim instances, even when the contender is simply trying to release the semaphore, which should entail minimal contention (and delaying a release only causes more contention). We can use a small interlocked gate to add a fast path.

There's not a good reason to keep them separate, and StartWriteAsync is AcquireWriteLockAsync's only caller.

We can achieve the same thing without the extra async method by putting the cleanup logic into the awaiter's GetResult.
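
The "small interlocked gate" mentioned above can be illustrated with a generic sketch. This is not the PR's actual mutex; the counter-plus-semaphore shape below is just one common way to keep the uncontended path monitor-free:

```csharp
using System.Threading;
using System.Threading.Tasks;

internal sealed class FastPathAsyncMutexSketch
{
    private int _count;                                        // holders + waiters
    private readonly SemaphoreSlim _waiters = new SemaphoreSlim(0);

    public ValueTask EnterAsync()
    {
        // Fast path: the first arrival takes the lock with one interlocked increment
        // and never touches a monitor.
        if (Interlocked.Increment(ref _count) == 1)
            return default;

        // Slow path: park on the semaphore until Exit hands the lock to us.
        return new ValueTask(_waiters.WaitAsync());
    }

    public void Exit()
    {
        // If anyone queued behind us, wake exactly one; otherwise the lock is free again.
        if (Interlocked.Decrement(ref _count) != 0)
            _waiters.Release();
    }
}
```

Callers would pair `await mutex.EnterAsync()` with `mutex.Exit()` in a finally block; note the contended path here still allocates via SemaphoreSlim.WaitAsync, which a real allocation-free implementation would avoid.
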
stephentoub (Member, Author)

Thanks for reviewing, @scalablecory.

stephentoub merged commit cd8355b into dotnet:master May 7, 2020
stephentoub deleted the http2perf branch May 7, 2020 10:08
geoffkizer (Contributor)

A "header serialization lock" was being employed and was a significant source of contention. This lock protected a shared header buffer. Rather than maintaining such a buffer on the Http2Connection, we instead just rent a pool buffer to write the headers into, which means we no longer need to serialize access to the shared buffer. The lock has been deleted.

Note that part of the reason this existed in the first place was to serialize state changes within the header encoder, so that we could potentially add dynamic header encoding in the future.

I strongly doubt we will add dynamic header encoding here in the future, so it's probably a non-issue.

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020