
Reduce memory usage by SSLStream #49019

Closed
davidfowl opened this issue Mar 2, 2021 · 34 comments · Fixed by #49123

davidfowl (Member) commented Mar 2, 2021

Description

For long-running scenarios where reads are pending and data is only occasionally sent back and forth, the memory cost of using SslStream is high. Today SslStream allocates an 8K buffer for the handshake and a 32K buffer per read. It optimizes for reading the TLS header and payload into the same buffer and, in the end, copies that to the user-specified buffer.

Regression?

No

Data

A demo of 5000 websocket connections connected to an ASP.NET Core server using Kestrel:

Here's a heap dump sorted by inclusive size. Most of what is being held onto is array pool buffers.
[screenshot]

Allocation profile of 5000 websocket connections. You can see the array pool has the most byte[] allocations.

[screenshot]

They're mostly in the handshake and reading (I'm not writing in this app), just sitting mostly idle with ping pongs:

[screenshot]

Analysis

We could support specifying the buffer size for SslStream, much like FileStream, so that the user-provided buffer can be used instead of the internal buffer to reduce memory usage. A bufferSize of 1 would signal to the SslStream that the internal buffer shouldn't be used.

I don't think we need a handshake buffer size...

The problem is SslStream has 5 constructors...

public class SslStream
{
+ public SslStream(Stream innerStream, bool leaveInnerStreamOpen, RemoteCertificateValidationCallback userCertificateValidationCallback, LocalCertificateSelectionCallback? userCertificateSelectionCallback, EncryptionPolicy encryptionPolicy, int bufferSize)
}
@davidfowl davidfowl added the tenet-performance Performance related issue label Mar 2, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net.Security untriaged New issue has not been triaged by the area owner labels Mar 2, 2021
ghost commented Mar 2, 2021

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.


geoffkizer (Contributor):

We need the entire frame buffered in order to decrypt it. So we can't allow the user to dictate the buffer size.

geoffkizer (Contributor):

I do wish we used the same buffer for the handshake and the app data.

davidfowl (Member, Author) commented Mar 2, 2021

We need the entire frame buffered in order to decrypt it. So we can't allow the user to dictate the buffer size.

I don't understand this. Can you elaborate? Is it because the user has to pass a minimum size buffer in to decrypt the frame?

wfurt (Member) commented Mar 2, 2021

BTW the handshake buffer should be returned when the handshake finishes, so it may not really matter. We could unify them, but the renegotiation path would need to be handled carefully, e.g. when the handshake starts again during a read/write.

wfurt (Member) commented Mar 2, 2021

It feels like this could be driven by the buffer size passed to Read, right? If somebody passes a buffer size of 1, we probably do not need a huge buffer, while if somebody passes in a large chunk, it suggests that a bulk read (e.g. multiple frames) may make sense. This would have the benefit that it can be related to the HTTP data, e.g. be more flexible based on content rather than set once in the constructor.

The note about the frames: I'm not sure we can always decrypt less than one TLS frame. While the API (SChannel/OpenSSL) is not about frames, the buffering may happen internally; we may not see managed buffers, but the memory may be taken anyway.
The old code would read the 5-byte header to learn the length of the incoming frame before allocating the buffer. But that makes it quite inefficient.

Also note that the buffers are transient and rented from the pool. They are not necessarily linked to the life cycle of the stream.

davidfowl (Member, Author):

It feels like this could be driven by the buffer size passed to Read, right? If somebody passes a buffer size of 1, we probably do not need a huge buffer.

That's right; it's really about controlling the memory usage of the internal buffer. FileStream has a similar optimization, so I took from prior art. If you feel like we could mostly avoid the buffer completely in SslStream without that, then I'll defer to you.

The old code would read the 5-byte header to learn the length of the incoming frame before allocating the buffer. But that makes it quite inefficient.

It's much better from a memory POV, which is why you should let the caller control it so that they can optimize for the scenario they care about. In ASP.NET Core, the Stream that SslStream wraps doesn't make syscalls; it's buffered, so that cost is low compared to wrapping a NetworkStream and making syscalls per read.

Also note that the buffers are transient and rented from pool. That are not necessarily linked to life cycle of the stream.

Take a look at the memory profile above. These reads are long running, and the buffers will remain rented from the pool for as long as these reads are pending, which is a long time. The memory allocated today in a typical websocket scenario on the server side is (32K for SslStream + 4K for Kestrel's buffer) * number of websocket connections.

We're extremely memory efficient without TLS for a number of reasons, but here's a working set comparison for the same workload:

5000 websocket connections without TLS - 200MB
5000 websocket connections with TLS - 800MB

This also exacerbates the fact that the Kestrel memory pool doesn't currently shrink (our problem not yours). But this is less of an issue with non-TLS because we basically only use the memory pool once IO is ready.

@wfurt wfurt removed the untriaged New issue has not been triaged by the area owner label Mar 2, 2021
@wfurt wfurt added this to the 6.0.0 milestone Mar 2, 2021
wfurt (Member) commented Mar 2, 2021

I will take a look @davidfowl. I'm remote this week and off-grid next week. But I think we should be able to look into this in 6.0.

geoffkizer (Contributor):

We need the entire frame buffered in order to decrypt it. So we can't allow the user to dictate the buffer size.

I don't understand this. Can you elaborate? Is it because the user has to pass a minimum size buffer in to decrypt the frame?

We have to pass the entire frame to SCHANNEL. Otherwise it can't decrypt the frame, and will give us an error like SEC_E_INCOMPLETE_MESSAGE.

davidfowl (Member, Author):

We have to pass the entire frame to SCHANNEL. Otherwise it can't decrypt the frame, and will give us an error like SEC_E_INCOMPLETE_MESSAGE.

I see, so we can read the frame header to get the length and determine whether we can fulfill the request using the user's buffer, or whether we need an internal buffer to complete the operation successfully.
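For reference, the 5-byte TLS record header that would drive this decision is 1 byte of content type, 2 bytes of protocol version, and a 2-byte big-endian payload length (RFC 8446 §5.1). A sketch of parsing it (illustrative only, not SslStream's actual code):

```csharp
using System;
using System.Buffers.Binary;

// Parse the 5-byte TLS record header to learn the payload length before
// deciding whether the caller's buffer can hold the whole frame.
static (byte ContentType, ushort Version, ushort PayloadLength) ParseRecordHeader(ReadOnlySpan<byte> header)
{
    if (header.Length < 5)
        throw new ArgumentException("TLS record header is 5 bytes.", nameof(header));

    byte contentType = header[0];                                           // 23 = application_data
    ushort version = BinaryPrimitives.ReadUInt16BigEndian(header.Slice(1)); // e.g. 0x0303 on the wire
    ushort length = BinaryPrimitives.ReadUInt16BigEndian(header.Slice(3));  // payload bytes that follow
    return (contentType, version, length);
}

// Example: application_data record, TLS 1.2 wire version, 16-byte payload.
var info = ParseRecordHeader(new byte[] { 23, 0x03, 0x03, 0x00, 0x10 });
```

With the payload length in hand, the implementation could compare it against the user's buffer length before committing to a rented internal buffer.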

geoffkizer (Contributor):

One thing we could potentially do here is release the SslStream read buffer back to the pool when it becomes empty. This would add a small cost of acquiring/releasing the buffer in scenarios where we are reading in a tight loop, but it's probably not that expensive -- we should just measure to be sure.

That said, I'm not sure how much this helps in practice. It only helps when the user's read code does something like this:
(1) call SslStream.ReadAsync, causing us to allocate the internal buffer and do a read on the underlying stream to fill the internal buffer with at least one frame
(2) keep calling SslStream.ReadAsync until there's no data left to be consumed, causing the buffer to become empty and be released
(3) don't issue another SslStream.ReadAsync yet
(4) at some later point, issue another SslStream.ReadAsync, which will go allocate the internal buffer again, etc...

It could be that WebSocket or something else that we care about would fit this pattern; not sure... thoughts?
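The release-when-empty idea boils down to tying the pooled buffer's lifetime to "has buffered data" rather than "stream is alive". A rough, self-contained sketch of that pattern (the class and its members are invented for illustration, not SslStream internals):

```csharp
using System;
using System.Buffers;

// Illustrative pattern: rent the read buffer on demand and return it as soon
// as it is drained, so an idle connection holds no pooled memory.
sealed class OnDemandReadBuffer
{
    private byte[]? _buffer;   // null while no data is buffered
    private int _start, _end;  // unread region within _buffer

    // (1) and (4): (re)acquire the buffer lazily on the next real read.
    // The caller writes produced bytes starting at offset _end.
    public Span<byte> EnsureCapacity(int size)
    {
        _buffer ??= ArrayPool<byte>.Shared.Rent(size);
        return _buffer.AsSpan();
    }

    public void Advance(int produced) => _end += produced;

    // (2): copy out buffered data; once empty, return the array to the pool.
    public int Consume(Span<byte> destination)
    {
        int n = Math.Min(_end - _start, destination.Length);
        _buffer!.AsSpan(_start, n).CopyTo(destination);
        _start += n;
        if (_start == _end)
        {
            ArrayPool<byte>.Shared.Return(_buffer!);
            _buffer = null;
            _start = _end = 0;
        }
        return n;
    }
}
```

The cost Geoff mentions is the extra Rent/Return pair per "drain then idle" cycle; in a tight read loop the buffer would bounce in and out of the pool, which is why measuring matters.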

geoffkizer (Contributor):

so we can read the frame header to get the length

We want to read as much as we can in a single read. Syscalls are expensive, as you know :)

davidfowl (Member, Author):

We want to read as much as we can in a single read. Syscalls are expensive, as you know :)

See the comment above:

In ASP.NET Core, the Stream that SSLStream wraps doesn't make sys calls, it's buffered so that cost is low compared to if you were wrapping a NetworkStream and making sys calls per read.

wfurt (Member) commented Mar 2, 2021

so we can read the frame header to get the length

We want to read as much as we can in a single read. Syscalls are expensive, as you know :)

That is the trade-off IMHO: read ahead (and maybe use more memory), or read only what we need.
Reading the 5-byte header may not incur a syscall if the underlying stream reads ahead. Perhaps this could be a configuration option when the underlying stream is buffered in memory.

It probably does not belong in SslStream, but I have seen systems where the perf-vs-memory trade-off changes depending on system state, e.g. fast by default, cheap if resources are low.

davidfowl (Member, Author):

Stepping back for a minute, I would say that while the BCL components do have to make tradeoffs to be as generic as possible, they should also be as policy-free as they can be, and where they can't, let the caller choose. In this case, I'd like to trade off throughput for memory.

geoffkizer (Contributor):

I'd like to trade off throughput for memory.

I agree, that's a reasonable thing to do; the trick is figuring out how to do this effectively.

geoffkizer (Contributor):

I think it would help me if I understood better the scenarios where you would want to trade off throughput for memory. Is it for WebSockets that go idle? Something else?

davidfowl (Member, Author):

I think it would help me if I understood better the scenarios where you would want to trade off throughput for memory. Is it for WebSockets that go idle? Something else?

Yep, that's the typical scenario: lots of concurrent connections with very low traffic, where a connected client/device is waiting for a server-initiated notification. That's the scenario I'm currently testing.

We optimize for this scenario at multiple layers in the stack by doing zero byte reads, but we need to do it in a couple more places to make it more optimal.

geoffkizer (Contributor):

We optimize for this scenario at multiple layers in the stack by doing zero byte reads

Yeah, I almost mentioned this... I wonder if some sort of zero byte read support in SslStream would be useful here.

scalablecory (Contributor):

SslStream limitations seem somewhat recurring; exposing more primitives here seems worth exploring.

Wraith2 (Contributor) commented Mar 2, 2021

SslStream allocations are high on the list for SqlClient performance as well, improvements here would be welcome.

davidfowl (Member, Author):

@geoffkizer

Yeah, I almost mentioned this... I wonder if some sort of zero byte read support in SslStream would be useful here.

It already supports it. #37122

I think you're onto something, here's a strawman:

  • We can use the 0-byte read as an indicator to read only the header, and not the payload, of the TLS record.
  • Once the header read completes, we store the length of the payload and return to the caller, who will provide a buffer to read the payload.
  • If the user-provided buffer is big enough to fit the frame, we use it directly.
  • If the user-provided buffer is too small for the frame, we rent an internal buffer (size TBD), read the frame into it, and copy it into the user buffer (and keep filling the buffer; see "SslStream reads TLS records one frame at a time" #49000).
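Putting the strawman's steps together as a control-flow sketch (the Decrypt placeholder and overall shape are hypothetical; real code would hand the frame to the TLS stack and also handle the record's content type and any leftover decrypted data):

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Placeholder for the TLS decrypt call; real code uses SChannel/OpenSSL.
static void Decrypt(Span<byte> frame) { /* decrypt in place */ }

// Hypothetical header-first read path.
static async ValueTask<int> ReadFrameAsync(Stream inner, Memory<byte> userBuffer)
{
    byte[] header = new byte[5];
    await inner.ReadExactlyAsync(header);               // read only the 5-byte record header
    int payloadLength = (header[3] << 8) | header[4];   // payload size, big-endian

    if (payloadLength <= userBuffer.Length)             // user buffer fits the frame: use it directly
    {
        await inner.ReadExactlyAsync(userBuffer.Slice(0, payloadLength));
        Decrypt(userBuffer.Span.Slice(0, payloadLength));
        return payloadLength;
    }

    // Frame is larger than the user's buffer: rent, decrypt, copy what fits.
    byte[] rented = ArrayPool<byte>.Shared.Rent(payloadLength);
    try
    {
        await inner.ReadExactlyAsync(rented.AsMemory(0, payloadLength));
        Decrypt(rented.AsSpan(0, payloadLength));
        int n = Math.Min(payloadLength, userBuffer.Length);
        rented.AsSpan(0, n).CopyTo(userBuffer.Span);
        return n; // leftover decrypted bytes would have to stay buffered for the next read
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(rented);
    }
}
```

The memory win is that the large internal buffer is only rented when the user's buffer can't hold a whole frame, at the cost of an extra (small) read per record when the inner stream isn't buffered.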

geoffkizer (Contributor) commented Mar 2, 2021

It already supports it. #37122

Didn't know that... awesome!

I think we could do something even simpler here...

When the user does a 0 byte read, and there's nothing in the internal buffer, then we can:
(1) release the existing buffer back to the ArrayPool
(2) Do a zero byte read on the underlying stream
(3) When that completes, complete the zero byte read to the user
(4) The next read operation will allocate the internal buffer from the ArrayPool again

Thoughts?

Edit: As you said above, we can use the 0 byte read as an "indicator" of user desire here... specifically that they don't want buffers tied up until there is data available, and they are willing to trade off at least a small amount of throughput for this.

scalablecory (Contributor):

It already supports it. #37122

Didn't know that... awesome!

I think we could do something even simpler here...

When the user does a 0 byte read, and there's nothing in the internal buffer, then we can:
(1) release the existing buffer back to the ArrayPool
(2) Do a zero byte read on the underlying stream
(3) When that completes, complete the zero byte read to the user
(4) The next read operation will allocate the internal buffer from the ArrayPool again

Thoughts?

Keep in mind SslStream is built on top of Stream, not Socket... 0-byte reads are not part of Stream's guarantees. We'd either special case it for NetworkStream, or have some flag the user can set to enable the behavior.

davidfowl (Member, Author) commented Mar 3, 2021

@geoffkizer

Yes that works as well. I guess we'd end up allocating the buffer only until the entire frame is read.

@scalablecory

Keep in mind SslStream is built on top of Stream, not Socket... 0-byte reads are not part of Stream's guarantees. We'd either special case it for NetworkStream, or have some flag the user can set to enable the behavior.

Kestrel doesn't use a NetworkStream, that wouldn't be viable.

I discussed this with @stephentoub this morning, and he thinks it might be possible to retcon the definition to allow 0-byte reads on Streams to not be an error.

If we're not OK passing the 0-byte read on to the underlying stream, then we could go with the approach where we read the header; it's 5 bytes, so it's within the range 😄

stephentoub (Member) commented Mar 3, 2021

We'd either special case it for NetworkStream

Special-case what? SslStream provides the "wait for data" semantics if the stream it's wrapping does as well. So as long as Kestrel uses such a stream, everything "just works". If someone wants to substitute their own custom stream, they would need to support 0-byte reads as well. If they're using a stream today that doesn't and Kestrel starts depending on it, yeah, they could be broken, but ASP.NET has taken much larger breaking changes :)

I discussed this with @stephentoub this morning and he was thinks it might be possible to retcon the definition to allow 0 byte reads to Streams to not be an error.

0-byte reads shouldn't be an error. Behavior is either:

  • Wait for data
  • Return 0 immediately

There are basically three kinds of streams:

  1. Streams that represent connected endpoints, e.g. NetworkStream, PipeStream, FileStream around a connected handle.
  2. Streams that wrap a fixed data source, e.g. FileStream for a disk file, MemoryStream.
  3. Streams that delegate to other streams, e.g. SslStream, DeflateStream.

Almost every stream we have in category (1) supports 0-byte reads: I'm aware of one or two cases where that's not the case, but I think we can consider that a bug and just fix it (especially since in one case it's an inconsistency across platforms). (2) isn't relevant here. (3) is where I've seen the most inconsistency, but in most cases 0 isn't special-cased and it just exposes the underlying behavior.

I think we should update guidance around Streams to clarify expected behavior for 0-byte reads.

  • Category (1): the wait for data behavior
  • Category (2): return immediately
  • Category (3): behavior of the wrapped stream

and ensure our Streams map accordingly.

But, I don't think any of this really matters for the discussion at hand. Kestrel just needs to state that Streams it uses need to support 0-byte read behavior.
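For category (3), that guidance amounts to: a 0-byte ReadAsync on the wrapper should, when nothing is buffered, become a 0-byte ReadAsync on the inner stream, so the caller inherits whatever wait-for-data semantics the inner stream has. A minimal delegating-stream sketch (illustrative; not the runtime's implementation):

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Minimal read-only delegating stream illustrating the zero-byte-read guidance.
sealed class ForwardingStream : Stream
{
    private readonly Stream _inner;
    private int _buffered; // bytes already buffered by the wrapper (always 0 in this sketch)

    public ForwardingStream(Stream inner) => _inner = inner;

    public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken ct = default)
    {
        if (buffer.Length == 0 && _buffered == 0)
        {
            // No upfront "if (count == 0) return 0" guard: delegate, so a
            // wait-for-data inner stream makes this a cheap readiness signal.
            return await _inner.ReadAsync(Memory<byte>.Empty, ct);
        }
        return await _inner.ReadAsync(buffer, ct);
    }

    public override int Read(byte[] buffer, int offset, int count) =>
        ReadAsync(buffer.AsMemory(offset, count)).AsTask().GetAwaiter().GetResult();

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

A category (2) inner stream such as MemoryStream completes the 0-byte read immediately; a category (1) stream like NetworkStream would make it wait for data, which is the behavior the websocket scenario wants.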

geoffkizer (Contributor):

SslStream provides the "wait for data" semantics if the stream it's wrapping does as well.

I'm looking at the code and I don't see where it's actually relying on a zero-byte read from the underlying stream. That is, SslStream is correctly implementing "wait for data" but it's doing it by simply always reading into its internal buffer.

Code here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs#L754

So I think the win we could harvest here is to do the following:

If the user does a zero-byte read, and our internal buffer is empty (of both encrypted and decrypted data) and thus returned to the pool, then do a zero-byte read against the underlying stream and wait for it to return, before we allocate the internal buffer.

This would mean that we wouldn't be holding on to the internal buffer during the zero byte read.

Thoughts?

stephentoub (Member):

That is, SslStream is correctly implementing "wait for data" but it's doing it by simply always reading into its internal buffer

Exactly. SslStream is in my category (3).

stephentoub (Member) commented Mar 3, 2021

then do a zero-byte read against the underlying stream and wait for it to return, before we allocate the internal buffer

I think that's reasonable. We just need to be aware that we may be wrapping a stream that doesn't have the wait for data semantics. In practice I think what that really means is, whereas we might otherwise have chosen to do a synchronous read (since we know there's data available), we still need to do the subsequent read asynchronously, just in case there wasn't actually data available.

geoffkizer (Contributor) commented Mar 3, 2021

SslStream is in my category (3).

And to be clear, what are you proposing we do for category (3) in general, in terms of guidance/conformance tests/etc?

Are you saying these should do what I'm proposing for SslStream above? That is:

When the user does a zero-byte read, and the stream needs more data from the underlying stream, then the wrapping stream should:
(1) Issue a zero-byte read against the underlying stream
(2) Try to minimize buffer/resource usage (e.g. free internal buffers) during the time between when the user does that zero-byte read and when the zero-byte read from the underlying stream completes

I can imagine adding a test for this to the conformance tests.

Edit: To be clear, a conformance test could test for (1) above; (2) is an implementation detail not directly visible to the user.

geoffkizer (Contributor):

We just need to be aware that we may be wrapping a stream that doesn't have the wait for data semantics.

Yep, but as you say, the result here is benign and we would end up doing basically what we do today, just with the added cost of a zero-byte read that always succeeds immediately.

stephentoub (Member):

And to be clear, what are you proposing we do for category (3) in general, in terms of guidance/conformance tests/etc?

Basically, don't add an upfront guard like if (count == 0) return 0; or the equivalent. Instead, a wrapper stream asked for 0 bytes should end up asking the underlying stream for 0 bytes (if the wrapper didn't already have some buffered).

geoffkizer (Contributor):

a wrapper stream asked for 0 bytes should end up asking the underlying stream for 0 bytes (if the wrapper didn't already have some buffered).

That's my bullet (1) above. I think suggesting (2) is useful as well, because that's what would actually help with the original scenario here.

don't add an upfront guard like if (count == 0) return 0; or the equivalent.

I think it's more complicated than that in general; wrapping streams are typically managing internal buffers etc. But that's a detail.

stephentoub (Member):

I think it's more complicated than that in general; wrapping streams are typically managing internal buffers etc. But that's a detail.

That's what I meant by "or the equivalent" :)

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 4, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 7, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Apr 6, 2021

6 participants