
Reduce memory usage by SSLStream #49019

Closed
davidfowl opened this issue Mar 2, 2021 · 34 comments · Fixed by #49123

davidfowl (Member) commented Mar 2, 2021

Description

For long-running scenarios where reads are pending and data is only occasionally sent back and forth, the memory cost of using SslStream is high. Today SslStream allocates an 8K buffer for the handshake and a 32K buffer per read. It optimizes for reading the TLS header and payload into the same buffer and, in the end, copies that to the user-specified buffer.

Regression?

No

Data

A demo of 5000 websocket connections connected to an ASP.NET Core server using Kestrel:

Here's a heap dump sorted by inclusive size. Most of what is being held onto is array pool buffers.
[screenshot]

Allocation profile of 5000 websocket connections. You can see the array pool has the most byte[] allocations.

[screenshot]

They're mostly in the handshake and reading (I'm not writing in this app), just sitting mostly idle with ping pongs:

[screenshot]

Analysis

We could support specifying the buffer size for SslStream, much like FileStream, so that the user-provided buffer can be used instead of the internal buffer to reduce memory usage. A bufferSize of 1 would signal to the SslStream that the internal buffer shouldn't be used.

I don't think we need a handshake buffer size...

The problem is SslStream has 5 constructors...

public class SslStream
{
+ public SslStream(Stream innerStream, bool leaveInnerStreamOpen, RemoteCertificateValidationCallback userCertificateValidationCallback, LocalCertificateSelectionCallback? userCertificateSelectionCallback, EncryptionPolicy encryptionPolicy, int bufferSize)
}
@davidfowl davidfowl added the tenet-performance Performance related issue label Mar 2, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net.Security untriaged New issue has not been triaged by the area owner labels Mar 2, 2021
ghost commented Mar 2, 2021

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.


geoffkizer (Contributor):

We need the entire frame buffered in order to decrypt it. So we can't allow the user to dictate the buffer size.

geoffkizer (Contributor):

I do wish we used the same buffer for the handshake and the app data.

davidfowl (Member, Author) commented Mar 2, 2021

We need the entire frame buffered in order to decrypt it. So we can't allow the user to dictate the buffer size.

I don't understand this. Can you elaborate? Is it because the user has to pass a minimum size buffer in to decrypt the frame?

wfurt (Member) commented Mar 2, 2021

BTW the handshake buffer should be returned when the handshake finishes, so it may not really matter. We could unify them, but the renegotiation path would need to be handled carefully, e.g. when the handshake starts again during a read/write.

wfurt (Member) commented Mar 2, 2021

It feels like this could be driven by the buffer size passed to Read, right? If somebody passes a buffer size of 1, we probably do not need a huge buffer, while if somebody passes in a large chunk, it suggests that a bulk read (e.g. multiple frames) may make sense. This would have the benefit that it can be related to the HTTP data, e.g. be more flexible based on content rather than set once in the constructor.

The note about the frames: I'm not sure we can always decrypt less than one TLS frame. While the API (SChannel/OpenSSL) is not about frames, the buffering may happen internally; we may not see managed buffers, but the memory may be taken anyway.
The old code would read the 5-byte header to learn the length of the incoming frame before allocating the buffer. But that makes it quite inefficient.

Also note that the buffers are transient and rented from the pool. They are not necessarily linked to the life cycle of the stream.

davidfowl (Member, Author):

It feels like this could be driven by the buffer size passed to Read, right? If somebody passes a buffer size of 1, we probably do not need a huge buffer.

That's right; it's really about controlling the memory usage of the internal buffer. FileStream has a similar optimization, so I took from prior art. If you feel like we could mostly avoid the buffer completely in SslStream without that, then I'll defer to you.

The old code would read the 5-byte header to learn the length of the incoming frame before allocating the buffer. But that makes it quite inefficient.

It's much better from a memory POV, which is why you should let the caller control it so that they can optimize for the scenario they care about. In ASP.NET Core, the Stream that SslStream wraps doesn't make syscalls; it's buffered, so that cost is low compared to wrapping a NetworkStream and making syscalls per read.

Also note that the buffers are transient and rented from pool. That are not necessarily linked to life cycle of the stream.

Take a look at the memory profile above. These reads are long running, and the buffers will remain rented from the pool for as long as these reads are pending, which is a long time. The memory allocated today in a typical websocket scenario on the server side is (32K for SslStream + 4K for Kestrel's buffer) * number of websocket connections.

We're extremely memory efficient without TLS for a number of reasons, but here's a working set comparison for the same workload:

5000 websocket connections without TLS - 200MB
5000 websocket connections with TLS - 800MB

This also exacerbates the fact that the Kestrel memory pool doesn't currently shrink (our problem not yours). But this is less of an issue with non-TLS because we basically only use the memory pool once IO is ready.

@wfurt wfurt removed the untriaged New issue has not been triaged by the area owner label Mar 2, 2021
@wfurt wfurt added this to the 6.0.0 milestone Mar 2, 2021
wfurt (Member) commented Mar 2, 2021

I will take a look @davidfowl. I'm remote this week and off-grid next week. But I think we should be able to look into this in 6.0.

geoffkizer (Contributor):

We need the entire frame buffered in order to decrypt it. So we can't allow the user to dictate the buffer size.

I don't understand this. Can you elaborate? Is it because the user has to pass a minimum size buffer in to decrypt the frame?

We have to pass the entire frame to SCHANNEL. Otherwise it can't decrypt the frame, and will give us an error like SEC_E_INCOMPLETE_MESSAGE.

davidfowl (Member, Author):

We have to pass the entire frame to SCHANNEL. Otherwise it can't decrypt the frame, and will give us an error like SEC_E_INCOMPLETE_MESSAGE.

I see, so we can read the frame header to get the length and determine whether we can fulfill the request using the user's buffer, or whether we need an internal buffer to complete the operation successfully.
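For reference, the 5-byte TLS record header that would drive this decision is 1 byte of content type, 2 bytes of protocol version, and a 2-byte big-endian payload length (RFC 8446 §5.1). A sketch of parsing it (illustrative only, not SslStream's actual code):

```csharp
using System;
using System.Buffers.Binary;

// Parse the 5-byte TLS record header to learn the payload length before
// deciding whether the caller's buffer can hold the whole frame.
static (byte ContentType, ushort Version, ushort PayloadLength) ParseRecordHeader(ReadOnlySpan<byte> header)
{
    if (header.Length < 5)
        throw new ArgumentException("TLS record header is 5 bytes.", nameof(header));

    byte contentType = header[0];                                           // 23 = application_data
    ushort version = BinaryPrimitives.ReadUInt16BigEndian(header.Slice(1)); // e.g. 0x0303 on the wire
    ushort length = BinaryPrimitives.ReadUInt16BigEndian(header.Slice(3));  // payload bytes that follow
    return (contentType, version, length);
}

// Example: application_data record, TLS 1.2 wire version, 16-byte payload.
var info = ParseRecordHeader(new byte[] { 23, 0x03, 0x03, 0x00, 0x10 });
```

With the payload length in hand, the implementation could compare it against the user's buffer length before committing to a rented internal buffer.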

geoffkizer (Contributor):

One thing we could potentially do here is release the SslStream read buffer back to the pool when it becomes empty. This would add a small cost of acquiring/releasing the buffer in scenarios where we are reading in a tight loop, but it's probably not that expensive -- we should just measure to be sure.

That said, I'm not sure how much this helps in practice. It only helps when the user's read code does something like this:
(1) call SslStream.ReadAsync, causing us to allocate the internal buffer and do a read on the underlying stream to fill the internal buffer with at least one frame
(2) keep calling SslStream.ReadAsync until there's no data left to be consumed, causing the buffer to become empty and be released
(3) don't issue another SslStream.ReadAsync yet
(4) at some later point, issue another SslStream.ReadAsync, which will go allocate the internal buffer again, etc...

It could be that WebSocket or something else that we care about would fit this pattern; not sure... thoughts?
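The release-when-empty idea boils down to tying the pooled buffer's lifetime to "has buffered data" rather than "stream is alive". A rough, self-contained sketch of that pattern (the class and its members are invented for illustration, not SslStream internals):

```csharp
using System;
using System.Buffers;

// Illustrative pattern: rent the read buffer on demand and return it as soon
// as it is drained, so an idle connection holds no pooled memory.
sealed class OnDemandReadBuffer
{
    private byte[]? _buffer;   // null while no data is buffered
    private int _start, _end;  // unread region within _buffer

    // (1) and (4): (re)acquire the buffer lazily on the next real read.
    // The caller writes produced bytes starting at offset _end.
    public Span<byte> EnsureCapacity(int size)
    {
        _buffer ??= ArrayPool<byte>.Shared.Rent(size);
        return _buffer.AsSpan();
    }

    public void Advance(int produced) => _end += produced;

    // (2): copy out buffered data; once empty, return the array to the pool.
    public int Consume(Span<byte> destination)
    {
        int n = Math.Min(_end - _start, destination.Length);
        _buffer!.AsSpan(_start, n).CopyTo(destination);
        _start += n;
        if (_start == _end)
        {
            ArrayPool<byte>.Shared.Return(_buffer!);
            _buffer = null;
            _start = _end = 0;
        }
        return n;
    }
}
```

The cost Geoff mentions is the extra Rent/Return pair per "drain then idle" cycle; in a tight read loop the buffer would bounce in and out of the pool, which is why measuring matters.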

geoffkizer (Contributor):

so we can read the frame header to get the length

We want to read as much as we can in a single read. Syscalls are expensive, as you know :)

davidfowl (Member, Author):

We want to read as much as we can in a single read. Syscalls are expensive, as you know :)

See the comment above:

In ASP.NET Core, the Stream that SSLStream wraps doesn't make sys calls, it's buffered so that cost is low compared to if you were wrapping a NetworkStream and making sys calls per read.

wfurt (Member) commented Mar 2, 2021

so we can read the frame header to get the length

We want to read as much as we can in a single read. Syscalls are expensive, as you know :)

That is the trade-off IMHO: read ahead (and maybe use more memory), or read only what we need.
Reading the 5-byte header may not incur a syscall if the underlying stream reads ahead. Perhaps this could be a configuration option when the underlying stream is buffered in memory.

It probably does not belong in SslStream, but I have seen systems where the perf-vs-memory trade-off changes depending on system state, e.g. fast by default, cheap if resources are low.

davidfowl (Member, Author):

Stepping back for a minute, I would say that while the BCL components do have to make tradeoffs to be as generic as possible, they should also be as policy-free as they can be, and where they can't, let the caller choose. In this case, I'd like to trade off throughput for memory.

geoffkizer (Contributor):

I'd like to trade off throughput for memory.

I agree, that's a reasonable thing to do; the trick is figuring out how to do this effectively.

geoffkizer (Contributor):

I think it would help me if I understood better the scenarios where you would want to trade off throughput for memory. Is it for WebSockets that go idle? Something else?

davidfowl (Member, Author):

I think it would help me if I understood better the scenarios where you would want to trade off throughput for memory. Is it for WebSockets that go idle? Something else?

Yep, that's the typical scenario: lots of concurrent connections with very low traffic, where a connected client/device is waiting for a server-initiated notification. That's the scenario I'm currently testing.

We optimize for this scenario at multiple layers in the stack by doing zero byte reads, but we need to do it in a couple more places to make it more optimal.

geoffkizer (Contributor):

We optimize for this scenario at multiple layers in the stack by doing zero byte reads

Yeah, I almost mentioned this... I wonder if some sort of zero byte read support in SslStream would be useful here.

scalablecory (Contributor):

SslStream limitations seem somewhat recurring; exposing more primitives here seems worth exploring.

Wraith2 (Contributor) commented Mar 2, 2021

SslStream allocations are high on the list for SqlClient performance as well, improvements here would be welcome.

davidfowl (Member, Author):

@geoffkizer

Yeah, I almost mentioned this... I wonder if some sort of zero byte read support in SslStream would be useful here.

It already supports it. #37122

I think you're onto something, here's a strawman:

  • We can use the 0-byte read as an indicator to read only the header, and not the payload, of the TLS record.
  • Once the header read completes, we store the length of the payload and return to the caller, who will provide a buffer to read the payload.
  • If the user-provided buffer is big enough to fit the frame, we use it directly.
  • If the user-provided buffer is too small for the frame, we rent an internal buffer (size TBD), read the frame into it, and copy it into the user buffer (and keep filling the buffer; see "SslStream reads TLS records one frame at a time" #49000).
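Putting the strawman's steps together as a control-flow sketch (the Decrypt placeholder and overall shape are hypothetical; real code would hand the frame to the TLS stack and also handle the record's content type and any leftover decrypted data):

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Placeholder for the TLS decrypt call; real code uses SChannel/OpenSSL.
static void Decrypt(Span<byte> frame) { /* decrypt in place */ }

// Hypothetical header-first read path.
static async ValueTask<int> ReadFrameAsync(Stream inner, Memory<byte> userBuffer)
{
    byte[] header = new byte[5];
    await inner.ReadExactlyAsync(header);               // read only the 5-byte record header
    int payloadLength = (header[3] << 8) | header[4];   // payload size, big-endian

    if (payloadLength <= userBuffer.Length)             // user buffer fits the frame: use it directly
    {
        await inner.ReadExactlyAsync(userBuffer.Slice(0, payloadLength));
        Decrypt(userBuffer.Span.Slice(0, payloadLength));
        return payloadLength;
    }

    // Frame is larger than the user's buffer: rent, decrypt, copy what fits.
    byte[] rented = ArrayPool<byte>.Shared.Rent(payloadLength);
    try
    {
        await inner.ReadExactlyAsync(rented.AsMemory(0, payloadLength));
        Decrypt(rented.AsSpan(0, payloadLength));
        int n = Math.Min(payloadLength, userBuffer.Length);
        rented.AsSpan(0, n).CopyTo(userBuffer.Span);
        return n; // leftover decrypted bytes would have to stay buffered for the next read
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(rented);
    }
}
```

The memory win is that the large internal buffer is only rented when the user's buffer can't hold a whole frame, at the cost of an extra (small) read per record when the inner stream isn't buffered.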

geoffkizer (Contributor) commented Mar 2, 2021

It already supports it. #37122

Didn't know that... awesome!

I think we could do something even simpler here...

When the user does a 0 byte read, and there's nothing in the internal buffer, then we can:
(1) release the existing buffer back to the ArrayPool
(2) Do a zero byte read on the underlying stream
(3) When that completes, complete the zero byte read to the user
(4) The next read operation will allocate the internal buffer from the ArrayPool again

Thoughts?

Edit: As you said above, we can use the 0 byte read as an "indicator" of user desire here... specifically that they don't want buffers tied up until there is data available, and they are willing to trade off at least a small amount of throughput for this.

scalablecory (Contributor):

It already supports it. #37122

Didn't know that... awesome!

I think we could do something even simpler here...

When the user does a 0 byte read, and there's nothing in the internal buffer, then we can:
(1) release the existing buffer back to the ArrayPool
(2) Do a zero byte read on the underlying stream
(3) When that completes, complete the zero byte read to the user
(4) The next read operation will allocate the internal buffer from the ArrayPool again

Thoughts?

Keep in mind SslStream is built on top of Stream, not Socket... 0-byte reads are not part of Stream's guarantees. We'd either special case it for NetworkStream, or have some flag the user can set to enable the behavior.

davidfowl (Member, Author) commented Mar 3, 2021

@geoffkizer

Yes that works as well. I guess we'd end up allocating the buffer only until the entire frame is read.

@scalablecory

Keep in mind SslStream is built on top of Stream, not Socket... 0-byte reads are not part of Stream's guarantees. We'd either special case it for NetworkStream, or have some flag the user can set to enable the behavior.

Kestrel doesn't use a NetworkStream, that wouldn't be viable.

I discussed this with @stephentoub this morning, and he thinks it might be possible to retcon the definition to allow 0-byte reads on Streams to not be an error.

If we're not OK passing the 0-byte read on to the underlying stream, then we could go with the approach where we read the header; it's 5 bytes, so it's within the range 😄

stephentoub (Member) commented Mar 3, 2021

We'd either special case it for NetworkStream

Special-case what? SslStream provides the "wait for data" semantics if the stream it's wrapping does as well. So as long as Kestrel uses such a stream, everything "just works". If someone wants to substitute their own custom stream, they would need to support 0-byte reads as well. If they're using a stream today that doesn't and Kestrel starts depending on it, yeah, they could be broken, but ASP.NET has taken much larger breaking changes :)

I discussed this with @stephentoub this morning and he was thinks it might be possible to retcon the definition to allow 0 byte reads to Streams to not be an error.

0-byte reads shouldn't be an error. Behavior is either:

  • Wait for data
  • Return 0 immediately

There are basically three kinds of streams:

  1. Streams that represent connected endpoints, e.g. NetworkStream, PipeStream, FileStream around a connected handle.
  2. Streams that wrap a fixed data source, e.g. FileStream for a disk file, MemoryStream.
  3. Streams that delegate to other streams, e.g. SslStream, DeflateStream.

Almost every stream we have in category (1) supports 0-byte reads: I'm aware of one or two cases where that's not the case, but I think we can consider that a bug and just fix it (especially since in one case it's an inconsistency across platforms). (2) isn't relevant here. (3) is where I've seen the most inconsistency, but in most cases 0 isn't special-cased and it just exposes the underlying behavior.

I think we should update guidance around Streams to clarify expected behavior for 0-byte reads.

  • Category (1): the wait for data behavior
  • Category (2): return immediately
  • Category (3): behavior of the wrapped stream

and ensure our Streams map accordingly.

But, I don't think any of this really matters for the discussion at hand. Kestrel just needs to state that Streams it uses need to support 0-byte read behavior.
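For category (3), that guidance amounts to: a 0-byte ReadAsync on the wrapper should, when nothing is buffered, become a 0-byte ReadAsync on the inner stream, so the caller inherits whatever wait-for-data semantics the inner stream has. A minimal delegating-stream sketch (illustrative; not the runtime's implementation):

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Minimal read-only delegating stream illustrating the zero-byte-read guidance.
sealed class ForwardingStream : Stream
{
    private readonly Stream _inner;
    private int _buffered; // bytes already buffered by the wrapper (always 0 in this sketch)

    public ForwardingStream(Stream inner) => _inner = inner;

    public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken ct = default)
    {
        if (buffer.Length == 0 && _buffered == 0)
        {
            // No upfront "if (count == 0) return 0" guard: delegate, so a
            // wait-for-data inner stream makes this a cheap readiness signal.
            return await _inner.ReadAsync(Memory<byte>.Empty, ct);
        }
        return await _inner.ReadAsync(buffer, ct);
    }

    public override int Read(byte[] buffer, int offset, int count) =>
        ReadAsync(buffer.AsMemory(offset, count)).AsTask().GetAwaiter().GetResult();

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

A category (2) inner stream such as MemoryStream completes the 0-byte read immediately; a category (1) stream like NetworkStream would make it wait for data, which is the behavior the websocket scenario wants.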

geoffkizer (Contributor):

SslStream provides the "wait for data" semantics if the stream it's wrapping does as well.

I'm looking at the code and I don't see where it's actually relying on a zero-byte read from the underlying stream. That is, SslStream is correctly implementing "wait for data" but it's doing it by simply always reading into its internal buffer.

Code here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs#L754

So I think the win we could harvest here is to do the following:

If the user does a zero-byte read, and our internal buffer is empty (of both encrypted and decrypted data) and thus returned to the pool, then do a zero-byte read against the underlying stream and wait for it to return, before we allocate the internal buffer.

This would mean that we wouldn't be holding on to the internal buffer during the zero byte read.

Thoughts?

stephentoub (Member):

That is, SslStream is correctly implementing "wait for data" but it's doing it by simply always reading into its internal buffer

Exactly. SslStream is in my category (3).

stephentoub (Member) commented Mar 3, 2021

then do a zero-byte read against the underlying stream and wait for it to return, before we allocate the internal buffer

I think that's reasonable. We just need to be aware that we may be wrapping a stream that doesn't have the wait for data semantics. In practice I think what that really means is, whereas we might otherwise have chosen to do a synchronous read (since we know there's data available), we still need to do the subsequent read asynchronously, just in case there wasn't actually data available.

geoffkizer (Contributor) commented Mar 3, 2021

SslStream is in my category (3).

And to be clear, what are you proposing we do for category (3) in general, in terms of guidance/conformance tests/etc?

Are you saying these should do what I'm proposing for SslStream above? That is:

When the user does a zero-byte read, and the stream needs more data from the underlying stream, then the wrapping stream should:
(1) Issue a zero-byte read against the underlying stream
(2) Try to minimize buffer/resource usage (e.g. free internal buffers) during the time between when the user does that zero-byte read and when the zero-byte read from the underlying stream completes

I can imagine adding a test for this to the conformance tests.

Edit: To be clear, a conformance test could test for (1) above; (2) is an implementation detail not directly visible to the user.

geoffkizer (Contributor):

We just need to be aware that we may be wrapping a stream that doesn't have the wait for data semantics.

Yep, but as you say, the result here is benign and we would end up doing basically what we do today, just with the added cost of a zero-byte read that always succeeds immediately.

stephentoub (Member):

And to be clear, what are you proposing we do for category (3) in general, in terms of guidance/conformance tests/etc?

Basically, don't add an upfront guard like if (count == 0) return 0; or the equivalent. Instead, a wrapper stream asked for 0 bytes should end up asking the underlying stream for 0 bytes (if the wrapper didn't already have some buffered).

geoffkizer (Contributor):

a wrapper stream asked for 0 bytes should end up asking the underlying stream for 0 bytes (if the wrapper didn't already have some buffered).

That's my bullet (1) above. I think suggesting (2) is useful as well, because that's what would actually help with the original scenario here.

don't add an upfront guard like if (count == 0) return 0; or the equivalent.

I think it's more complicated than that in general; wrapping streams are typically managing internal buffers etc. But that's a detail.

stephentoub (Member):

I think it's more complicated than that in general; wrapping streams are typically managing internal buffers etc. But that's a detail.

That's what I meant by "or the equivalent" :)

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 4, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 7, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Apr 6, 2021

6 participants