
improve buffer handling inside SslStream #49743

Closed · wants to merge 12 commits

Conversation

wfurt
Member

@wfurt wfurt commented Mar 17, 2021

This is a follow-up to the discussion in #49000.
This is really two separate changes plus some cleanup. I might have been able to do them independently, but they are somewhat intertwined, so I did and tested them together.

  1. The issue discussed in "SslStream reads TLS records one frame at a time" #49000: process multiple TLS frames at once when they are already present in the internal buffer (see the sketch after this list).
    Instead of doing it in the PAL as in the original prototype, this does it in ReadAsyncInternal() in a (mostly) platform-independent way.
    Two basic challenges complicated the matter:
  • The decrypted data are always smaller (a frame may even produce 0 bytes), so the decrypted output lands in disjoint fragments, and using a single offset and count becomes problematic. To solve this, I turned the input into a separate input Span<byte>. That also avoids several checks and offset calculations, and it makes the code less confusing IMHO since we no longer use the "output" variable as input. In general, once we have read the frames there is no longer an async boundary, so using Span as much as possible is nice and easy.

  • While processing the extra frames we may reach end of stream, either by actually hitting EOF on the underlying stream or by processing a TLS close_notify. To solve this I added an extra bool so we can return the data we already have and report EOF on the next read.

  1. We can avoid allocating/renting the 16K internal buffer in some cases. As part of the spanification above, I turned _internalBuffer into Memory<byte>. It either points to a rented buffer or it can point to the user-provided buffer if that is large enough. This has three benefits:
  • We can reduce memory utilization. Even when rented, this can add up in aggregate use. We may need to allocate the buffer anyway if we end up with partial frames, but that is no worse than today, and we only need to copy the leftover instead of everything.
  • On Linux & macOS we can write decrypted data directly to the right location without any extra copy. That is not possible on Windows, as Schannel always decrypts in place.
  • We can possibly do much bigger reads from the underlying stream. Right now we are limited to 16K plus whatever extra we get from the rented buffer. Loopback on Linux is 64K, so we may be able to do much bigger reads. That is also true for TCP with a large window and a fast network. When we use the provided buffer, the user can indicate how much memory they are willing to trade. With the example @davidfowl provided, I can now see 50K reads at no extra cost (and no new API).

This starts in ResetReadBuffer(), but with the Memory the impact on the rest of the code is pretty small (see the sketch after this list).

  1. I kicked a few items out of the main processing loop. Processing already-decrypted data should be a no-brainer. I also moved the zero-byte-read handling; that lets the caller provide a real buffer and avoids a copy in some cases. I could possibly limit that to _internalBufferCount == 0, i.e. proceed to the normal read if we already have some data and an allocated buffer.
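
To make the flow above concrete, here is a minimal sketch of ideas 1 and 2 (the type and helper names such as ReadBufferSketch, TryGetNextFrameSize and DecryptFrame are illustrative only and do not match the real SslStream internals; buffer-pool return, renegotiation and async handling are omitted):

```csharp
using System;
using System.Buffers;

internal sealed class ReadBufferSketch
{
    private const int ReadBufferSize = 4096 * 4 + 64; // ~16K payload + frame overhead (illustrative)

    private byte[]? _rentedBuffer;
    private Memory<byte> _internalBuffer;   // rented array OR a slice of the caller's buffer
    private int _internalOffset;            // where encrypted data starts
    private int _internalBufferCount;       // length of encrypted data
    private bool _receivedEOF;              // remember EOF so buffered data is returned first

    internal bool ReceivedEOF => _receivedEOF;

    // Idea 2: only rent when the caller's buffer is too small to act as the read buffer.
    private void ResetReadBuffer(Memory<byte> callerBuffer)
    {
        if (_internalBuffer.IsEmpty)
        {
            if (callerBuffer.Length >= ReadBufferSize)
            {
                _internalBuffer = callerBuffer;   // read/decrypt straight into user memory, no rent, no copy
            }
            else
            {
                _rentedBuffer = ArrayPool<byte>.Shared.Rent(ReadBufferSize);
                _internalBuffer = _rentedBuffer;
            }
        }
        else if (_internalOffset > 0)
        {
            // Keep only the leftover partial frame at the front of the buffer.
            _internalBuffer.Slice(_internalOffset, _internalBufferCount).CopyTo(_internalBuffer);
            _internalOffset = 0;
        }
    }

    // Idea 1: drain every complete TLS frame that is already buffered before reading again.
    private int DrainBufferedFrames(Span<byte> destination)
    {
        int copied = 0;
        while (_internalBufferCount > 0 && copied < destination.Length)
        {
            if (!TryGetNextFrameSize(out int frameSize) ||
                frameSize > _internalBufferCount ||
                frameSize > destination.Length - copied)
            {
                // Only a partial frame is buffered, or no room left for its plaintext:
                // stop and let the caller read more from the underlying stream.
                break;
            }

            int decrypted = DecryptFrame(
                _internalBuffer.Span.Slice(_internalOffset, frameSize),
                destination.Slice(copied));

            _internalOffset += frameSize;
            _internalBufferCount -= frameSize;

            if (decrypted == 0 && ReceivedCloseNotify())
            {
                _receivedEOF = true;  // hand back what we already have; the next read returns 0
                break;
            }

            copied += decrypted;
        }

        return copied;
    }

    // Placeholders standing in for the real TLS header parsing and PAL decryption calls.
    private bool TryGetNextFrameSize(out int frameSize) => throw new NotImplementedException();
    private int DecryptFrame(Span<byte> frame, Span<byte> destination) => throw new NotImplementedException();
    private bool ReceivedCloseNotify() => throw new NotImplementedException();
}
```

The real code additionally has to deal with renegotiation, zero-byte reads and returning the rented array to the pool, which the sketch leaves out.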

Testing:

The SchSendAux* tests depend on a particular read size and guess that the AUX record is present (Windows only). That seems generally problematic, and since I did not find any good way to do better, I deleted the tests. I feel that is something the PAL is responsible for, and I don't see big value in the tests.

Running the existing tests (SSL & HTTP), I found it difficult to hit some of the code branches: whatever a test writes to SslStream goes to the memory stream or socket in one piece and shows up whole on the other side. I added RandomIOStream, which creates short reads and writes, i.e. it reads less than asked for and splits writes into multiple writes to the underlying stream (sketched below). That helped me uncover interesting cases I might have missed otherwise.
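
For illustration, a stream wrapper of that kind could look roughly like this (a minimal synchronous sketch of the idea; the actual RandomIOStream in the PR differs and also covers the async paths):

```csharp
using System;
using System.IO;

// Wrap an inner stream, return fewer bytes than requested on reads, and split every
// write into several smaller writes, so TLS record boundaries stop lining up with
// individual I/O calls.
internal sealed class ShortReadWriteStream : Stream
{
    private readonly Stream _inner;
    private readonly Random _random = new Random();

    public ShortReadWriteStream(Stream inner) => _inner = inner;

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (count == 0)
        {
            return _inner.Read(buffer, offset, 0); // preserve zero-byte read semantics
        }

        // Ask the inner stream for a random, possibly smaller amount to force short reads.
        int toRead = _random.Next(1, count + 1);
        return _inner.Read(buffer, offset, toRead);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Split the payload into random chunks so the peer sees fragmented records.
        while (count > 0)
        {
            int chunk = _random.Next(1, count + 1);
            _inner.Write(buffer, offset, chunk);
            offset += chunk;
            count -= chunk;
        }
    }

    public override void Flush() => _inner.Flush();

    public override bool CanRead => _inner.CanRead;
    public override bool CanWrite => _inner.CanWrite;
    public override bool CanSeek => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```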

I still plan to run more benchmarks, but I would like to get some feedback before all of that is completed.

fixes #49000
fixes #49086

contributes to (closed) #49019

@wfurt wfurt requested review from stephentoub and a team March 17, 2021 05:22
@wfurt wfurt self-assigned this Mar 17, 2021
@ghost

ghost commented Mar 17, 2021

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.


@@ -41,6 +41,7 @@ private enum Framing
private const int ReadBufferSize = 4096 * 4 + FrameOverhead; // We read in 16K chunks + headers.
private const int InitialHandshakeBufferSize = 4096 + FrameOverhead; // try to fit at least 4K ServerCertificate
private ArrayBuffer _handshakeBuffer;
private bool receivedEOF;
Member

@davidfowl davidfowl Mar 17, 2021

nit: _receivedEOF

@benaadams
Member

Giving compile errors "Field 'SslStream._rentedBuffer' is never assigned to, and will always have its default value null"

@wfurt
Member Author

wfurt commented Mar 17, 2021

I'm looking. _rentedBuffer is assigned in ResetReadBuffer().


Task reader = Task.Run(() =>
{
SHA256 hash = SHA256.Create();
Member

Suggested change
SHA256 hash = SHA256.Create();
using SHA256 hash = SHA256.Create();

Member

I thought we had statics now

Member

Ah yes, you can just do SHA256.HashData(...). No Create() and no dispose.

Member Author

That works only when you have the whole payload at once. Here we feed the hash one chunk at a time.
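
For reference, the incremental hashing pattern being described looks roughly like this (a generic sketch, not the exact test code in this PR):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

internal static class ChunkedHash
{
    internal static byte[] HashChunks(IEnumerable<byte[]> chunks)
    {
        // Feed the data one chunk at a time; TransformFinalBlock completes the hash.
        // SHA256.HashData is only an option when the whole payload sits in a single buffer.
        using SHA256 hash = SHA256.Create();
        foreach (byte[] chunk in chunks)
        {
            hash.TransformBlock(chunk, 0, chunk.Length, null, 0);
        }
        hash.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
        return hash.Hash!;
    }
}
```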

Comment on lines +545 to +556
public override long Position
{
get
{
return _innerStream.Position;
}

set
{
_innerStream.Position = value;
}
}
Member

Suggested change
public override long Position
{
get
{
return _innerStream.Position;
}
set
{
_innerStream.Position = value;
}
}
public override long Position
{
get => _innerStream.Position;
set => _innerStream.Position = value;
}

byte[] dataToCopy = RandomNumberGenerator.GetBytes(bufferSize);
byte[] srcHash = SHA256.Create().ComputeHash(dataToCopy);

var clientOptions = new SslClientAuthenticationOptions() { TargetHost = "localhost", };
Member

Suggested change
var clientOptions = new SslClientAuthenticationOptions() { TargetHost = "localhost", };
var clientOptions = new SslClientAuthenticationOptions() { TargetHost = "localhost" };

internal int _internalOffset; // This is offset where encrypted data starts.
internal int _internalBufferCount; // Length of encrypted data.
internal int _decryptedBytesOffset; // Offset of decrypted data relative to _internalBuffer.
internal int _decryptedBytesCount; // Number of already decrypted bytes. Really valid only ion combination with _rentedBuffer.
Member

Suggested change
internal int _decryptedBytesCount; // Number of already decrypted bytes. Really valid only ion combination with _rentedBuffer.
internal int _decryptedBytesCount; // Number of already decrypted bytes. Really valid only in combination with _rentedBuffer.

@geoffkizer
Contributor

It feels to me like this PR is mixing together several changes that we should instead implement and measure independently.

In particular:
(1) Process multiple TLS frames in Read (#49000) -- This has little downside that I can see and we should just do it.
(2) Avoid extra copy of decrypted data on Linux/MacOS -- This seems like goodness, but it complicates the logic quite a bit here as it now works very differently for Linux vs Windows, and not just at the PAL layer. (In fact the PAL layer now works very differently for Linux and Windows, and that seems bad.) I think this would be nice to do, but I'd like to (a) do it in a way that doesn't introduce as much complexity, and (b) measure that there is actually a win here.
(3) Use user-provided buffer instead of internal rented buffer when possible. Again, this seems like goodness, but it also seems to introduce a lot of complexity. It's also not clear to me how much of a win this is, since in practice we are going to end up allocating the internal buffer in a lot of cases anyway.

I think it would be better to tackle each one of these independently (starting with (1)), validate the win in benchmarks or user scenarios etc, rinse and repeat.

Thoughts?

Also:

  • Using Span is definitely good, whether in the PAL or elsewhere -- this would have avoided that other issue we found recently
  • I'm fine deleting the SchSendAux tests
  • The RandomStream stuff looks cool!

@geoffkizer
Contributor

In particular for (2):

I think we could have two functions at the PAL layer -- something like DecryptToBuffer and DecryptInPlace.

On Linux, DecryptInPlace would just call DecryptToBuffer with the original buffer.
On Windows, we'd always decrypt in place (since we have to) and then DecryptToBuffer would do the copy to the target buffer.
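
A rough sketch of what that split could look like (the names and signatures here are illustrative only, not the actual PAL surface):

```csharp
using System;

// Illustrative-only shape of the proposed PAL split.
internal static class SslPalSketch
{
    // Unix-like PALs decrypt straight into 'destination'; the Windows implementation
    // would decrypt 'buffer' in place (Schannel requires it) and then copy the
    // plaintext into 'destination'. Returns the number of plaintext bytes produced.
    public static int DecryptToBuffer(Span<byte> buffer, Span<byte> destination)
        => throw new NotImplementedException();

    // Windows decrypts within 'buffer' natively; on Unix this could simply forward to
    // DecryptToBuffer using a slice of 'buffer' itself as the destination.
    public static int DecryptInPlace(Span<byte> buffer, out int plaintextOffset)
        => throw new NotImplementedException();
}
```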

Thoughts?

@geoffkizer
Contributor

I also wonder how some of these (particularly 2 and 3) would relate to implementing a custom BIO, like we have discussed in the past. Is this something we are still considering? How would we prioritize it relative to other stuff here?

@wfurt
Member Author

wfurt commented Mar 18, 2021

macOS, for example, does not use a BIO (and we will need to rewrite it when we go for ALPN & TLS 1.3).
We may consider it for Linux, but AFAIK the perf improvement was small when we tried it last time.

2 & 3 are essentially the same code. For the sake of this PR I can roll that back. We could split Decrypt*() into two parts, but Windows always decrypts in place and Unix always copies. I was originally thinking about hiding the difference in the Windows PAL, but we may end up with an unnecessary copy. I'll think about it more and we can discuss details on the PR.

@geoffkizer
Contributor

macOS, for example, does not use a BIO (and we will need to rewrite it when we go for ALPN & TLS 1.3).

Yeah, but we don't really care about perf for macOS. I mean if we can get a win on macOS too, then great, but it's not a priority.

We may consider it for Linux, but AFAIK the perf improvement was small when we tried it last time.

As I understand it, we are copying all input data into the memory BIO today. A custom BIO would allow us to avoid this copy, and possibly have other benefits as well. It seems like the benefit of that would by definition be more significant than just (2) above. So if the BIO win is insignificant, then (2) is as well; conversely, if we think there may be a win here, then the BIO win seems like it would be larger than (2).

2 & 3 are essentially the same code.

I don't understand this. (3) applies to both Linux and Windows, right? Whereas (2) is only a benefit for Linux.

I was originally thinking about hiding the difference in the Windows PAL, but we may end up with an unnecessary copy. I'll think about it more and we can discuss details on the PR.

Sounds good. In general we should strive to encapsulate the platform differences in the PAL. This is not always possible, but if it's not then at the least we need to expose different plat-specific functions from the PAL, as opposed to having one that looks like it's platform neutral but actually is not.

@alnikola
Contributor

I suspect it might be causing a perf regression in the YARP HTTPS (HTTP 1.1) scenario. Let's investigate that before merging.

@davidfowl
Member

I suspect it might be causing a perf regression in the YARP HTTPS (HTTP 1.1) scenario. Let's investigate that before merging.

What's causing a regression?

@alnikola
Contributor

This PR. I tried running YARP with a privately built System.Net.Security.dll and saw a perf drop, along with some bad responses, in the YARP HTTPS - HTTP (HTTP 1.1) scenario on Windows.

@wfurt wfurt marked this pull request as draft May 11, 2021 19:18
@ghost ghost closed this Jun 10, 2021
@ghost

ghost commented Jun 10, 2021

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@davidfowl
Member

😢

@wfurt
Member Author

wfurt commented Jun 11, 2021

It is still coming, but slowly, @davidfowl.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 11, 2021
@karelz karelz added this to the 6.0.0 milestone Jul 15, 2021
@davidfowl
Member

@wfurt Can we revive this?

@wfurt
Member Author

wfurt commented Sep 26, 2022

Yes, it is on my todo list for .NET 8, @davidfowl. I think I need a better plan for Windows.

@davidfowl
Member

Is there an issue tracking it?

This pull request was closed.