
improve buffer handling inside SslStream #49743

Closed · wants to merge 12 commits

Conversation

wfurt
Member

@wfurt wfurt commented Mar 17, 2021

This is a follow-up to the discussion in #49000.
This is really two separate changes plus some cleanup. I might have been able to do them independently, but they are somewhat intertwined, so I did and tested them together.

  1. The issue discussed in "SslStream reads TLS records one frame at a time" #49000: process multiple TLS frames at once when they are already present in the internal buffer (see the sketch after this list).
    Instead of doing it in the PAL as in the original prototype, this does it in ReadAsyncInternal() in a (mostly) platform-independent way.
    Two basic challenges complicated the matter:
  • The decrypted data are always smaller (a frame may even produce 0 bytes), so the decrypted output lands in disjoint fragments, and using a single offset and count becomes problematic. To solve this, I turned the input into a separate input Span<byte>. That also avoids several checks and offset calculations, and it makes the code less confusing IMHO since we no longer use the "output" variable as input. In general, once we have read the frames there is no longer an async boundary, so using Span as much as possible is nice and easy.

  • While processing the extra frames we may reach end of stream, either by actually hitting EOF on the underlying stream or by processing a TLS close_notify. To solve this I added an extra bool so we can return the data we already have and report EOF on the next read.

  1. We can avoid allocating/renting the 16K internal buffer in some cases. As part of the spanification above, I turned _internalBuffer into Memory<byte>. It either points to a rented buffer or it can point to the user-provided buffer if that is large enough. This has three benefits:
  • We can reduce memory utilization. Even when rented, this can add up in aggregate use. We may need to allocate the buffer anyway if we end up with partial frames, but that is no worse than today, and we only need to copy the leftover instead of everything.
  • On Linux & macOS we can write decrypted data directly to the right location without any extra copy. That is not possible on Windows, as Schannel always decrypts in place.
  • We can possibly do much bigger reads from the underlying stream. Right now we are limited to 16K plus whatever extra we get from the rented buffer. Loopback on Linux is 64K, so we may be able to do much bigger reads. That is also true for TCP with a large window and a fast network. When we use the provided buffer, the user can indicate how much memory they are willing to trade. With the example @davidfowl provided, I can now see 50K reads at no extra cost (and no new API).

This starts in ResetReadBuffer(), but with the Memory the impact on the rest of the code is pretty small (see the sketch after this list).

  1. I kicked a few items out of the main processing loop. Processing already-decrypted data should be a no-brainer. I also moved the zero-byte-read handling; that lets the caller provide a real buffer and avoids a copy in some cases. I could possibly limit that to _internalBufferCount == 0, i.e. proceed to the normal read if we already have some data and an allocated buffer.
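
To make the flow above concrete, here is a minimal sketch of ideas 1 and 2 (the type and helper names such as ReadBufferSketch, TryGetNextFrameSize and DecryptFrame are illustrative only and do not match the real SslStream internals; buffer-pool return, renegotiation and async handling are omitted):

```csharp
using System;
using System.Buffers;

internal sealed class ReadBufferSketch
{
    private const int ReadBufferSize = 4096 * 4 + 64; // ~16K payload + frame overhead (illustrative)

    private byte[]? _rentedBuffer;
    private Memory<byte> _internalBuffer;   // rented array OR a slice of the caller's buffer
    private int _internalOffset;            // where encrypted data starts
    private int _internalBufferCount;       // length of encrypted data
    private bool _receivedEOF;              // remember EOF so buffered data is returned first

    internal bool ReceivedEOF => _receivedEOF;

    // Idea 2: only rent when the caller's buffer is too small to act as the read buffer.
    private void ResetReadBuffer(Memory<byte> callerBuffer)
    {
        if (_internalBuffer.IsEmpty)
        {
            if (callerBuffer.Length >= ReadBufferSize)
            {
                _internalBuffer = callerBuffer;   // read/decrypt straight into user memory, no rent, no copy
            }
            else
            {
                _rentedBuffer = ArrayPool<byte>.Shared.Rent(ReadBufferSize);
                _internalBuffer = _rentedBuffer;
            }
        }
        else if (_internalOffset > 0)
        {
            // Keep only the leftover partial frame at the front of the buffer.
            _internalBuffer.Slice(_internalOffset, _internalBufferCount).CopyTo(_internalBuffer);
            _internalOffset = 0;
        }
    }

    // Idea 1: drain every complete TLS frame that is already buffered before reading again.
    private int DrainBufferedFrames(Span<byte> destination)
    {
        int copied = 0;
        while (_internalBufferCount > 0 && copied < destination.Length)
        {
            if (!TryGetNextFrameSize(out int frameSize) ||
                frameSize > _internalBufferCount ||
                frameSize > destination.Length - copied)
            {
                // Only a partial frame is buffered, or no room left for its plaintext:
                // stop and let the caller read more from the underlying stream.
                break;
            }

            int decrypted = DecryptFrame(
                _internalBuffer.Span.Slice(_internalOffset, frameSize),
                destination.Slice(copied));

            _internalOffset += frameSize;
            _internalBufferCount -= frameSize;

            if (decrypted == 0 && ReceivedCloseNotify())
            {
                _receivedEOF = true;  // hand back what we already have; the next read returns 0
                break;
            }

            copied += decrypted;
        }

        return copied;
    }

    // Placeholders standing in for the real TLS header parsing and PAL decryption calls.
    private bool TryGetNextFrameSize(out int frameSize) => throw new NotImplementedException();
    private int DecryptFrame(Span<byte> frame, Span<byte> destination) => throw new NotImplementedException();
    private bool ReceivedCloseNotify() => throw new NotImplementedException();
}
```

The real code additionally has to deal with renegotiation, zero-byte reads and returning the rented array to the pool, which the sketch leaves out.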

Testing:

The SchSendAux* tests depend on a particular read size and guess that the AUX record is present (Windows only). That seems generally problematic, and since I did not find any good way to do better, I deleted the tests. I feel that is something the PAL is responsible for, and I don't see big value in the tests.

Running the existing tests (SSL & HTTP), I found it difficult to hit some of the code branches: whatever a test writes to SslStream goes to the memory stream or socket in one piece and shows up whole on the other side. I added RandomIOStream, which creates short reads and writes, i.e. it reads less than asked for and splits writes into multiple writes to the underlying stream (sketched below). That helped me uncover interesting cases I might have missed otherwise.
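
For illustration, a stream wrapper of that kind could look roughly like this (a minimal synchronous sketch of the idea; the actual RandomIOStream in the PR differs and also covers the async paths):

```csharp
using System;
using System.IO;

// Wrap an inner stream, return fewer bytes than requested on reads, and split every
// write into several smaller writes, so TLS record boundaries stop lining up with
// individual I/O calls.
internal sealed class ShortReadWriteStream : Stream
{
    private readonly Stream _inner;
    private readonly Random _random = new Random();

    public ShortReadWriteStream(Stream inner) => _inner = inner;

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (count == 0)
        {
            return _inner.Read(buffer, offset, 0); // preserve zero-byte read semantics
        }

        // Ask the inner stream for a random, possibly smaller amount to force short reads.
        int toRead = _random.Next(1, count + 1);
        return _inner.Read(buffer, offset, toRead);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Split the payload into random chunks so the peer sees fragmented records.
        while (count > 0)
        {
            int chunk = _random.Next(1, count + 1);
            _inner.Write(buffer, offset, chunk);
            offset += chunk;
            count -= chunk;
        }
    }

    public override void Flush() => _inner.Flush();

    public override bool CanRead => _inner.CanRead;
    public override bool CanWrite => _inner.CanWrite;
    public override bool CanSeek => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```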

I still plan to run more benchmarks, but I would like to get some feedback before all of that is completed.

fixes #49000
fixes #49086

contributes to (closed) #49019

@wfurt wfurt requested review from stephentoub and a team March 17, 2021 05:22
@wfurt wfurt self-assigned this Mar 17, 2021
@ghost

ghost commented Mar 17, 2021

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.


@@ -41,6 +41,7 @@ private enum Framing
private const int ReadBufferSize = 4096 * 4 + FrameOverhead; // We read in 16K chunks + headers.
private const int InitialHandshakeBufferSize = 4096 + FrameOverhead; // try to fit at least 4K ServerCertificate
private ArrayBuffer _handshakeBuffer;
private bool receivedEOF;
Member

@davidfowl davidfowl Mar 17, 2021

nit: _receivedEOF

@benaadams
Member

Giving compile errors "Field 'SslStream._rentedBuffer' is never assigned to, and will always have its default value null"

@wfurt
Member Author

wfurt commented Mar 17, 2021

I'm looking. _rentedBuffer is assigned in ResetReadBuffer().


Task reader = Task.Run(() =>
{
SHA256 hash = SHA256.Create();
Member

Suggested change
SHA256 hash = SHA256.Create();
using SHA256 hash = SHA256.Create();

Member

I thought we had statics now

Member

Ah yes, you can just do SHA256.HashData(...). No Create() and no dispose.

Member Author

That works only when you have the whole payload at once. Here we feed the hash one chunk at a time.
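
For reference, the incremental hashing pattern being described looks roughly like this (a generic sketch, not the exact test code in this PR):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

internal static class ChunkedHash
{
    internal static byte[] HashChunks(IEnumerable<byte[]> chunks)
    {
        // Feed the data one chunk at a time; TransformFinalBlock completes the hash.
        // SHA256.HashData is only an option when the whole payload sits in a single buffer.
        using SHA256 hash = SHA256.Create();
        foreach (byte[] chunk in chunks)
        {
            hash.TransformBlock(chunk, 0, chunk.Length, null, 0);
        }
        hash.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
        return hash.Hash!;
    }
}
```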

Comment on lines +545 to +556
public override long Position
{
get
{
return _innerStream.Position;
}

set
{
_innerStream.Position = value;
}
}
Member

Suggested change
public override long Position
{
get
{
return _innerStream.Position;
}
set
{
_innerStream.Position = value;
}
}
public override long Position
{
get => _innerStream.Position;
set => _innerStream.Position = value;
}

byte[] dataToCopy = RandomNumberGenerator.GetBytes(bufferSize);
byte[] srcHash = SHA256.Create().ComputeHash(dataToCopy);

var clientOptions = new SslClientAuthenticationOptions() { TargetHost = "localhost", };
Member

Suggested change
var clientOptions = new SslClientAuthenticationOptions() { TargetHost = "localhost", };
var clientOptions = new SslClientAuthenticationOptions() { TargetHost = "localhost" };

internal int _internalOffset; // This is offset where encrypted data starts.
internal int _internalBufferCount; // Length of encrypted data.
internal int _decryptedBytesOffset; // Offset of decrypted data relative to _internalBuffer.
internal int _decryptedBytesCount; // Number of already decrypted bytes. Really valid only ion combination with _rentedBuffer.
Member

Suggested change
internal int _decryptedBytesCount; // Number of already decrypted bytes. Really valid only ion combination with _rentedBuffer.
internal int _decryptedBytesCount; // Number of already decrypted bytes. Really valid only in combination with _rentedBuffer.

@geoffkizer
Contributor

It feels to me like this PR is mixing together several changes that we should instead implement and measure independently.

In particular:
(1) Process multiple TLS frames in Read (#49000) -- This has little downside that I can see and we should just do it.
(2) Avoid extra copy of decrypted data on Linux/MacOS -- This seems like goodness, but it complicates the logic quite a bit here as it now works very differently for Linux vs Windows, and not just at the PAL layer. (In fact the PAL layer now works very differently for Linux and Windows, and that seems bad.) I think this would be nice to do, but I'd like to (a) do it in a way that doesn't introduce as much complexity, and (b) measure that there is actually a win here.
(3) Use user-provided buffer instead of internal rented buffer when possible. Again, this seems like goodness, but it also seems to introduce a lot of complexity. It's also not clear to me how much of a win this is, since in practice we are going to end up allocating the internal buffer in a lot of cases anyway.

I think it would be better to tackle each one of these independently (starting with (1)), validate the win in benchmarks or user scenarios etc, rinse and repeat.

Thoughts?

Also:

  • Using Span is definitely good, whether in the PAL or elsewhere -- this would have avoided that other issue we found recently
  • I'm fine deleting the SchSendAux tests
  • The RandomStream stuff looks cool!

@geoffkizer
Contributor

In particular for (2):

I think we could have two functions at the PAL layer -- something like DecryptToBuffer and DecryptInPlace.

On Linux, DecryptInPlace would just call DecryptToBuffer with the original buffer.
On Windows, we'd always decrypt in place (since we have to) and then DecryptToBuffer would do the copy to the target buffer.
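
A rough sketch of what that split could look like (the names and signatures here are illustrative only, not the actual PAL surface):

```csharp
using System;

// Illustrative-only shape of the proposed PAL split.
internal static class SslPalSketch
{
    // Unix-like PALs decrypt straight into 'destination'; the Windows implementation
    // would decrypt 'buffer' in place (Schannel requires it) and then copy the
    // plaintext into 'destination'. Returns the number of plaintext bytes produced.
    public static int DecryptToBuffer(Span<byte> buffer, Span<byte> destination)
        => throw new NotImplementedException();

    // Windows decrypts within 'buffer' natively; on Unix this could simply forward to
    // DecryptToBuffer using a slice of 'buffer' itself as the destination.
    public static int DecryptInPlace(Span<byte> buffer, out int plaintextOffset)
        => throw new NotImplementedException();
}
```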

Thoughts?

@geoffkizer
Contributor

I also wonder how some of these (particularly 2 and 3) would relate to implementing a custom BIO, like we have discussed in the past. Is this something we are still considering? How would we prioritize it relative to other stuff here?

@wfurt
Member Author

wfurt commented Mar 18, 2021

macOS, for example, does not use a BIO (and we will need to rewrite it when we go for ALPN & TLS 1.3).
We may consider it for Linux, but AFAIK the perf improvement was small when we tried it last time.

2 & 3 are essentially the same code. For the sake of this PR I can roll that back. We could split Decrypt*() into two parts, but Windows always decrypts in place and Unix always copies. I was originally thinking about hiding the difference in the Windows PAL, but we may end up with an unnecessary copy. I'll think about it more and we can discuss details on the PR.

@geoffkizer
Contributor

macOS, for example, does not use a BIO (and we will need to rewrite it when we go for ALPN & TLS 1.3).

Yeah, but we don't really care about perf for macOS. I mean if we can get a win on macOS too, then great, but it's not a priority.

We may consider it for Linux, but AFAIK the perf improvement was small when we tried it last time.

As I understand it, we are copying all input data into the memory BIO today. A custom BIO would allow us to avoid this copy, and possibly have other benefits as well. It seems like the benefit of that would by definition be more significant than just (2) above. So if the BIO win is insignificant, then (2) is as well; conversely, if we think there may be a win here, then the BIO win seems like it would be larger than (2).

2 & 3 are essentially the same code.

I don't understand this. (3) applies to both Linux and Windows, right? Whereas (2) is only a benefit for Linux.

I was originally thinking about hiding the difference in the Windows PAL, but we may end up with an unnecessary copy. I'll think about it more and we can discuss details on the PR.

Sounds good. In general we should strive to encapsulate the platform differences in the PAL. This is not always possible, but if it's not then at the least we need to expose different plat-specific functions from the PAL, as opposed to having one that looks like it's platform neutral but actually is not.

@alnikola
Contributor

I suspect it might be causing a perf regression in the YARP HTTPS (HTTP 1.1) scenario. Let's investigate that before merging.

@davidfowl
Member

I suspect it might be causing a perf regression in the YARP HTTPS (HTTP 1.1) scenario. Let's investigate that before merging.

What's causing a regression?

@alnikola
Contributor

This PR. I tried running YARP with a privately built System.Net.Security.dll and saw a perf drop, along with some bad responses, in the YARP HTTPS - HTTP (HTTP 1.1) scenario on Windows.

@wfurt wfurt marked this pull request as draft May 11, 2021 19:18
@ghost ghost closed this Jun 10, 2021
@ghost

ghost commented Jun 10, 2021

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@davidfowl
Member

😢

@wfurt
Member Author

wfurt commented Jun 11, 2021

It is still coming, but slowly, @davidfowl.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 11, 2021
@karelz karelz added this to the 6.0.0 milestone Jul 15, 2021
@davidfowl
Member

@wfurt Can we revive this?

@wfurt
Member Author

wfurt commented Sep 26, 2022

Yes, it is on my todo list for .NET 8, @davidfowl. I think I need a better plan for Windows.

@davidfowl
Member

Is there an issue tracking it?

This pull request was closed.