SslStream reads TLS records one frame at a time #49000
Tagging subscribers to this area: @dotnet/ncl, @vcsjones

Issue Details

Description

See dotnet/aspnetcore#30545 (comment) for more background on the issue. Kestrel does 4K buffer writes to SslStream on the server side, and this results in only being able to read 4K buffers on the client side. The number of TLS frames shouldn't affect how they are read on the client side. The extra cost here ends up being the number of reads that need to be performed client side to quickly read data (paying for lots of async machinery).

Configuration

I tested on Windows. I haven't tried on Linux, but I haven't found any code that makes me believe this is a Windows-specific problem.

Regression?

No, it seems to happen on .NET Core 3.1 as well.

Analysis

The loop here ends as soon as we have decrypted any data:

runtime/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs
Line 766 in 79ae74f
Sample Program

using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;
using System.Threading;
using System.Threading.Tasks;
var pair = DuplexPipe.CreateConnectionPair(new PipeOptions(), new PipeOptions());
var server = new SslStream(new DuplexPipeStream(pair.Application.Input, pair.Application.Output));
var client = new SslStream(new DuplexPipeStream(pair.Transport.Input, pair.Transport.Output));
async Task Server()
{
var certString = "MIIJrwIBAzCCCWsGCSqGSIb3DQEHAaCCCVwEgglYMIIJVDCCBfoGCSqGSIb3DQEHAaCCBesEggXnMIIF4zCCBd8GCyqGSIb3DQEMCgECoIIE/jCCBPowHAYKKoZIhvcNAQwBAzAOBAiaFr6Goa0S+gICB9AEggTY3sKr0dEZ+MINnW2jAcY2aSJTv1pyzQdBtUIEJO38ygeoTAhNCJcXGiyQGu8WZXid5wxgjs6gztSfK/chFvOq9O3ZG378ygBssFcY4vjFYwFddg6seMGMlxm/su6PPCvnRWX/YQ/YOgsHeEKVoHiLLPze05WRzNlt1OPjvmrgQUhY0drVXGdXXxqfPP/f+rMluCOA6qZUJqrMLQR4Eso9hyJhCi9ubQq61HTR+ZDFWjM/yofXe6y3SpwdCrr5zKz7Uu2BVq8R5zO50tjAHwo8znJkLmnrEN2rDUMGkAHMPyt4J4TwPtTxR7Ilk0+An6eriIBsx/tIDZnrlLva7qVHEh6z9LdPCa5u/4qFvisbkvJ+T68C7BX3e6TWLL7VZ4rbVMMlnv7YBebpDkQvrVoZ5V3GLsgKMY6a1ivlj9QFlZs7Oea0xyeFMMX4fh5N6j8Ap9nsySEbJZ6Kw6PuVqfwNfr3TjLlPHCDNPHv5q/Xmfdisgq9w1M1HLP9xkYrBhUu7pV/IklBfdGn4A8okI5AeAuF9Aumugq3IyOc3hpxwHp9T7821xDbYejXHGI30Ehg6N3ASl71bm99OXrh80NHt2Ogqo2OyeUEBIECIbnnEluEPsm3Lsgq5NP81VFOVdTbS/nhYMYLPJVNQIARMam00WWcx6HgpaR3NZkw1bR9njbA4MRPFkMmZRbEevN2/9myoCSN796/JO6V9itH18bbHA5JS1l5mRA0xarjaS8mz9J8fX2JOrxN5U/yBjZkrt/VxEkzmlFQvnpsOp63cnnEsgrjw2MC/7uZ2tRNpcBON3cjqPDpxoiyG+kOq/aAu0p7XJGkGbiXUyjkvwKZUbVktBETU7199oj2zPU9h2wcQ52NeiWEa8b5FveCjXwRz5kUp0VzbBlw1/4STyANZ/YQ6CoQ4bm/bLb+p8/SrjheTe2D2FDE19ZsEPzHFWHFYFlfKlOMed6+rZZnKCHT6wijRHyex9C4otOiQ9cDhj70x87sAWnTN9nZbcaczl7g3QsmOdRjLBqnrh5cQCOrPhDPPCwR0CocH035ASnoJByd4WiZeW19Bz27K78tuT7vlOT6fHyf4EQjhE/4YsgCmBKIwS6upRECheYo8FG8Gg5eY8mjk9p3i9DHjKxwh/de3U/s/0HEEz3Vj75yqTs6NjQYiD1BckKevm2B9Pq3Lb1ZpVOULJddiiMlWGu37nmzOTA1968CxkMJhvpPDvTwOpOnnv83H3xs6iJ7f7rV8jerCQxelNuRfa5E26XUbAP2/HkE5XDW/oA0e4f74fqLvyTUrw3IZwSgQN9Y0/FDrAHIVNpxiZ5LxjEg9EnVhfBh1vPw5LwcCgUuQ++rhSoW9WD5pygp8R1alDWprxbJmOclHt5uIcS7JHih8sonPXjUSXe9BWdSFcxsfBRJo8vRfeUj8Jt3x7HeqfoSDaAhSdZ3jtthkschDzMbI8KGKXZLgDaD3CJo+MZNe+cg5zmTfhYpHjKJW41P6XNnJLSxV5YfugLF9vE+JUEea17bOTiwxyTLqrCv3UuaW2YxDpK+5/i2hYalJJ0kwLlX0CJmcScpyp9dkgnPl9mwlt4exxBHY5WxMMq7f4z454CpmpJmuvjJCTDbS/dGGb+/DLxp7k51m3VJ1B40MeIweDGBzTATBgkqhkiG9w0BCRUxBgQEAQAAADBXBgkqhkiG9w0BCRQxSh5IADQAYgBhADMAOABlADAAMwAtAGEAOQBhAGMALQA0ADMAMAA3AC0AOAA5ADcAYQAtADQAMwBhADMAYwA0ADgANQBhADQANwA4MF0GCSsGAQQBgjcRAT" +
"FQHk4ATQBpAGMAcgBvAHMAbwBmAHQAIABTAHQAcgBvAG4AZwAgAEMAcgB5AHAAdABvAGcAcgBhAHAAaABpAGMAIABQAHIAbwB2AGkAZABlAHIwggNSBgkqhkiG9w0BBwGgggNDBIIDPzCCAzswggM3BgsqhkiG9w0BDAoBA6CCAw8wggMLBgoqhkiG9w0BCRYBoIIC+wSCAvcwggLzMIIB36ADAgECAhAytp0sn8UPr08doiEdX3DhMAkGBSsOAwIdBQAwFDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTE1MDkxNjA2MjEwOVoXDTE2MDkxNjA2MjEwOFowFDESMBAGA1UEAxMJbG9jYWxob3N0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAmlGZoaDWCOXZWDXlwaUJjGvfdpUpxYdoGxiHBiUowr/l+fjFq+xpqgjS/5VSXd64qq3ErofOc7yUrTFANMdZG6dVfzRaoHNXOZQjmvbGMOOOJQw4zc/Bt5DTqyeRt462gxLP/HkIE45NE8qRvsH6uU1BN1q/pHxsnkaeURqFcNgVR1f7BQfobj1hpneEBIYh75GRpSUo02VS5gGDc+B/sWB2fns7m05rnquYmpAQRQ7NyvoVn3JYj/EFN7RpdltqDytveMV50SSqLnKbcEmzBlRZWLXfdFNJjIwMBvI4z9nGkJFMvmn0qFAuPH+5WEZXN5F6OU83EQ0oAURfnJkEcwIDAQABo0kwRzBFBgNVHQEEPjA8gBAZzi0Ai4TG1VvIcjfkv3KIoRYwFDESMBAGA1UEAxMJbG9jYWxob3N0ghAytp0sn8UPr08doiEdX3DhMAkGBSsOAwIdBQADggEBAHGylUp5KGHHIlfAdmgcGZ89xtE5hpP0Wsd8KlE+HKfIDbt5SB7YBGl8gEMUAGMnVvBkRHtLmTJO0Ez3VOmTLCAOywOkduFXNsOjVkwPyCUnLlz7EriNpNmPT9ZWTVQdFV7FRXhjEgRXLCX+tHz1IF677MIx6kL47AmVKnr+9g+Fr5BfZDkNZpapJK3mflIdaM14jOz6rsqsWsJMCOlQPKcltq7BTKgIquNPDehHyFQjbMvMZg9Kf/YbpuJKFpCzfi+uiMyc5StBRVIF2buKQnvCiuTe6HUE6ZcrJxsEx7dSJ/lG5h3gPsfoNWxnTiomWV/Zp2E2YwfavjPhWjrBwrIxFTATBgkqhkiG9w0BCRUxBgQEAQAAADA7MB8wBwYFKw4DAhoEFK1iGPraW4Gf5vcA8VhmKYAjnMYSBBQ3NJrH26kpqUt0uX/tsyJX2WCqKAICB9A=";
var cert = new X509Certificate2(Convert.FromBase64String(certString), "testPassword");
await server.AuthenticateAsServerAsync(cert);
// Write 10,500 100-byte frames
for (int i = 0; i < 10500; i++)
{
await server.WriteAsync(new byte[100]);
}
}
async Task Client()
{
await client.AuthenticateAsClientAsync(new SslClientAuthenticationOptions
{
TargetHost = "localhost",
RemoteCertificateValidationCallback = (sender, certificate, chain, sslPolicyErrors) => true
});
// 1MB
var buffer = new byte[0x100000];
do
{
var read = await client.ReadAsync(buffer).ConfigureAwait(false);
if (read == 0)
{
break;
}
Console.WriteLine("Read: " + read);
}
while (true);
}
var s1 = Server();
var s2 = Client();
await Task.WhenAll(s1, s2);
internal class DuplexPipe : IDuplexPipe
{
public DuplexPipe(PipeReader reader, PipeWriter writer)
{
Input = reader;
Output = writer;
}
public PipeReader Input { get; }
public PipeWriter Output { get; }
public static DuplexPipePair CreateConnectionPair(PipeOptions inputOptions, PipeOptions outputOptions)
{
var input = new Pipe(inputOptions);
var output = new Pipe(outputOptions);
var transportToApplication = new DuplexPipe(output.Reader, input.Writer);
var applicationToTransport = new DuplexPipe(input.Reader, output.Writer);
return new DuplexPipePair(applicationToTransport, transportToApplication);
}
// This class exists to work around issues with value tuple on .NET Framework
public readonly struct DuplexPipePair
{
public IDuplexPipe Transport { get; }
public IDuplexPipe Application { get; }
public DuplexPipePair(IDuplexPipe transport, IDuplexPipe application)
{
Transport = transport;
Application = application;
}
}
}
internal class DuplexPipeStream : Stream
{
private readonly PipeReader _input;
private readonly PipeWriter _output;
private readonly bool _throwOnCancelled;
private volatile bool _cancelCalled;
public DuplexPipeStream(PipeReader input, PipeWriter output, bool throwOnCancelled = false)
{
_input = input;
_output = output;
_throwOnCancelled = throwOnCancelled;
}
public void CancelPendingRead()
{
_cancelCalled = true;
_input.CancelPendingRead();
}
public override bool CanRead => true;
public override bool CanSeek => false;
public override bool CanWrite => true;
public override long Length
{
get
{
throw new NotSupportedException();
}
}
public override long Position
{
get
{
throw new NotSupportedException();
}
set
{
throw new NotSupportedException();
}
}
public override long Seek(long offset, SeekOrigin origin)
{
throw new NotSupportedException();
}
public override void SetLength(long value)
{
throw new NotSupportedException();
}
public override int Read(byte[] buffer, int offset, int count)
{
// ValueTask uses .GetAwaiter().GetResult() if necessary
// https://github.com/dotnet/corefx/blob/f9da3b4af08214764a51b2331f3595ffaf162abe/src/System.Threading.Tasks.Extensions/src/System/Threading/Tasks/ValueTask.cs#L156
return ReadAsyncInternal(new Memory<byte>(buffer, offset, count), default).Result;
}
public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken = default)
{
return ReadAsyncInternal(new Memory<byte>(buffer, offset, count), cancellationToken).AsTask();
}
public override ValueTask<int> ReadAsync(Memory<byte> destination, CancellationToken cancellationToken = default)
{
return ReadAsyncInternal(destination, cancellationToken);
}
public override void Write(byte[] buffer, int offset, int count)
{
WriteAsync(buffer, offset, count).GetAwaiter().GetResult();
}
public override async Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
{
if (buffer != null)
{
_output.Write(new ReadOnlySpan<byte>(buffer, offset, count));
}
await _output.FlushAsync(cancellationToken);
}
public override async ValueTask WriteAsync(ReadOnlyMemory<byte> source, CancellationToken cancellationToken = default)
{
_output.Write(source.Span);
await _output.FlushAsync(cancellationToken);
}
public override void Flush()
{
FlushAsync(CancellationToken.None).GetAwaiter().GetResult();
}
public override Task FlushAsync(CancellationToken cancellationToken)
{
return WriteAsync(null, 0, 0, cancellationToken);
}
private async ValueTask<int> ReadAsyncInternal(Memory<byte> destination, CancellationToken cancellationToken)
{
while (true)
{
var result = await _input.ReadAsync(cancellationToken);
var readableBuffer = result.Buffer;
try
{
if (_throwOnCancelled && result.IsCanceled && _cancelCalled)
{
// Reset the bool
_cancelCalled = false;
throw new OperationCanceledException();
}
if (!readableBuffer.IsEmpty)
{
// buffer.Count is int
var count = (int)Math.Min(readableBuffer.Length, destination.Length);
readableBuffer = readableBuffer.Slice(0, count);
readableBuffer.CopyTo(destination.Span);
return count;
}
if (result.IsCompleted)
{
return 0;
}
}
finally
{
_input.AdvanceTo(readableBuffer.End, readableBuffer.End);
}
}
}
public override IAsyncResult BeginRead(byte[] buffer, int offset, int count, AsyncCallback callback, object state)
{
throw new NotSupportedException();
}
public override int EndRead(IAsyncResult asyncResult)
{
throw new NotSupportedException();
}
public override IAsyncResult BeginWrite(byte[] buffer, int offset, int count, AsyncCallback callback, object state)
{
throw new NotSupportedException();
}
public override void EndWrite(IAsyncResult asyncResult)
{
throw new NotSupportedException();
}
}
Yeah, I agree, we could do better here. That said, I'm not sure how much of a potential win there is in practice. As you say above, it's mostly about avoiding extra async machinery. Should be relatively straightforward to measure.

The issue filed in Kestrel came from a customer who experienced slow downloads. My guess is that this also affects YARP performance for large responses. We should measure regardless, though.

We do see a drop-off in performance in YARP compared to Envoy for the 10k response size. I'm wondering if this is a factor there?

Triage: @wfurt wants to do a BenchmarkDotNet test to see how much we could gain. Based on the estimated impact, we can decide if it is worth it in 6.0.

I don't really see any downsides to doing this, assuming it won't need a huge code refactor, and so long as we don't introduce extra reads on the base stream.

I did a quick PoC to bundle the frames and still got only frame.Length of data back from OpenSSL. So there may be a little bit more trickery and possibly OS dependency needed.
What do you mean? You can't get the next frame after reading the first one? |
@wfurt can you post the code?
So it goes like this, @geoffkizer. When we feed multiple frames to OpenSSL, they get queued in the input BIO (note that outBuffer is also the actual in/out buffer):

runtime/src/libraries/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.OpenSsl.cs
Line 359 in e422345
When we try to read from the SSL, it returns data only from the first frame:

runtime/src/libraries/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.OpenSsl.cs
Lines 366 to 369 in e422345
So the fix (in my prototype) was to check whether the input BIO has any unprocessed data and turn the crank again. With that, I get what @davidfowl wants:
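The "turn the crank again" idea can be sketched roughly like this. This is a simulation with hypothetical stand-ins for the OpenSSL interop (a queue plays the role of the input BIO, and the decrypt call returns at most one frame per invocation, as described above); it is not the actual SslStream code:

```csharp
using System;
using System.Collections.Generic;

class CrankSketch
{
    // Hypothetical stand-ins for the input BIO and the per-frame decrypt call.
    private readonly Queue<byte[]> _inputBio = new Queue<byte[]>();

    private bool InputBioHasPendingData() => _inputBio.Count > 0;

    // OpenSSL-style behavior: each call yields at most one decrypted frame.
    private int DecryptOneFrame(Span<byte> dest)
    {
        if (_inputBio.Count == 0) return 0;
        byte[] frame = _inputBio.Dequeue();
        frame.CopyTo(dest);
        return frame.Length;
    }

    public void Feed(byte[] frame) => _inputBio.Enqueue(frame);

    // The prototype fix: after a successful decrypt, keep turning the crank
    // while the input BIO still reports unprocessed data and the caller's
    // buffer has room, so one read can surface multiple frames.
    public int Read(Span<byte> buffer)
    {
        int total = 0;
        do
        {
            int n = DecryptOneFrame(buffer.Slice(total));
            if (n <= 0) break;
            total += n;
        }
        while (InputBioHasPendingData() && total < buffer.Length);
        return total;
    }
}
```

With three buffered 100-byte frames, a single `Read` with a 4K buffer returns all 300 bytes instead of stopping after the first frame.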
I see similar behavior on Windows. There was some trickery, as there is a somewhat obscure mechanism to report unprocessed data. Part of the risk is the fact that data is buffered outside of SslStream's visibility. Without the added retry, the data would still be consumable, but processing would hang because SslStream did not see any new data. We will need to be careful about it.
It kind of makes sense that SChannel/OpenSSL only process a single frame at a time. If they processed more than one, it would be harder to report the results. We know the frame size in SslStream, so we could just feed a single frame at a time to SChannel/OpenSSL and (hopefully) not rely on their "unprocessed data" mechanism. That seems like it avoids most of the extra complication. Is that correct?
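For reference, the frame size mentioned here is available from the TLS record header itself: every record starts with a 5-byte header (content type, protocol version, payload length), with the payload length in the last two bytes, big-endian. A minimal sketch (hypothetical helper name, not SslStream's actual internals):

```csharp
using System;

static class TlsFrameHeader
{
    // A TLS record header is 5 bytes: content type (1), protocol version (2),
    // and payload length (2, big-endian). The full frame on the wire is the
    // header plus the payload.
    public static int GetFrameSize(ReadOnlySpan<byte> header)
    {
        if (header.Length < 5)
        {
            return -1; // need more bytes before the frame size is known
        }
        return 5 + ((header[3] << 8) | header[4]);
    }
}
```

With that, SslStream could buffer exactly one frame and hand it to SChannel/OpenSSL, rather than depending on their pending-data reporting.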
I tried to write a benchmark to see if there are any measurable improvements, but it is not stable enough. I tried to increase the loop counts and fiddle with various parameters, but the variations are still much larger than the measurable impact, e.g. ±10%.

I did not observe a difference with the change on the other tests. Since the change is pretty small, and I could see the possibility of an impact on system calls (e.g. with a transport other than memory), I'm inclined to move forward even if we cannot demonstrate the improvement. cc: @stephentoub for any additional thoughts or hints on the benchmark. Linux changes can be viewed here: https://github.com/dotnet/runtime/compare/main...wfurt:bulkTls?expand=1
This correlates to the BioCtrlPending call?
How will this impact syscalls? I thought this was purely about avoiding traversing a few layers of calls to get back to this place again. It doesn't change how many reads or writes we issue on the underlying stream, though, right?

Can you share it?
https://github.com/dotnet/performance/compare/master...wfurt:sslBulk?expand=1

As far as the syscalls: when somebody reads data from the stream, they may do something with it, like write to another descriptor, and that can possibly be visible on TCP as well. The overhead of traversing layers is negligible IMHO compared to the cost of the actual crypto. BioCtrlPending should be cheap on a memory BIO:

case BIO_CTRL_PENDING:
    ret = (long)bm->length;
    break;

so it is mostly the cost of the call. This can possibly be different on other platforms.
I see. The argument is that the caller can't do the same thing, because they don't know whether an attempt to fill their buffer after the first call will result in additional I/O, or whether the SslStream already has decryptable data buffered. Ok, that's reasonable.

Can you try running the ASP.NET benchmarks with it, e.g. the ones for YARP?
There's more overhead with wrapping streams, as you pay the async overhead per read. For example, in the linked issue (dotnet/aspnetcore#30545), on HttpClient you pay for: ContentLengthReadStream.ReadAsync -> HttpConnection.ReadAsync -> SslStream.ReadAsync. On the server side you pay for: HttpRequestStream -> Http1ContentLengthMessageBody -> SslStream.
In practice, does that show up as measurable, in particular when compared to encryption/decryption costs?
You also need to look at the cost of async reads from the underlying Stream, right? This may result in more allocations as well, not just CPU cost.

Why would the presence of this change result in more or fewer reads on the underlying stream?
Assume the incoming user buffer is 1MB: you could imagine reading directly into the user buffer and decrypting in place, then copying the remainder into the internal buffer.
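That idea can be sketched as follows. This is purely illustrative (fixed-size frames and a no-op "decrypt" stand in for real TLS records, whose sizes come from each record header), not SslStream's actual implementation: fill the caller's large buffer with ciphertext, process the complete frames in place, and carry the trailing partial frame over for the next read.

```csharp
using System;

static class InPlaceSketch
{
    // Assumed fixed frame size for illustration: 5-byte header + 100-byte payload.
    public const int FrameSize = 105;

    // Given the number of ciphertext bytes read into the user's buffer,
    // "decrypt" the complete frames in place (a no-op placeholder here) and
    // stash the trailing partial frame so the next read can complete it.
    // Returns the number of bytes immediately returnable to the caller.
    public static int Read(byte[] userBuffer, int bytesRead, out byte[] internalBuffer)
    {
        int complete = bytesRead - (bytesRead % FrameSize);
        internalBuffer = new byte[bytesRead - complete];
        Array.Copy(userBuffer, complete, internalBuffer, 0, internalBuffer.Length);
        return complete;
    }
}
```

The appeal is that one large read from the transport can replace many small ones; the cost is the bookkeeping for the partial frame left at the tail.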
I'm missing something :) How does your scenario impact the number of reads on the underlying stream?

The current fix doesn't, but in this scenario, you can do bigger reads on the underlying Stream because you have a bigger buffer 😄 (or maybe I'm missing something)
Ok, sure, that just seems orthogonal to this issue. :) (Of course, if you were doing reads directly into the user's buffer, then you're more likely to want to aggressively decrypt what was in it, as you need to copy out of it anything that wasn't decrypted, and have room to store it.)

Anyway, as noted above, I'd be interested in us running this through some real benchmarks (e.g. YARP), but the change is small and localized, and it's hard to come up with a reason why it would make things worse in any real use, so it should be fine to do if we suspect it might help.
Most server CPUs have dedicated AES instructions and can encrypt and decrypt at more than 10 Gbps per core; an AMD Ryzen 7 1800X can do AES-128-GCM at 24 Gbps per core and AES-256-GCM at 21 Gbps per core.*

*Non-hyperthreaded cores: https://calomel.org/aesni_ssl_performance.html
As I mentioned above:
This seems like it would avoid the call to BioCtrlPending entirely and simplify the code here.
So basically loop inside ReadAsyncInternal and pay the overhead of multiple internal calls, @geoffkizer? The benefit would be observed externally, as ReadAsync would yield bigger chunks of data?
Maybe use loopback instead of pipelines. (I should have done that 😄)

There is also a test using NamedPipe. That may better represent reality, as it does system calls and context switches.
Yeah, exactly. I believe we will end up calling OpenSSL/SChannel the same number of times anyway, since they will only process one frame at a time. So it's really just about where we loop in our code, and I doubt that matters much in terms of perf.
Re perf testing, this may be useful: https://github.com/geoffkizer/netperf/tree/master/SocketPerfTest
This issue started with a customer complaint about slow file downloads in ASP.NET Core. I understand we want to isolate the change to avoid noise, but maybe we should use the customer example as a way to vet the change?

Do we have that code? Or are you saying that the code above is that code?

There's some code in the linked issue, and there's also a scenario in YARP outlined there that might be interesting.