Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSlStream reads TLS records one frame at a time #49000

Closed
davidfowl opened this issue Mar 2, 2021 · 37 comments · Fixed by #50815
Closed

SSlStream reads TLS records one frame at a time #49000

davidfowl opened this issue Mar 2, 2021 · 37 comments · Fixed by #50815
Assignees
Labels
area-System.Net.Security enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Milestone

Comments

@davidfowl
Copy link
Member

davidfowl commented Mar 2, 2021

Description

See dotnet/aspnetcore#30545 (comment) for more background on the issue. Kestrel does 4K buffer writes to SslStream on the server side and this results in only being able to read 4K buffers on the client side. The amount of TLS frames shouldn't affect how they are read on the client side. The extra cost here ends up being the amount of reads that need to be performed client side to quickly read data (paying for lots of async machinery)

Configuration

I tested on Windows, I haven't tried on linux but haven't found any code that makes me believe this is a windows specific problem.

Regression?

Nope, seems to happen on .NET Core 3.1 as well.

Analysis

The loop here ends as soon as we have decrypted any data

. Though this isn't a stream contract requirement, the loop should end in 3 cases:

  1. We don't have another TLS frame and haven't seen EOF.
  2. We have seen EOF
  3. We have run out of space to fill the user's buffer.

Sample Program

using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;
using System.Threading;
using System.Threading.Tasks;

var pair = DuplexPipe.CreateConnectionPair(new PipeOptions(), new PipeOptions());

var server = new SslStream(new DuplexPipeStream(pair.Application.Input, pair.Application.Output));
var client = new SslStream(new DuplexPipeStream(pair.Transport.Input, pair.Transport.Output));

async Task Server()
{
    var certString = "MIIJrwIBAzCCCWsGCSqGSIb3DQEHAaCCCVwEgglYMIIJVDCCBfoGCSqGSIb3DQEHAaCCBesEggXnMIIF4zCCBd8GCyqGSIb3DQEMCgECoIIE/jCCBPowHAYKKoZIhvcNAQwBAzAOBAiaFr6Goa0S+gICB9AEggTY3sKr0dEZ+MINnW2jAcY2aSJTv1pyzQdBtUIEJO38ygeoTAhNCJcXGiyQGu8WZXid5wxgjs6gztSfK/chFvOq9O3ZG378ygBssFcY4vjFYwFddg6seMGMlxm/su6PPCvnRWX/YQ/YOgsHeEKVoHiLLPze05WRzNlt1OPjvmrgQUhY0drVXGdXXxqfPP/f+rMluCOA6qZUJqrMLQR4Eso9hyJhCi9ubQq61HTR+ZDFWjM/yofXe6y3SpwdCrr5zKz7Uu2BVq8R5zO50tjAHwo8znJkLmnrEN2rDUMGkAHMPyt4J4TwPtTxR7Ilk0+An6eriIBsx/tIDZnrlLva7qVHEh6z9LdPCa5u/4qFvisbkvJ+T68C7BX3e6TWLL7VZ4rbVMMlnv7YBebpDkQvrVoZ5V3GLsgKMY6a1ivlj9QFlZs7Oea0xyeFMMX4fh5N6j8Ap9nsySEbJZ6Kw6PuVqfwNfr3TjLlPHCDNPHv5q/Xmfdisgq9w1M1HLP9xkYrBhUu7pV/IklBfdGn4A8okI5AeAuF9Aumugq3IyOc3hpxwHp9T7821xDbYejXHGI30Ehg6N3ASl71bm99OXrh80NHt2Ogqo2OyeUEBIECIbnnEluEPsm3Lsgq5NP81VFOVdTbS/nhYMYLPJVNQIARMam00WWcx6HgpaR3NZkw1bR9njbA4MRPFkMmZRbEevN2/9myoCSN796/JO6V9itH18bbHA5JS1l5mRA0xarjaS8mz9J8fX2JOrxN5U/yBjZkrt/VxEkzmlFQvnpsOp63cnnEsgrjw2MC/7uZ2tRNpcBON3cjqPDpxoiyG+kOq/aAu0p7XJGkGbiXUyjkvwKZUbVktBETU7199oj2zPU9h2wcQ52NeiWEa8b5FveCjXwRz5kUp0VzbBlw1/4STyANZ/YQ6CoQ4bm/bLb+p8/SrjheTe2D2FDE19ZsEPzHFWHFYFlfKlOMed6+rZZnKCHT6wijRHyex9C4otOiQ9cDhj70x87sAWnTN9nZbcaczl7g3QsmOdRjLBqnrh5cQCOrPhDPPCwR0CocH035ASnoJByd4WiZeW19Bz27K78tuT7vlOT6fHyf4EQjhE/4YsgCmBKIwS6upRECheYo8FG8Gg5eY8mjk9p3i9DHjKxwh/de3U/s/0HEEz3Vj75yqTs6NjQYiD1BckKevm2B9Pq3Lb1ZpVOULJddiiMlWGu37nmzOTA1968CxkMJhvpPDvTwOpOnnv83H3xs6iJ7f7rV8jerCQxelNuRfa5E26XUbAP2/HkE5XDW/oA0e4f74fqLvyTUrw3IZwSgQN9Y0/FDrAHIVNpxiZ5LxjEg9EnVhfBh1vPw5LwcCgUuQ++rhSoW9WD5pygp8R1alDWprxbJmOclHt5uIcS7JHih8sonPXjUSXe9BWdSFcxsfBRJo8vRfeUj8Jt3x7HeqfoSDaAhSdZ3jtthkschDzMbI8KGKXZLgDaD3CJo+MZNe+cg5zmTfhYpHjKJW41P6XNnJLSxV5YfugLF9vE+JUEea17bOTiwxyTLqrCv3UuaW2YxDpK+5/i2hYalJJ0kwLlX0CJmcScpyp9dkgnPl9mwlt4exxBHY5WxMMq7f4z454CpmpJmuvjJCTDbS/dGGb+/DLxp7k51m3VJ1B40MeIweDGBzTATBgkqhkiG9w0BCRUxBgQEAQAAADBXBgkqhkiG9w0BCRQxSh5IADQAYgBhADMAOABlADAAMwAtAGEAOQBhAGMALQA0ADMAMAA3AC0AOAA5ADcAYQAtADQAMwBhADMAYwA0ADgANQBhADQANwA4MF0GCSsGAQQBgjcRATFQHk4ATQBpAGMAcgBvAHMAbwBmAHQAIABTAHQAcgBvAG4AZwAgAEMAcgB5AHAAdABvAGcAcgBhAHAAaABpAGMAIABQAHIAbwB2AGkAZABlAHIwggNSBgkqhkiG9w0BBwGgggNDBIIDPzCCAzswggM3BgsqhkiG9w0BDAoBA6CCAw8wggMLBgoqhkiG9w0BCRYBoIIC+wSCAvcwggLzMIIB36ADAgECAhAytp0sn8UPr08doiEdX3DhMAkGBSsOAwIdBQAwFDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTE1MDkxNjA2MjEwOVoXDTE2MDkxNjA2MjEwOFowFDESMBAGA1UEAxMJbG9jYWxob3N0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAmlGZoaDWCOXZWDXlwaUJjGvfdpUpxYdoGxiHBiUowr/l+fjFq+xpqgjS/5VSXd64qq3ErofOc7yUrTFANMdZG6dVfzRaoHNXOZQjmvbGMOOOJQw4zc/Bt5DTqyeRt462gxLP/HkIE45NE8qRvsH6uU1BN1q/pHxsnkaeURqFcNgVR1f7BQfobj1hpneEBIYh75GRpSUo02VS5gGDc+B/sWB2fns7m05rnquYmpAQRQ7NyvoVn3JYj/EFN7RpdltqDytveMV50SSqLnKbcEmzBlRZWLXfdFNJjIwMBvI4z9nGkJFMvmn0qFAuPH+5WEZXN5F6OU83EQ0oAURfnJkEcwIDAQABo0kwRzBFBgNVHQEEPjA8gBAZzi0Ai4TG1VvIcjfkv3KIoRYwFDESMBAGA1UEAxMJbG9jYWxob3N0ghAytp0sn8UPr08doiEdX3DhMAkGBSsOAwIdBQADggEBAHGylUp5KGHHIlfAdmgcGZ89xtE5hpP0Wsd8KlE+HKfIDbt5SB7YBGl8gEMUAGMnVvBkRHtLmTJO0Ez3VOmTLCAOywOkduFXNsOjVkwPyCUnLlz7EriNpNmPT9ZWTVQdFV7FRXhjEgRXLCX+tHz1IF677MIx6kL47AmVKnr+9g+Fr5BfZDkNZpapJK3mflIdaM14jOz6rsqsWsJMCOlQPKcltq7BTKgIquNPDehHyFQjbMvMZg9Kf/YbpuJKFpCzfi+uiMyc5StBRVIF2buKQnvCiuTe6HUE6ZcrJxsEx7dSJ/lG5h3gPsfoNWxnTiomWV/Zp2E2YwfavjPhWjrBwrIxFTATBgkqhkiG9w0BCRUxBgQEAQAAADA7MB8wBwYFKw4DAhoEFK1iGPraW4Gf5vcA8VhmKYAjnMYSBBQ3NJrH26kpqUt0uX/tsyJX2WCqKAICB9A=";
    var cert = new X509Certificate2(Convert.FromBase64String(certString), "testPassword");
    await server.AuthenticateAsServerAsync(cert);

    // Write 10500, 100 byte frames
    for (int i = 0; i < 10500; i++)
    {
        await server.WriteAsync(new byte[100]);
    }
}

async Task Client()
{
    await client.AuthenticateAsClientAsync(new SslClientAuthenticationOptions
    {
        TargetHost = "localhost",
        RemoteCertificateValidationCallback = (sender, certificate, chain, sslPolicyErrors) => true
    });

    // 1MB
    var buffer = new byte[0x100000];

    do
    {
        var read = await client.ReadAsync(buffer).ConfigureAwait(false);
        if (read == 0)
        {
            break;
        }

        Console.WriteLine("Read: " + read);
    }
    while (true);
}

var s1 = Server();
var s2 = Client();
await Task.WhenAll(s1, s2);


internal class DuplexPipe : IDuplexPipe
{
    public DuplexPipe(PipeReader reader, PipeWriter writer)
    {
        Input = reader;
        Output = writer;
    }

    public PipeReader Input { get; }

    public PipeWriter Output { get; }

    public static DuplexPipePair CreateConnectionPair(PipeOptions inputOptions, PipeOptions outputOptions)
    {
        var input = new Pipe(inputOptions);
        var output = new Pipe(outputOptions);

        var transportToApplication = new DuplexPipe(output.Reader, input.Writer);
        var applicationToTransport = new DuplexPipe(input.Reader, output.Writer);

        return new DuplexPipePair(applicationToTransport, transportToApplication);
    }

    // This class exists to work around issues with value tuple on .NET Framework
    public readonly struct DuplexPipePair
    {
        public IDuplexPipe Transport { get; }
        public IDuplexPipe Application { get; }

        public DuplexPipePair(IDuplexPipe transport, IDuplexPipe application)
        {
            Transport = transport;
            Application = application;
        }
    }
}

internal class DuplexPipeStream : Stream
{
    private readonly PipeReader _input;
    private readonly PipeWriter _output;
    private readonly bool _throwOnCancelled;
    private volatile bool _cancelCalled;

    public DuplexPipeStream(PipeReader input, PipeWriter output, bool throwOnCancelled = false)
    {
        _input = input;
        _output = output;
        _throwOnCancelled = throwOnCancelled;
    }

    public void CancelPendingRead()
    {
        _cancelCalled = true;
        _input.CancelPendingRead();
    }

    public override bool CanRead => true;

    public override bool CanSeek => false;

    public override bool CanWrite => true;

    public override long Length
    {
        get
        {
            throw new NotSupportedException();
        }
    }

    public override long Position
    {
        get
        {
            throw new NotSupportedException();
        }
        set
        {
            throw new NotSupportedException();
        }
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // ValueTask uses .GetAwaiter().GetResult() if necessary
        // https://github.com/dotnet/corefx/blob/f9da3b4af08214764a51b2331f3595ffaf162abe/src/System.Threading.Tasks.Extensions/src/System/Threading/Tasks/ValueTask.cs#L156
        return ReadAsyncInternal(new Memory<byte>(buffer, offset, count), default).Result;
    }

    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken = default)
    {
        return ReadAsyncInternal(new Memory<byte>(buffer, offset, count), cancellationToken).AsTask();
    }

    public override ValueTask<int> ReadAsync(Memory<byte> destination, CancellationToken cancellationToken = default)
    {
        return ReadAsyncInternal(destination, cancellationToken);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        WriteAsync(buffer, offset, count).GetAwaiter().GetResult();
    }

    public override async Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        if (buffer != null)
        {
            _output.Write(new ReadOnlySpan<byte>(buffer, offset, count));
        }

        await _output.FlushAsync(cancellationToken);
    }

    public override async ValueTask WriteAsync(ReadOnlyMemory<byte> source, CancellationToken cancellationToken = default)
    {
        _output.Write(source.Span);
        await _output.FlushAsync(cancellationToken);
    }

    public override void Flush()
    {
        FlushAsync(CancellationToken.None).GetAwaiter().GetResult();
    }

    public override Task FlushAsync(CancellationToken cancellationToken)
    {
        return WriteAsync(null, 0, 0, cancellationToken);
    }

    private async ValueTask<int> ReadAsyncInternal(Memory<byte> destination, CancellationToken cancellationToken)
    {
        while (true)
        {
            var result = await _input.ReadAsync(cancellationToken);
            var readableBuffer = result.Buffer;
            try
            {
                if (_throwOnCancelled && result.IsCanceled && _cancelCalled)
                {
                    // Reset the bool
                    _cancelCalled = false;
                    throw new OperationCanceledException();
                }

                if (!readableBuffer.IsEmpty)
                {
                    // buffer.Count is int
                    var count = (int)Math.Min(readableBuffer.Length, destination.Length);
                    readableBuffer = readableBuffer.Slice(0, count);
                    readableBuffer.CopyTo(destination.Span);
                    return count;
                }

                if (result.IsCompleted)
                {
                    return 0;
                }
            }
            finally
            {
                _input.AdvanceTo(readableBuffer.End, readableBuffer.End);
            }
        }
    }

    public override IAsyncResult BeginRead(byte[] buffer, int offset, int count, AsyncCallback callback, object state)
    {
        throw new NotSupportedException();
    }

    public override int EndRead(IAsyncResult asyncResult)
    {
        throw new NotSupportedException();
    }

    public override IAsyncResult BeginWrite(byte[] buffer, int offset, int count, AsyncCallback callback, object state)
    {
        throw new NotSupportedException();
    }

    public override void EndWrite(IAsyncResult asyncResult)
    {
        throw new NotSupportedException();
    }
}
@davidfowl davidfowl added the tenet-performance Performance related issue label Mar 2, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 2, 2021
@ghost
Copy link

ghost commented Mar 2, 2021

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

See dotnet/aspnetcore#30545 (comment) for more background on the issue. Kestrel does 4K buffer writes to SslStream on the server side and this results in only being able to read 4K buffers on the client side. The amount of TLS frames shouldn't affect how they are read on the client side. The extra cost here ends up being the amount of reads that need to be performed client side to quickly read data (paying for lots of async machinery

Configuration

I tested on Windows, I haven't tried on linux but haven't found any code that makes me believe this is a windows specific problem.

Regression?

Nope, seems to happen on .NET Core 3.1 as well.

Analysis

The loop here ends as soon as we have decrypted any data

. Though this isn't a stream contract requirement, the loop should end in 3 cases:

  1. We don't have another TLS frame and haven't seen EOF.
  2. We have seen EOF
  3. We have run out of space to fill the user's buffer.

Sample Program

using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;
using System.Threading;
using System.Threading.Tasks;

var pair = DuplexPipe.CreateConnectionPair(new PipeOptions(), new PipeOptions());

var server = new SslStream(new DuplexPipeStream(pair.Application.Input, pair.Application.Output));
var client = new SslStream(new DuplexPipeStream(pair.Transport.Input, pair.Transport.Output));

async Task Server()
{
    var certString = "MIIJrwIBAzCCCWsGCSqGSIb3DQEHAaCCCVwEgglYMIIJVDCCBfoGCSqGSIb3DQEHAaCCBesEggXnMIIF4zCCBd8GCyqGSIb3DQEMCgECoIIE/jCCBPowHAYKKoZIhvcNAQwBAzAOBAiaFr6Goa0S+gICB9AEggTY3sKr0dEZ+MINnW2jAcY2aSJTv1pyzQdBtUIEJO38ygeoTAhNCJcXGiyQGu8WZXid5wxgjs6gztSfK/chFvOq9O3ZG378ygBssFcY4vjFYwFddg6seMGMlxm/su6PPCvnRWX/YQ/YOgsHeEKVoHiLLPze05WRzNlt1OPjvmrgQUhY0drVXGdXXxqfPP/f+rMluCOA6qZUJqrMLQR4Eso9hyJhCi9ubQq61HTR+ZDFWjM/yofXe6y3SpwdCrr5zKz7Uu2BVq8R5zO50tjAHwo8znJkLmnrEN2rDUMGkAHMPyt4J4TwPtTxR7Ilk0+An6eriIBsx/tIDZnrlLva7qVHEh6z9LdPCa5u/4qFvisbkvJ+T68C7BX3e6TWLL7VZ4rbVMMlnv7YBebpDkQvrVoZ5V3GLsgKMY6a1ivlj9QFlZs7Oea0xyeFMMX4fh5N6j8Ap9nsySEbJZ6Kw6PuVqfwNfr3TjLlPHCDNPHv5q/Xmfdisgq9w1M1HLP9xkYrBhUu7pV/IklBfdGn4A8okI5AeAuF9Aumugq3IyOc3hpxwHp9T7821xDbYejXHGI30Ehg6N3ASl71bm99OXrh80NHt2Ogqo2OyeUEBIECIbnnEluEPsm3Lsgq5NP81VFOVdTbS/nhYMYLPJVNQIARMam00WWcx6HgpaR3NZkw1bR9njbA4MRPFkMmZRbEevN2/9myoCSN796/JO6V9itH18bbHA5JS1l5mRA0xarjaS8mz9J8fX2JOrxN5U/yBjZkrt/VxEkzmlFQvnpsOp63cnnEsgrjw2MC/7uZ2tRNpcBON3cjqPDpxoiyG+kOq/aAu0p7XJGkGbiXUyjkvwKZUbVktBETU7199oj2zPU9h2wcQ52NeiWEa8b5FveCjXwRz5kUp0VzbBlw1/4STyANZ/YQ6CoQ4bm/bLb+p8/SrjheTe2D2FDE19ZsEPzHFWHFYFlfKlOMed6+rZZnKCHT6wijRHyex9C4otOiQ9cDhj70x87sAWnTN9nZbcaczl7g3QsmOdRjLBqnrh5cQCOrPhDPPCwR0CocH035ASnoJByd4WiZeW19Bz27K78tuT7vlOT6fHyf4EQjhE/4YsgCmBKIwS6upRECheYo8FG8Gg5eY8mjk9p3i9DHjKxwh/de3U/s/0HEEz3Vj75yqTs6NjQYiD1BckKevm2B9Pq3Lb1ZpVOULJddiiMlWGu37nmzOTA1968CxkMJhvpPDvTwOpOnnv83H3xs6iJ7f7rV8jerCQxelNuRfa5E26XUbAP2/HkE5XDW/oA0e4f74fqLvyTUrw3IZwSgQN9Y0/FDrAHIVNpxiZ5LxjEg9EnVhfBh1vPw5LwcCgUuQ++rhSoW9WD5pygp8R1alDWprxbJmOclHt5uIcS7JHih8sonPXjUSXe9BWdSFcxsfBRJo8vRfeUj8Jt3x7HeqfoSDaAhSdZ3jtthkschDzMbI8KGKXZLgDaD3CJo+MZNe+cg5zmTfhYpHjKJW41P6XNnJLSxV5YfugLF9vE+JUEea17bOTiwxyTLqrCv3UuaW2YxDpK+5/i2hYalJJ0kwLlX0CJmcScpyp9dkgnPl9mwlt4exxBHY5WxMMq7f4z454CpmpJmuvjJCTDbS/dGGb+/DLxp7k51m3VJ1B40MeIweDGBzTATBgkqhkiG9w0BCRUxBgQEAQAAADBXBgkqhkiG9w0BCRQxSh5IADQAYgBhADMAOABlADAAMwAtAGEAOQBhAGMALQA0ADMAMAA3AC0AOAA5ADcAYQAtADQAMwBhADMAYwA0ADgANQBhADQANwA4MF0GCSsGAQQBgjcRATFQHk4ATQBpAGMAcgBvAHMAbwBmAHQAIABTAHQAcgBvAG4AZwAgAEMAcgB5AHAAdABvAGcAcgBhAHAAaABpAGMAIABQAHIAbwB2AGkAZABlAHIwggNSBgkqhkiG9w0BBwGgggNDBIIDPzCCAzswggM3BgsqhkiG9w0BDAoBA6CCAw8wggMLBgoqhkiG9w0BCRYBoIIC+wSCAvcwggLzMIIB36ADAgECAhAytp0sn8UPr08doiEdX3DhMAkGBSsOAwIdBQAwFDESMBAGA1UEAxMJbG9jYWxob3N0MB4XDTE1MDkxNjA2MjEwOVoXDTE2MDkxNjA2MjEwOFowFDESMBAGA1UEAxMJbG9jYWxob3N0MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAmlGZoaDWCOXZWDXlwaUJjGvfdpUpxYdoGxiHBiUowr/l+fjFq+xpqgjS/5VSXd64qq3ErofOc7yUrTFANMdZG6dVfzRaoHNXOZQjmvbGMOOOJQw4zc/Bt5DTqyeRt462gxLP/HkIE45NE8qRvsH6uU1BN1q/pHxsnkaeURqFcNgVR1f7BQfobj1hpneEBIYh75GRpSUo02VS5gGDc+B/sWB2fns7m05rnquYmpAQRQ7NyvoVn3JYj/EFN7RpdltqDytveMV50SSqLnKbcEmzBlRZWLXfdFNJjIwMBvI4z9nGkJFMvmn0qFAuPH+5WEZXN5F6OU83EQ0oAURfnJkEcwIDAQABo0kwRzBFBgNVHQEEPjA8gBAZzi0Ai4TG1VvIcjfkv3KIoRYwFDESMBAGA1UEAxMJbG9jYWxob3N0ghAytp0sn8UPr08doiEdX3DhMAkGBSsOAwIdBQADggEBAHGylUp5KGHHIlfAdmgcGZ89xtE5hpP0Wsd8KlE+HKfIDbt5SB7YBGl8gEMUAGMnVvBkRHtLmTJO0Ez3VOmTLCAOywOkduFXNsOjVkwPyCUnLlz7EriNpNmPT9ZWTVQdFV7FRXhjEgRXLCX+tHz1IF677MIx6kL47AmVKnr+9g+Fr5BfZDkNZpapJK3mflIdaM14jOz6rsqsWsJMCOlQPKcltq7BTKgIquNPDehHyFQjbMvMZg9Kf/YbpuJKFpCzfi+uiMyc5StBRVIF2buKQnvCiuTe6HUE6ZcrJxsEx7dSJ/lG5h3gPsfoNWxnTiomWV/Zp2E2YwfavjPhWjrBwrIxFTATBgkqhkiG9w0BCRUxBgQEAQAAADA7MB8wBwYFKw4DAhoEFK1iGPraW4Gf5vcA8VhmKYAjnMYSBBQ3NJrH26kpqUt0uX/tsyJX2WCqKAICB9A=";
    var cert = new X509Certificate2(Convert.FromBase64String(certString), "testPassword");
    await server.AuthenticateAsServerAsync(cert);

    // Write 10500, 100 byte frames
    for (int i = 0; i < 10500; i++)
    {
        await server.WriteAsync(new byte[100]);
    }
}

async Task Client()
{
    await client.AuthenticateAsClientAsync(new SslClientAuthenticationOptions
    {
        TargetHost = "localhost",
        RemoteCertificateValidationCallback = (sender, certificate, chain, sslPolicyErrors) => true
    });

    // 1MB
    var buffer = new byte[0x100000];

    do
    {
        var read = await client.ReadAsync(buffer).ConfigureAwait(false);
        if (read == 0)
        {
            break;
        }

        Console.WriteLine("Read: " + read);
    }
    while (true);
}

var s1 = Server();
var s2 = Client();
await Task.WhenAll(s1, s2);


internal class DuplexPipe : IDuplexPipe
{
    public DuplexPipe(PipeReader reader, PipeWriter writer)
    {
        Input = reader;
        Output = writer;
    }

    public PipeReader Input { get; }

    public PipeWriter Output { get; }

    public static DuplexPipePair CreateConnectionPair(PipeOptions inputOptions, PipeOptions outputOptions)
    {
        var input = new Pipe(inputOptions);
        var output = new Pipe(outputOptions);

        var transportToApplication = new DuplexPipe(output.Reader, input.Writer);
        var applicationToTransport = new DuplexPipe(input.Reader, output.Writer);

        return new DuplexPipePair(applicationToTransport, transportToApplication);
    }

    // This class exists to work around issues with value tuple on .NET Framework
    public readonly struct DuplexPipePair
    {
        public IDuplexPipe Transport { get; }
        public IDuplexPipe Application { get; }

        public DuplexPipePair(IDuplexPipe transport, IDuplexPipe application)
        {
            Transport = transport;
            Application = application;
        }
    }
}

internal class DuplexPipeStream : Stream
{
    private readonly PipeReader _input;
    private readonly PipeWriter _output;
    private readonly bool _throwOnCancelled;
    private volatile bool _cancelCalled;

    public DuplexPipeStream(PipeReader input, PipeWriter output, bool throwOnCancelled = false)
    {
        _input = input;
        _output = output;
        _throwOnCancelled = throwOnCancelled;
    }

    public void CancelPendingRead()
    {
        _cancelCalled = true;
        _input.CancelPendingRead();
    }

    public override bool CanRead => true;

    public override bool CanSeek => false;

    public override bool CanWrite => true;

    public override long Length
    {
        get
        {
            throw new NotSupportedException();
        }
    }

    public override long Position
    {
        get
        {
            throw new NotSupportedException();
        }
        set
        {
            throw new NotSupportedException();
        }
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // ValueTask uses .GetAwaiter().GetResult() if necessary
        // https://github.com/dotnet/corefx/blob/f9da3b4af08214764a51b2331f3595ffaf162abe/src/System.Threading.Tasks.Extensions/src/System/Threading/Tasks/ValueTask.cs#L156
        return ReadAsyncInternal(new Memory<byte>(buffer, offset, count), default).Result;
    }

    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken = default)
    {
        return ReadAsyncInternal(new Memory<byte>(buffer, offset, count), cancellationToken).AsTask();
    }

    public override ValueTask<int> ReadAsync(Memory<byte> destination, CancellationToken cancellationToken = default)
    {
        return ReadAsyncInternal(destination, cancellationToken);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        WriteAsync(buffer, offset, count).GetAwaiter().GetResult();
    }

    public override async Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        if (buffer != null)
        {
            _output.Write(new ReadOnlySpan<byte>(buffer, offset, count));
        }

        await _output.FlushAsync(cancellationToken);
    }

    public override async ValueTask WriteAsync(ReadOnlyMemory<byte> source, CancellationToken cancellationToken = default)
    {
        _output.Write(source.Span);
        await _output.FlushAsync(cancellationToken);
    }

    public override void Flush()
    {
        FlushAsync(CancellationToken.None).GetAwaiter().GetResult();
    }

    public override Task FlushAsync(CancellationToken cancellationToken)
    {
        return WriteAsync(null, 0, 0, cancellationToken);
    }

    private async ValueTask<int> ReadAsyncInternal(Memory<byte> destination, CancellationToken cancellationToken)
    {
        while (true)
        {
            var result = await _input.ReadAsync(cancellationToken);
            var readableBuffer = result.Buffer;
            try
            {
                if (_throwOnCancelled && result.IsCanceled && _cancelCalled)
                {
                    // Reset the bool
                    _cancelCalled = false;
                    throw new OperationCanceledException();
                }

                if (!readableBuffer.IsEmpty)
                {
                    // buffer.Count is int
                    var count = (int)Math.Min(readableBuffer.Length, destination.Length);
                    readableBuffer = readableBuffer.Slice(0, count);
                    readableBuffer.CopyTo(destination.Span);
                    return count;
                }

                if (result.IsCompleted)
                {
                    return 0;
                }
            }
            finally
            {
                _input.AdvanceTo(readableBuffer.End, readableBuffer.End);
            }
        }
    }

    public override IAsyncResult BeginRead(byte[] buffer, int offset, int count, AsyncCallback callback, object state)
    {
        throw new NotSupportedException();
    }

    public override int EndRead(IAsyncResult asyncResult)
    {
        throw new NotSupportedException();
    }

    public override IAsyncResult BeginWrite(byte[] buffer, int offset, int count, AsyncCallback callback, object state)
    {
        throw new NotSupportedException();
    }

    public override void EndWrite(IAsyncResult asyncResult)
    {
        throw new NotSupportedException();
    }
}
Author: davidfowl
Assignees: -
Labels:

area-System.Net.Security, tenet-performance, untriaged

Milestone: -

@geoffkizer
Copy link
Contributor

Yeah, I agree, we could do better here.

That said, I'm not sure how much of a potential win there is here in practice. As you say above, it's mostly about avoiding extra async machinery here. Should be relatively straightforward to measure.

@davidfowl
Copy link
Member Author

The issue filed in Kestrel was filed by a customer that experience slow downloads. My guess is that this also affects YARP performance for large responses. We should measure regardless though.

@samsp-msft
Copy link
Member

We do see a drop off of performance in YARP compared to Envoy for 10k response size. I'm wondering if this is a factor there?

@karelz
Copy link
Member

karelz commented Mar 2, 2021

Triage: @wfurt wants to do a BenchmarkDotNet test to see how much we could get. Based on estimated impact, we can decide if it is worth it in 6.0.
This will help with decryption. Encryption might have similar problem, but in different code path.
YARP may benefit from this.

@karelz karelz added enhancement Product code improvement that does NOT require public API changes/additions and removed untriaged New issue has not been triaged by the area owner labels Mar 2, 2021
@karelz karelz added this to the 6.0.0 milestone Mar 2, 2021
@scalablecory
Copy link
Contributor

I don't really see any downsides to doing this, assuming it won't need a huge code refactor, and so long as we don't introduce extra reads on the base stream.

@wfurt
Copy link
Member

wfurt commented Mar 3, 2021

I did quick PoC to bundle the frames and still get only frame.Lenght of data back from OpenSSL. So there may be little bit more trickery and possibly OS dependency.

@davidfowl
Copy link
Member Author

I did quick PoC to bundle the frames and still get only frame.Lenght of data back from OpenSSL. So there may be little bit more trickery and possibly OS dependency.

What do you mean? You can't get the next frame after reading the first one?

@geoffkizer
Copy link
Contributor

@wfurt can you post the code?

@geoffkizer
Copy link
Contributor

@wfurt if you are working on a fix for this it would be nice to fix #49086 at the same time -- that may just fall out of the fix here.

@wfurt
Copy link
Member

wfurt commented Mar 5, 2021

So it goes like this @geoffkizer. When we feed multiple frames to OpenSSL, they would get queued in input BIO (note that outBuffer is also the actual inout)

int retVal = BioWrite(context.InputBio!, outBuffer, offset, count);

When we try to read from the SSL it returns data only from first frame

fixed (byte* fixedBuffer = outBuffer)
{
retVal = Ssl.SslRead(context, fixedBuffer + offset, outBuffer.Length);
}

so the fix (in my prototype) was to check if input BIO has any unprocessed data and turn the crank again.

With that I get what @davidfowl wants:

furt@ubu18:~/projects/sslBuffer$ /home/furt/projects/sslBuffer/bin/Debug/net6.0/linux-x64/publish/sslBuffer | grep Read:
Read: 26400
Read: 26800
Read: 26800
...
Read: 26800
Read: 26800
Read: 5200

I see similar behavior on Windows. There was trickery as there is somewhat obscure mechanism to report unprocessed data.
I did not touch macOS yet but the behavior and part of the fix is platform dependent.

Part of the risk is fact that there are data buffered outside of SslStream visibility. Without the added retry data would still be consumable but the processing would hang as SslStream did not see any new data. We will need to be careful about it.

@geoffkizer
Copy link
Contributor

It kind of makes sense that schannel/OpenSsl only process a single frame at a time. If they processed more than one, it would be harder to report the results.

We know the frame size in SslStream, so we could just feed a single frame at a time to schannel/OpenSsl and (hopefully) not rely on their "unprocessed data" mechanism. That seems like it avoids most of the extra complication. Is that correct?

@wfurt
Copy link
Member

wfurt commented Mar 8, 2021

I tried to write a benchmark to see if there are any measurable improvements but it is not stable enough. I tried to increase the loops and fiddle with various parameters but the variations are still much larger than measurable impact e.g. +-10%


[2021/03/07 22:12:48][INFO] |          Method |        Job |     Toolchain |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | RatioSD |   Gen 0 | Gen 1 | Gen 2 | Allocated |
[2021/03/07 22:12:48][INFO] |---------------- |----------- |-------------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|--------:|--------:|------:|------:|----------:|
[2021/03/07 22:12:48][INFO] | SmallPipeWrites | Job-WZHAFT | /bulk/corerun | 618.1 us | 32.51 us | 37.44 us | 618.2 us | 569.0 us | 700.4 us |  0.91 |    0.08 | 18.0000 |     - |     - |     77 KB |
[2021/03/07 22:12:48][INFO] | SmallPipeWrites | Job-DHRXVJ | /main/corerun | 677.3 us | 38.10 us | 42.35 us | 671.5 us | 609.8 us | 770.5 us |  1.00 |    0.00 | 16.0000 |     - |     - |     67 KB |

[2021/03/07 22:14:41][INFO] |          Method |        Job |     Toolchain |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | RatioSD |   Gen 0 | Gen 1 | Gen 2 | Allocated |
[2021/03/07 22:14:41][INFO] |---------------- |----------- |-------------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|--------:|--------:|------:|------:|----------:|
[2021/03/07 22:14:41][INFO] | SmallPipeWrites | Job-GFZQFZ | /bulk/corerun | 638.5 us | 31.23 us | 34.72 us | 635.1 us | 590.4 us | 706.0 us |  1.08 |    0.07 | 18.0000 |     - |     - |     74 KB |
[2021/03/07 22:14:41][INFO] | SmallPipeWrites | Job-IJIPFV | /main/corerun | 592.5 us | 20.92 us | 23.25 us | 592.9 us | 544.6 us | 631.7 us |  1.00 |    0.00 | 16.0000 |     - |     - |     67 KB |

I did not observe difference with the change on the other tests.

Since the change is pretty small, and I could see possibility of impact on system calls, (e.g. transport other than memory) I'm inclined to move forward even if we cannot demonstrate improvement.

cc: @stephentoub for any additional thoughts or hints on the benchmark.

Linux changes can be viewed here: https://github.com/dotnet/runtime/compare/main...wfurt:bulkTls?expand=1

@stephentoub
Copy link
Member

so the fix (in my prototype) was to check if input BIO has any unprocessed data and turn the crank again.

This correlates to the BioCtrlPending call?
https://github.com/dotnet/runtime/compare/main...wfurt:bulkTls?expand=1#diff-5d47fb42fd96c09e13b52482e126d7a282513646feb5f672108dbbd7dd0d5f4aR382
How expensive is that call, and if not super cheap, is this overhead we wouldn't otherwise be paying?

I could see possibility of impact on system calls

How will this impact syscalls? I thought this was purely about avoiding traversing through a few layers of calls to get back to this place again. It doesn't change how many reads or writes or such on the underlying stream we're issuing, though, right?

I tried to write a benchmark to see if there are any measurable improvements but it is not stable enough

Can you share it?

@wfurt
Copy link
Member

wfurt commented Mar 8, 2021

https://github.com/dotnet/performance/compare/master...wfurt:sslBulk?expand=1

As far as the syscall. When somebody reads data from the stream, that may do something about it - like write to another descriptor and that can be also possibly visible on TCP. The overhead of traversing layers is negligible IMHO comparing to cost of the actual crypto.

The BioCtrlPending should be cheap on memory stream.
https://github.com/openssl/openssl/blob/9afc6c54314f94c0dcb4168d01554497bfaeae4f/crypto/bio/bss_mem.c#L329-L331

    case BIO_CTRL_PENDING:
        ret = (long)bm->length;
        break;

so it is mostly cost of the call. This can be possibly different on other platforms.

@stephentoub
Copy link
Member

As far as the syscall. When somebody reads data from the stream, that may do something about it - like write to another descriptor and that can be also possibly visible on TCP.

I see. The argument is that the caller can't do the same thing because they don't know whether an attempt to fill their buffer after the first call will result in additional I/O / whether the SslStream already has decryptable data buffered. Ok, that's reasonable.

additional thoughts or hints on the benchmark.

Can you try running the ASP.NET benchmarks with it, e.g. the ones for YARP?

@davidfowl
Copy link
Member Author

There's more overhead with wrapping streams as you pay the async overhead per read. For example, in the linked issue (dotnet/aspnetcore#30545), on HttpClient you pay for:

ContentLengthReadStream.ReadAsync -> HttpConnection.ReadAsync -> SslStream.ReadAsync

On the server side you pay for:

HttpRequestStream -> Http1ContentLengthMessageBody -> SslStream

@stephentoub
Copy link
Member

There's more overhead with wrapping streams as you pay the async overhead per read

In practice, does that show up as being measurable, in particular when compared to encryption/decryption costs?

@davidfowl
Copy link
Member Author

You also need to look at the cost of async reads from the underlying Stream right? This may result in more allocations as well, not just CPU cost.

@stephentoub
Copy link
Member

You also need to look at the cost of async reads from the underlying Stream right?

Why would or would not the presence of this result in more or fewer reads on the underlying stream?

@davidfowl
Copy link
Member Author

Why would or would not the presence of this result in more or fewer reads on the underlying stream?

Assume the incoming user buffer was 1MB, you could imagine reading directly into the user buffer and decrypting in place. Then copying the remainder into the internal buffer.

@stephentoub
Copy link
Member

I'm missing something :) How does your scenario impact the number of reads on the underlying stream?

@davidfowl
Copy link
Member Author

The current fix doesn't, but in this scenario, you can do bigger reads on the underlying Stream because you have a bigger buffer 😄 (or maybe I'm missing something)

@stephentoub
Copy link
Member

The current fix doesn't, but in this scenario, you can do bigger reads on the underlying Stream because you have a bigger buffer

Ok, sure, that just seems orthogonal to this issue. :)

(Of course, if you were doing reads directly into the user's buffer, then you're more likely to want to want to aggressively decrypt what was in it, as you need to copy out of it anything that wasn't decrypted, and have room to store it.)

@stephentoub
Copy link
Member

Anyway, as noted above, I'd be interested in us running this through some real benchmarks (e.g. YARP), but the change is small and localized, and it's hard to come up with a reason why it would make things worse in any real use, so it should be fine to do if we suspect it might help.

@benaadams
Copy link
Member

... when compared to encryption/decryption costs?

Most server CPUs have specific instructions and can encrypt & decrypt at more than 10Gbps per core; and an AMD Ryzen 7 1800X can do AES-128-GCM at 24Gbps per core and AES-256-GCM at 21GBps per core*

*Non-hyperthreaded cores https://calomel.org/aesni_ssl_performance.html

@geoffkizer
Copy link
Contributor

As I mentioned above:

We know the frame size in SslStream, so we could just feed a single frame at a time to schannel/OpenSsl and (hopefully) not rely on their "unprocessed data" mechanism. That seems like it avoids most of the extra complication. Is that correct?

This seems like it would avoid the call to BioCtrlPending entirely and simplify the code here.

@davidfowl
Copy link
Member Author

@wfurt It might be useful to run the same scenario that @smourier ran, they filed the original issue and maybe there's a more concrete test there that shows the difference.

I'm also going to add a download test in the ASP.NET Core benchmarks (GB range) so we can track changes here.

@wfurt
Copy link
Member

wfurt commented Mar 10, 2021

So basically loop inside ReadAsyncInternal and pay the overhead of multiple internal calls @geoffkizer? The benefit would be observed externally as ReadAsync would yield bigger chunks of data?
As I was thinking about it more, I can probably craft perf test without the pipelines to see if that is more stable.

@davidfowl
Copy link
Member Author

Maybe use loop back instead of using pipelines. (I should have done that 😄 )

@wfurt
Copy link
Member

wfurt commented Mar 10, 2021

There is also test using NamedPipe. That may better represent reality as it does system calls and context switches.

@geoffkizer
Copy link
Contributor

So basically loop inside ReadAsyncInternal and pay the overhead of multiple internal calls @geoffkizer? The benefit would be observed externally as ReadAsync would yield bigger chunks of data?

Yeah, exactly.

I believe we will end up calling OpenSsl/SChannel the same number of times anyway, since they will only process one frame at a time. So it's really just about where we loop in our code. and I doubt that matters much in terms of perf.

@geoffkizer
Copy link
Contributor

Re perf testing, this may be useful: https://github.com/geoffkizer/netperf/tree/master/SocketPerfTest

@davidfowl
Copy link
Member Author

This issue started with a customer complaint about slow file downloads in ASP.NET Core. I understand we want to isolate the change to avoid noise but maybe we should use the customer example as a way to vet the change?

@geoffkizer
Copy link
Contributor

maybe we should use the customer example as a way to vet the change?

Do we have that code? Or are you saying that the code above is that code?

@davidfowl
Copy link
Member Author

There's some code in linked issue and there's also a scenario in YARP that might be interesting that is outlined there.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 17, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 24, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jul 24, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Security enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Projects
None yet
8 participants