Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for sentinel value when setting HTTP/3 error code #57934

Merged
merged 5 commits into from
Sep 20, 2024

Conversation

amcasey
Copy link
Member

@amcasey amcasey commented Sep 18, 2024

If no error code has been set, IProtocolErrorFeature.Error will be -1. If we pass that through verbatim, it will be caught by validation in the setter (ironically, of the same property on the same feature object), resulting in an exception and a Critical (but apparently benign) log message.

Fixes #57933

If no error code has been set, `IProtocolErrorFeature.Error` will be `-1`.  If we pass that through verbatim, it will be caught by validation in the setter (ironically, of the same property on the same feature object), resulting in an exception and a Critical (but apparently benign) log message.

Fixes dotnet#57933
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions label Sep 18, 2024
@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

This needs a regression test, but I figured I'd run the approach by people before figuring out how to reliably simulate connection timeouts.

@halter73
Copy link
Member

Seems good to me. I see that we're calling QuicTransportOptions.ValidateErrorCode everywhere you can manually set the error code, and the IProtocolErrorCodeFeature.Error doc comment indicates "The property returns -1 if the error code hasn't been set."

Do we plan to backport this?

@adityamandaleeka
Copy link
Member

If the critical message is indeed benign, this seems fine to fix in 9.0 servicing to me.

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

AFAICT, it's benign. It only happens as the connection's being aborted and other connections should be unaffected.

Of course, as I type this, it occurs to me I should confirm that this doesn't prevent the connection from being torn down...

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

I was going to propose reverting #55282 as a safer fix (since it was made to prevent a benign assert in S.N.Q), but I see from the description of that PR that S.N.Q. probably made checks stricter once we stopped passing them bad data, so a forward fix may be our only option.

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

Hmm, it looks like we were expecting to hit this handler when the connection timed out:

catch (QuicException ex) when (ex.QuicError == QuicError.ConnectionTimeout)

@halter73
Copy link
Member

Hmm, it looks like we were expecting to hit this handler when the connection timed out:

catch (QuicException ex) when (ex.QuicError == QuicError.ConnectionTimeout)

I don't think this is the same thing as a keep alive timeout. QuicError.ConnectionTimeout seems to be about a transport-level timeout "waiting for a response from the peer.". The comment below also seems to indicate that the peer is unresponsive after opening a stream without closing it. This is not a scenario where a keep alive timeout would apply.

// ConnectionTimeout can happen when the client app is shutdown without aborting the connection.
// For example, a console app makes a HTTP/3 request with HttpClient and then exits without disposing the client.

What are we actually seeing for a keep alive timeout? Given that we appear to cancel the token passed into QuicConnectionContext.AcceptAsync in StopProcessingNextRequest which is what we should be calling in the event of a keep alive timeout, I'd expect an OCE which we appear to handle properly.

// Break out of AcceptStreams so connection state can be updated.
_acceptStreamsCts.Cancel();

@JamesNK Does that line up with your expectations?

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

I was in the debug-only catch-all block and the code was ConnectionIdle. I haven't seen it since that first time though.

Edit: I still see connection idle - I put my breakpoint in the wrong place.

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

Anecdata:

var client = new HttpClient()
{
    BaseAddress = new Uri("https://localhost:5001"),
    DefaultRequestVersion = new Version(3, 0),
    DefaultVersionPolicy = HttpVersionPolicy.RequestVersionExact,
};

try
{
    var response = await client.GetAsync("/");
    response.EnsureSuccessStatusCode();

    var data = await response.Content.ReadAsStringAsync();
    Console.WriteLine(data);
}
catch (HttpRequestException e)
{
    Console.WriteLine($"Request error: {e.Message}");
}

//Thread.Sleep(TimeSpan.FromSeconds(30)); // ConnectionIdle with Sleep, ConnectionTimeout without
client.Dispose();

I'm interpreting that as ConnectionIdle if the client has gracefully disconnected and ConnectionTimeout otherwise.

@halter73
Copy link
Member

Interesting. I wouldn't have expected QuicError.ConnectionIdle considering we try to disable that here:

IdleTimeout = Timeout.InfiniteTimeSpan, // Kestrel manages connection lifetimes itself so it can send GoAway's.

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

I'm having trouble figuring out whether the unexpected ConnectionIdle exception is related. I don't think so, since we also need to fix up -1 in the ConnectionTimeout case? Should I just file a separate issue to investigate that later or is it worth pursuing now in case something in Quic change?

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

FWIW, S.N.Q is getting it via native callback, presumably from msquic:

>	System.Net.Quic.QuicConnection.HandleEventShutdownInitiatedByTransport(ref Microsoft.Quic.QUIC_CONNECTION_EVENT._Anonymous_e__Union._SHUTDOWN_INITIATED_BY_TRANSPORT_e__Struct) Line 659	C#
 	System.Net.Quic.QuicConnection.HandleConnectionEvent(ref Microsoft.Quic.QUIC_CONNECTION_EVENT) Line 761	C#
 	System.Net.Quic.QuicConnection.NativeCallback(Microsoft.Quic.QUIC_HANDLE*, void*, Microsoft.Quic.QUIC_CONNECTION_EVENT*) Line 796	C#

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

If the exception occurs, these lines won't be executed

KestrelMetrics.AddConnectionEndReason(MetricsContext, reason);

if (TryStopAcceptingStreams())
{
    SendGoAwayAsync(GetCurrentGoAwayStreamId()).Preserve();
}

_multiplexedContext.Abort(ex);

Losing the metric is a bummer, but not shipblocking (IMO).
TryStopAcceptingStreams should already have been called here
QuicConnectionContext.Abort appears to already have been called here

Edit: the catch block calls Abort anyway, so we're only changing the error code and end reason. Though, as mentioned above, those calls were probably going to no-op.

Then, in ProcessRequestsAsync, we lose

// Wait for active requests to complete.
while (_activeRequestCount > 0)
{
    await _streamCompletionAwaitable;
}

TimeoutControl.CancelTimeout();

I'm guessing, but haven't verified, that the stream count has to be zero for the connection to time out?

Edit: I'm not convinced this is the case

I'm guessing, but haven't verified, that not cancelling the timeout is suboptimal but harmless.

Edit: I'm still reasonably sure of this, since the control is per-connection. I guess it's possible a timeout could fire, but the connection should already have been aborted, so it shouldn't do anything worse than log unnecessarily?

@amcasey
Copy link
Member Author

amcasey commented Sep 18, 2024

I set this in the repro server, but it doesn't seem to actually matter - the connection times out after about 20 seconds regardless.

options.Limits.KeepAliveTimeout = TimeSpan.FromSeconds(5);

Edit: Stephen pointed out that I was once again being bitten by the fact that we ignore timeouts when a debugger is attached.

amcasey and others added 2 commits September 19, 2024 10:18
Co-authored-by: James Newton-King <james@newtonking.com>
@amcasey
Copy link
Member Author

amcasey commented Sep 19, 2024

FYI @ManickaP that there are late changes in this area. I don't expect S.N.Q to have to react, but any feedback would be welcome.

Any thoughts on ConnectionIdle would also be appreciated.

@amcasey
Copy link
Member Author

amcasey commented Sep 19, 2024

It seems unlikely that this change to Http3Connection broke HttpClientHttp2InteropTests.ServerReset_BeforeRequestBody_ClientBodyThrows.

@ManickaP
Copy link
Member

FYI @ManickaP that there are late changes in this area. I don't expect S.N.Q to have to react, but any feedback would be welcome.

Do you still need to differentiate the not-set ErrorCode state? If not, why don't you use NoError as default value instead of -1. Otherwise, LGTM.

Any thoughts on ConnectionIdle would also be appreciated.

But the client still has IdleTimeout set. It gets sent to the server in the connection settings and each side will set their idle timeout to the min of the 2 values: https://www.rfc-editor.org/rfc/rfc9000.html#name-idle-timeout.

@amcasey
Copy link
Member Author

amcasey commented Sep 20, 2024

Do you still need to differentiate the not-set ErrorCode state? If not, why don't you use NoError as default value instead of -1. Otherwise, LGTM.

I wondered the same thing - I'll investigate when we're not trying to squeeze it into RC2.

@amcasey amcasey merged commit 9afcb72 into dotnet:main Sep 20, 2024
27 checks passed
@amcasey amcasey deleted the gh57933 branch September 20, 2024 15:24
@dotnet-policy-service dotnet-policy-service bot added this to the 10.0-preview1 milestone Sep 20, 2024
@amcasey
Copy link
Member Author

amcasey commented Sep 20, 2024

Updated #49780 for the idle behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Critical log on quic connection timeout
5 participants