Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value #66297

Closed
CarnaViire opened this issue Mar 7, 2022 · 24 comments · Fixed by #66607 or #71785
Closed

Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value #66297

CarnaViire opened this issue Mar 7, 2022 · 24 comments · Fixed by #66607 or #71785
Assignees
Labels
area-System.Net.Http enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@CarnaViire
Copy link
Member

Windows will timeout a connection attempt after ~21s. Linux apparently waits much longer – around 3 minutes or more.

The connection pooling changes in 6.0 made it so that a pending connection is not associated with a specific request. As such, request cancellation no longer causes pending connections to be canceled. We do not impose any connect timeout by default (DefaultConnectTimeout = Timeout.InfiniteTimeSpan), so we are now exposed to Linux’s long connection timeout. In case of small request timeouts, this may lead to a potentially big number of requests timing out as they try to wait/reuse the same "pending" connection.

We should set a default ConnectTimeout to something reasonably small. We should also consider backporting that to 6.0.

I propose to unify Windows and Linux timeouts, i.e. set DefaultConnectTimeout to 21s.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net.Http untriaged New issue has not been triaged by the area owner labels Mar 7, 2022
@ghost
Copy link

ghost commented Mar 7, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Windows will timeout a connection attempt after ~21s. Linux apparently waits much longer – around 3 minutes or more.

The connection pooling changes in 6.0 made it so that a pending connection is not associated with a specific request. As such, request cancellation no longer causes pending connections to be canceled. We do not impose any connect timeout by default (DefaultConnectTimeout = Timeout.InfiniteTimeSpan), so we are now exposed to Linux’s long connection timeout. In case of small request timeouts, this may lead to a potentially big number of requests timing out as they try to wait/reuse the same "pending" connection.

We should set a default ConnectTimeout to something reasonably small. We should also consider backporting that to 6.0.

I propose to unify Windows and Linux timeouts, i.e. set DefaultConnectTimeout to 21s.

Author: CarnaViire
Assignees: -
Labels:

area-System.Net.Http, untriaged

Milestone: -

@stephentoub
Copy link
Member

This seems reasonable. Technically it's a breaking change, but if a connection is taking that long to be established, it's likely going to fail, anyway. There's also value in consistency across platforms.

Is the exception that occurs in the case of a timeout due to ConnectTimeout the same that occurs in response to the OS timing out a connection?

@CarnaViire
Copy link
Member Author

Is the exception that occurs in the case of a timeout due to ConnectTimeout the same that occurs in response to the OS timing out a connection?

Unfortunately, no.

If OS did the timeout, the error would bubble up from sockets, so we would see HttpRequestException with inner SocketException ((110): Connection timed out on Linux and (10060): A connection attempt failed because the connected party did not properly respond after a period of time.... on Windows)

If it was our timeout, we would see TaskCanceledException with inner TimeoutException: A connection could not be established within the configured ConnectTimeout.

@geoffkizer
Copy link
Contributor

Perhaps we should use 15s. That means that typically we'd do the timeout ourselves on both platforms. It's also a reasonably round number as compared to 21s.

@wfurt wfurt removed the untriaged New issue has not been triaged by the area owner label Mar 8, 2022
@wfurt wfurt added this to the 7.0.0 milestone Mar 8, 2022
@MihaZupan
Copy link
Member

Since we try connecting to all IPs returned by DNS (currently in serial), would changing the ConnectTimeout to something lower than the OS's timeout change the behavior so that we may no longer end up trying different IPs?

@CarnaViire
Copy link
Member Author

Since we try connecting to all IPs returned by DNS (currently in serial), would changing the ConnectTimeout to something lower than the OS's timeout change the behavior so that we may no longer end up trying different IPs?

You are right 😢 and we currently don't have any per-IP timeout for multi-connect in Sockets (only a cancellation token for the whole operation) -- should we consider some new API like that?

@ManickaP
Copy link
Member

ManickaP commented Mar 9, 2022

would changing the ConnectTimeout to something lower than the OS's timeout change the behavior so that we may no longer end up trying different IPs?

Only in case of Windows. In case of Linux, we're not losing anything, since the OS timeout ends up being ~2 minutes. The user can always change it back to infinite if this proves problematic. They can also implement their own ConnectCallback.

Eventually we could do something like happy eyeballs and try to connect to multiple IPs in parallel (related #26177).

@wfurt
Copy link
Member

wfurt commented Mar 9, 2022

Even with eyeballs, you can have multiple addresses for the same family. And only some of them may be unreachable/unresponsive. But I would feel in most cases callers would care about overall responsiveness instead of details for each individual address.

@ManickaP
Copy link
Member

ManickaP commented Mar 9, 2022

By "something like happy eyeballs" I meant trying the parallel connect, even to multiple addresses of the same family at the same time. This is just an idea so please do not crucify me on all the details and pitfalls. I'm aware this wouldn't be as easy as running n connects in n tasks and seeing which finishes first. It would need to be carefully designed to not to drag down perf, hog ephemeral ports etc.

@CarnaViire
Copy link
Member Author

Linking the existing issue containing API for socket's happy-eyeball-like parallel connect #33418
We might want to revisit it and add more flexible options, e.g. not connecting to all IPs at once, but rather try to connect to the first IP, wait for 5s, and if it doesn't connect, try to connect to the second one without cancelling the first one, etc.

@CarnaViire
Copy link
Member Author

Triage: change DefaultConnectTimeout to 15s in 7.0 (breaking change).
Multi-connect issues should be relatively rare. If it occurs, ConnectTimeout could be to set back to a big value. If we would see demand we would prioritize socket connect changes linked above.
For 6.0, consider adding telemetry to help troubleshoot hanged "pending" connection issue.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 14, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 15, 2022
CarnaViire added a commit that referenced this issue Mar 17, 2022
The additional trace should help troubleshoot hanged "pending" connection issues (i.e. the reasons why new connections were not created) in HttpConnectionPool.

The issue manifests as growing number of timeouted requests. It is not easy to link a timeouted request (esp. with a small timeout) to a connection that was created 3 minutes ago but still is not connected. There are no traces of the state of HttpConnectionPool, e.g. number of pending connections, so it is unclear from existing traces why the pool decided not to create a connection.

Related to #66297
radekdoulik pushed a commit to radekdoulik/runtime that referenced this issue Mar 30, 2022
The additional trace should help troubleshoot hanged "pending" connection issues (i.e. the reasons why new connections were not created) in HttpConnectionPool.

The issue manifests as growing number of timeouted requests. It is not easy to link a timeouted request (esp. with a small timeout) to a connection that was created 3 minutes ago but still is not connected. There are no traces of the state of HttpConnectionPool, e.g. number of pending connections, so it is unclear from existing traces why the pool decided not to create a connection.

Related to dotnet#66297
@ghost ghost locked as resolved and limited conversation to collaborators Apr 14, 2022
@CarnaViire
Copy link
Member Author

I've put up a PR to revert this change #68649

The change proved to be problematic as-is, because

  1. There is a high chance that the user code has a retry logic that does retry when SocketException is caught but does not retry in case of OperationCanceledException. This can lead to a spike of (unretried) exceptions which weren't there before. Example:
    https://github.com/NuGet/NuGet.Client/blob/93bb3921490e9d499be1fb48d6cb9dcaff734db1/src/NuGet.Core/NuGet.Protocol/HttpSource/HttpRetryHandler.cs#L212-L263

  2. The workaround with setting the ConnectTimeout back to infinity may not be available -- if HttpClientHandler is used instead of SocketsHttpHandler (e.g. due to reusing code between .NET Framework and .NET Core/.NET 5+), or if the code that executes the requests is hidden inside a library.

I'm reopening the issue for further discussion.

@CarnaViire CarnaViire reopened this Apr 28, 2022
@CarnaViire CarnaViire added the untriaged New issue has not been triaged by the area owner label Apr 28, 2022
@dotnet dotnet unlocked this conversation Apr 28, 2022
@CarnaViire
Copy link
Member Author

CarnaViire commented Apr 28, 2022

Triage: we should address the issue with long OS timeouts in 7.0 in one way or another (retaining the old exception type, tuning OS timeouts, reworking pooling logic in some way).

@CarnaViire CarnaViire removed the untriaged New issue has not been triaged by the area owner label Apr 28, 2022
@pentp
Copy link
Contributor

pentp commented May 3, 2022

As a SocketsHttpHandler user I would find it very surprising to get a TaskCanceledException for something I haven't canceled. I would expect either SocketException or TimeoutException.

@karelz
Copy link
Member

karelz commented May 10, 2022

Triage: Let's separate 2 parts here:

  1. nuget (and potentially other customers) was broken. We may break them again, but we need to document it properly.
    • The remaining risk is that someone doesn't like 15s timeout and cannot change it easily (e.g. because they target .NET Standard) -- ugly workaround for that is Reflection.
  2. Which exception to throw to be consistent with our stack -- let's look at recent changes we did in the space and align with them. We may decide to follow that pattern (TaskCanceledException with inner TimeoutException), or fake SocketsException with inner TimeoutException, or start throwing directly TimetoutException.

@CarnaViire
Copy link
Member Author

Which exception to throw to be consistent with our stack -- let's look at recent changes we did in the space and align with them. We may decide to follow that pattern (TaskCanceledException with inner TimeoutException), or fake SocketsException with inner TimeoutException, or start throwing directly TimetoutException.

We already follow the pattern "TaskCanceledException with inner TimeoutException" in both places:
For full-request-timeout

else if (!pendingRequestsCts.IsCancellationRequested)
{
// If this exception is for cancellation, but cancellation wasn't requested, either by the caller's token or by the pending requests source,
// the only other cause could be a timeout. Treat it as such.
e = toThrow = new TaskCanceledException(SR.Format(SR.net_http_request_timedout, _timeout.TotalSeconds), new TimeoutException(e.Message, e), oce.CancellationToken);
}

and for connect timeout
private static Exception CreateConnectTimeoutException(OperationCanceledException oce)
{
// The pattern for request timeouts (on HttpClient) is to throw an OCE with an inner exception of TimeoutException.
// Do the same for ConnectTimeout-based timeouts.
TimeoutException te = new TimeoutException(SR.net_http_connect_timedout, oce.InnerException);
Exception newException = CancellationHelper.CreateOperationCanceledException(te, oce.CancellationToken);
ExceptionDispatchInfo.SetCurrentStackTrace(newException);
return newException;
}

There was a huge discussion in #21965 on what is better to do on timeouts, which ended up introducing this specific pattern. In general, we didn't want to change the exception type as historically TCE was thrown, and we just added a way to programmatically check whether it was a timeout (by putting TimeoutException inside)

Re mimicking SocketException, it is a bit funny given that we make sure we throw OCE and not SocketException on cancellation in Sockets 😄

catch (SocketException se) when (se.SocketErrorCode == SocketError.OperationAborted)
{
cancellationToken.ThrowIfCancellationRequested();
throw;
}

But there's no way on Socket level to say that the token signifies the timeout though.
I'm also not sure it would be ok to break existing behavior of ConnectTimeout, e.g. someone might depend that TCE was thrown on it?

@CarnaViire
Copy link
Member Author

We discussed it within the team, and we don't see a solution that would not be breaking in one way or another.

  1. The least breaking seem to be to simulate SocketException. This has more probability to be already retried by the user code, and that was the "default" behavior on Windows anyway, just after a bit longer interval (21s). This will bring parity between OSes in a more expected way. But we will effectively change existing ConnectTimeout property behavior.

  2. The other option is to bring back the initial PR as is (with just changing the default ConnectTimeout value) and document this as a breaking change properly, because we are already "following the pattern". This might be troublesome for all the reasoning already brought up when we reverted the PR (users not handling TCE properly).

Re following pattern of TCE with inner TimeoutException -- first, it was actually added to ConnectTimeout only in 6.0, before it was just plain TCE. And the exception nesting is a bit of a mess now (TCE->TimeoutException->TCE->TimeoutException, and it also has a misleading message, but there's an issue to fix that #67505) — the bottom line is I don't feel like a lot of people are depending on that.... and other point that was brought on team discussion is that we need to have a way to tell the difference between timeout on Connect and overall request timeout, and we can't do that in a good way now.

@halter73
Copy link
Member

halter73 commented Jun 1, 2022

I vote for simulating a SocketException with an inner TimeoutException if it's the socket that failed to connect within the timeout in order to more easily distinguish from a reqular request timeout. I usually don't really care if it's the OS or the framework that timed out the socket connection, but the message should make this clear.

it is a bit funny given that we make sure we throw OCE and not SocketException on cancellation in Sockets 😄

If the cancellationToken fires before the timeout it should still be an OCE particularly if this is what could have caused the "timeout". I realize this is racy if in the background.

@antonfirsov
Copy link
Member

antonfirsov commented Jun 1, 2022

SocketException as a subtype of Win32Exception, which is a type that indicates an occurrence of a low level error reported by the OS. Throwing such an exception when there is no low level error directly linked to the request is not what I would expect from a framework. This would look even weirder when ConnectCallback is overridden in a way that there are no sockets involved at all.

@stephentoub
Copy link
Member

stephentoub commented Jun 1, 2022

As I mentioned offline, I continue to think that this is working around a symptom rather than addressing the root of the problem.

The core problem stems from a change made in .NET 6. When a request arrives and there’s no connection immediately available in the pool and there’s no MaxConnectionsPerServer limit set that’s been reached, a new connection will be created. In .NET 5 and earlier, that connection was dedicated to this specific request and vice versa: this request would be made on this connection and only this connection once it was ready. The downside to this is, if it takes a while to establish that connection and another connection becomes available in the meantime, that request would continue waiting for the connection it spawned. The upside to this, however, is if the connection took a long time to be established (or took so long to be established that it timed out), the only request it would harm (again assuming no MaxConnectionsPerServer limit reached) is the one request with which it was associated. In .NET 6, this behavior was changed, improving the downside but harming the upside. When a request arrives and there’s no connection available (including one that's pending), a new one is still immediately spawned, but it’s no longer associated with only that request and that request is no longer associated only with it. If another connection becomes available in the meantime, that newly-available connection can service the request. That improves the aforementioned downside. But now we have this pending connection that’s been spawned, and so when the next request arrives, it’ll see there’s already a connection pending and will try to use that one rather than spawning another. If this connection will never become usable for whatever reason, any requests that associate with it will effectively stall as well, such that this one request can now take out multiple requests rather than just one. The discussed ConnectTimeout change is an attempt to minimize the impact of this by lessening the default time that connection might be pending and thus lessening the number of requests that associate with it, but there are other ways to address this. Note as well that a relatively recent .NET 7 change made it so that when the connection eventually fails, it only fails the request that originally spawned it; if that request has already moved on or completed or been canceled, the connection’s failure ends up being silent (other than logging).

For example, we could change the pooling to avoid considering a pending connection as one that can be used for an incoming request; if a new request arrives and there’s no immediately usable connection to service the request, create a new one. If we’re worried about a backlog of pending connections to a bad server building up, we can have some internal limit on how many such pending connections are allowed. (If there is a MaxConnectionsPerServer and it’s been reached, then we wait just as we always did). Instead or in addition, we could also lower the connection timeout once the connection is no longer associated with the original request, since it’s not going to fail anyone else anyway.

@CarnaViire
Copy link
Member Author

  1. @halter73 In the end, on team discussion we decided agains SocketException. The main point is what @antonfirsov said - that ConnectCallback can change the transport entirely, so SocketException would be unexpected.

  2. I agree with what @stephentoub says. We should try to overcome the problem we brought in our 6.0 connection pooling change. This won't be an easy change though. We might still end up going with the breaking change of 15s ConnectTimeout instead or in addition to it.

The ideas were:

  • Always spawn a new connection if none were available - within connection limits (in line with .NET 5 behavior) -- what Stephen said above. But this will not help folks that explicitly use MaxConnectionPerServer=1
  • "Deprioritize" pending connection after some time (stop counting it against request queue size). As with the one above, this will not help if MaxConnectionPerServer=1
  • With a shorter connect timeout, retry connection creation, e.g. 3 times, while it is still within the originating request's timeout.
  • Somehow apply originating request's timeout to connection creation
  • If the connection was not connected after originating request is completed, tear it down right away or after some short timeout
  • If we end up with timed out requests while the connection was pending, tear down such connection

@karelz karelz added the enhancement Product code improvement that does NOT require public API changes/additions label Jun 14, 2022
@antonfirsov antonfirsov self-assigned this Jun 20, 2022
@antonfirsov
Copy link
Member

To me it looks like a combination of the following two mitigations would bring back the upsides of the old 5.0 connection and request cancellation behavior without hurting the benefits introduced in 6.0, and without the need of introducing an arbitrary ConnectTimeout value or other intrusive/breaking changes:

  1. Spawn a new connection if none is available - within limits, as recommended in Stephen's comment.

This will not help the MaxConnectionPerServer=1 case, therefore also:

  1. Link the connection's cancellation token with the cancellation token of the request that initiated it.

In case of MaxConnectionPerServer == 1 this would practically result in a behavior that is identical to 5.0: there could be only one connection attempt, which is always associated with the first request in the queue. If the request is cancelled, the connection attempt is also cancelled, they are failing together.

In case of MaxConnectionPerServer > 1 this would help against the accumulation of too many pending connection attempts to badly behaving servers. Thanks to #62935, a connection failure/cancellation should never fail an unrelated request. In fact, a connection failure helps serving follow-up requests, since in most cases it will spawn a new connection attempt. This would be also true if the cancellation happens because of the cancellation of the original unrelated request, therefore I see only positive effects from this change.

Note that point 2. alone does not solve the case when an earlier request with a long timeout creates a pending faulty connection that is stalling a later requests with a short timeout. From this perspective, this is why I'm recommending to combine it with more aggressive connection spawning.

@stephentoub @dotnet/ncl I wonder if I'm missing something or if you have some concerns regarding this idea?

@stephentoub
Copy link
Member

therefore I see only positive effects from this change.

The downsides I see to that are:

  • If a request is made and then quickly canceled, we'll end up tearing down a connection that could be perfectly usable almost immediately. We had the same issue in .NET 5, but in theory the pooling changes for .NET 6 fixed that, whether it was actually an issue in real workloads or not.
  • If another connection handles a request that spawned a new connection, and that connection takes a long time to be established, it'll be holding on to the resources associated with the initial request's CancellationToken. That might not be a big deal, but it's something to consider.
  • If the initial request completes successfully due to another connection picking it up, and then after it completes successfully its token is canceled, that could again end up tearing down a connection that would have otherwise been usable soon thereafter.

It might be the right tradeoffs, of course. Just pointing out it's not without possible issue.

@antonfirsov
Copy link
Member

We can make the linking of the request and connection cancellation conditional, and only do it if we cannot spawn new "extra connections" anymore, eg. because we are approaching the pool limit. (Meaning it would be the default behavior with MaxConnectionPerServer == 1).

This would mean that we would be able to handle problematic cases and constrained pools without hurting the performance of the general, optimistic path.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 7, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 26, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Http enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
10 participants