Change SocketsHttpHandler's default ConnectTimeout to a reasonable finite value #66297
Tagging subscribers to this area: @dotnet/ncl

Issue details: Windows will time out a connection attempt after ~21s. Linux apparently waits much longer, around 3 minutes or more. The connection pooling changes in 6.0 made it so that a pending connection is not associated with a specific request. As such, request cancellation no longer causes pending connections to be canceled. We do not impose any connect timeout by default (DefaultConnectTimeout = Timeout.InfiniteTimeSpan), so we are now exposed to Linux's long connection timeout. With small request timeouts, this may lead to a potentially large number of requests timing out as they try to wait for or reuse the same "pending" connection. We should set a default ConnectTimeout to something reasonably small, and we should also consider backporting that to 6.0. I propose to unify the Windows and Linux timeouts, i.e. set DefaultConnectTimeout to 21s.
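For orientation, this is the knob in question; a minimal sketch of opting into a finite connect timeout today (the 21-second value mirrors the proposal; it is not the current default):

```csharp
using System;
using System.Net.Http;

// Opting into a finite connect timeout explicitly; today's default is
// Timeout.InfiniteTimeSpan, and the proposal is to make a value like this the default.
var handler = new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(21)
};
using var client = new HttpClient(handler);
```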
This seems reasonable. Technically it's a breaking change, but if a connection is taking that long to be established, it's likely going to fail anyway. There's also value in consistency across platforms. Is the exception that occurs in the case of a timeout due to ConnectTimeout the same as the one that occurs when the OS times out a connection? |
Unfortunately, no. If the OS did the timeout, the error would bubble up from Sockets, so we would see a SocketException-based error; if it was our timeout, we would see a TaskCanceledException with an inner TimeoutException. |
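A hedged illustration of the two failure shapes being contrasted here, based on the pattern described later in this thread (exact messages and inner exceptions vary by platform and version):

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading.Tasks;

static async Task ProbeAsync(HttpClient client, string url)
{
    try
    {
        using var response = await client.GetAsync(url);
    }
    catch (HttpRequestException ex) when (ex.InnerException is SocketException)
    {
        // Connect failure reported by the OS/Sockets layer (e.g. the ~21s Windows
        // or ~3min Linux TCP timeout) bubbles up wrapped in HttpRequestException.
    }
    catch (TaskCanceledException ex) when (ex.InnerException is TimeoutException)
    {
        // Handler-imposed timeout (ConnectTimeout or HttpClient.Timeout) surfaces
        // as cancellation carrying an inner TimeoutException.
    }
}
```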
Perhaps we should use 15s. That means that typically we'd do the timeout ourselves on both platforms. It's also a reasonably round number as compared to 21s. |
Since we try connecting to all IPs returned by DNS (currently serially), would changing the default timeout apply to the whole sequence of attempts rather than per IP? |
You are right 😢 and we currently don't have any per-IP timeout for multi-connect in Sockets (only a cancellation token for the whole operation) -- should we consider some new API like that? |
Only in the case of Windows. In the case of Linux, we're not losing anything, since the OS timeout ends up being ~2 minutes. The user can always change it back to infinite if this proves problematic. They can also implement their own connect logic. Eventually we could do something like happy eyeballs and try to connect to multiple IPs in parallel (related #26177). |
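A minimal sketch of the "implement their own connect logic" route via SocketsHttpHandler.ConnectCallback, assuming a caller-chosen 10-second budget; note that it still covers the whole multi-address connect, not each IP individually, which is exactly the per-IP limitation discussed above:

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;

var handler = new SocketsHttpHandler
{
    // Caller-provided connect logic with its own timeout; 10s is an arbitrary example.
    ConnectCallback = async (context, cancellationToken) =>
    {
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(TimeSpan.FromSeconds(10)); // applies to the whole multi-IP attempt

        try
        {
            await socket.ConnectAsync(context.DnsEndPoint, cts.Token);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};
```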
Even with eyeballs, you can have multiple addresses for the same family, and only some of them may be unreachable or unresponsive. But I would think that in most cases callers care about overall responsiveness rather than the details of each individual address. |
By "something like happy eyeballs" I meant trying the parallel connect, even to multiple addresses of the same family at the same time. This is just an idea so please do not crucify me on all the details and pitfalls. I'm aware this wouldn't be as easy as running n connects in n tasks and seeing which finishes first. It would need to be carefully designed to not to drag down perf, hog ephemeral ports etc. |
Linking the existing issue with an API proposal for Socket's happy-eyeballs-like parallel connect: #33418 |
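For illustration only, a deliberately naive version of the "race the addresses" idea; ConnectAnyAsync is a hypothetical helper, and a real design would stagger attempts, dispose losers, and handle the all-failed case, as the comments above warn:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Naive "connect to everything and take the first" sketch; hypothetical helper.
static async Task<Socket> ConnectAnyAsync(IPAddress[] addresses, int port)
{
    Task<Socket>[] attempts = addresses.Select(async address =>
    {
        var socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        await socket.ConnectAsync(new IPEndPoint(address, port));
        return socket;
    }).ToArray();

    // Task.WhenAny returns the first *completed* attempt, which may have faulted;
    // a real implementation would keep waiting on failure, dispose losing sockets,
    // and stagger the attempts (true Happy Eyeballs) to avoid hogging ephemeral ports.
    Task<Socket> first = await Task.WhenAny(attempts);
    return await first;
}
```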
Triage: change |
The additional trace should help troubleshoot hung "pending" connection issues (i.e. the reasons why new connections were not created) in HttpConnectionPool. The issue manifests as a growing number of timed-out requests. It is not easy to link a timed-out request (especially one with a small timeout) to a connection that was created 3 minutes ago but is still not connected. There are no traces of the state of HttpConnectionPool, e.g. the number of pending connections, so it is unclear from existing traces why the pool decided not to create a connection. Related to #66297
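For anyone troubleshooting this, a small sketch of how such traces can be observed in-process via the "System.Net.Http" EventSource (the listener class name is illustrative):

```csharp
using System;
using System.Diagnostics.Tracing;

// Minimal in-process listener for the "System.Net.Http" EventSource, where the
// HttpConnectionPool/connection telemetry is emitted.
sealed class HttpTelemetryListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "System.Net.Http")
        {
            EnableEvents(eventSource, EventLevel.Verbose);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload is null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventName}: {payload}");
    }
}
```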
I've put up a PR to revert this change: #68649. The change proved to be problematic as-is, because:
I'm reopening the issue for further discussion. |
Triage: we should address the issue with long OS timeouts in 7.0 in one way or another (retaining the old exception type, tuning OS timeouts, reworking pooling logic in some way). |
As a |
Triage: Let's separate 2 parts here:
We already follow the pattern "TaskCanceledException with inner TimeoutException" in both places: for the overall request timeout in HttpClient.cs (runtime/src/libraries/System.Net.Http/src/System/Net/Http/HttpClient.cs, lines 607 to 612 at 215b39a), and for the connect timeout in HttpConnectionPool.cs (runtime/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs, lines 435 to 443 at 975aa54).
There was a huge discussion in #21965 on what to do on timeouts, which ended up introducing this specific pattern. In general, we didn't want to change the exception type, as historically TCE was thrown; we just added a way to programmatically check whether it was a timeout (by putting a TimeoutException inside). Re mimicking SocketException: it is a bit funny, given that we make sure to throw OCE and not SocketException on cancellation in Sockets 😄 (runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs, lines 120 to 124 at 975aa54).
But there's no way at the Socket level to indicate that the token signifies a timeout. I'm also not sure it would be OK to break the existing behavior of ConnectTimeout; e.g., someone might depend on a TCE being thrown for it. |
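A compressed sketch of that pattern, not the runtime's actual code: a linked token enforces the timeout and, when it fires, the cancellation is rethrown as a TaskCanceledException carrying an inner TimeoutException so callers can distinguish it from user cancellation:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative helper, not the runtime's implementation.
static async Task<T> WithTimeoutAsync<T>(
    Func<CancellationToken, Task<T>> operation, TimeSpan timeout, CancellationToken callerToken)
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
    cts.CancelAfter(timeout);
    try
    {
        return await operation(cts.Token);
    }
    catch (OperationCanceledException oce) when (!callerToken.IsCancellationRequested)
    {
        // The timeout fired rather than the caller cancelling: keep the historical
        // TaskCanceledException type, but embed a TimeoutException so callers can
        // detect the timeout programmatically.
        throw new TaskCanceledException(oce.Message, new TimeoutException("The operation timed out."));
    }
}
```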
We discussed it within the team, and we don't see a solution that would not be breaking in one way or another.
Re following the pattern of TCE with inner TimeoutException |
I vote for simulating a SocketException.
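Assuming the proposal is to make the handler-imposed connect timeout look like the OS-level one, a hypothetical illustration of the exception shape callers would then observe (CreateSimulatedConnectTimeout is not a real API):

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;

// Hypothetical illustration only: the exception shape of a "simulated" OS-style
// connect timeout, as opposed to today's TaskCanceledException-based shape.
static Exception CreateSimulatedConnectTimeout(TimeSpan connectTimeout)
{
    var inner = new SocketException((int)SocketError.TimedOut);
    return new HttpRequestException(
        $"The connect attempt exceeded the ConnectTimeout of {connectTimeout}.", inner);
}
```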
If the |
As I mentioned offline, I continue to think that this is working around a symptom rather than addressing the root of the problem. The core problem stems from a change made in .NET 6.

When a request arrives and there's no connection immediately available in the pool and no MaxConnectionsPerServer limit has been reached, a new connection will be created. In .NET 5 and earlier, that connection was dedicated to this specific request and vice versa: this request would be made on this connection and only this connection once it was ready. The downside to this is that if it took a while to establish that connection and another connection became available in the meantime, that request would continue waiting for the connection it spawned. The upside, however, is that if the connection took a long time to be established (or took so long that it timed out), the only request it would harm (again assuming no MaxConnectionsPerServer limit reached) is the one request with which it was associated.

In .NET 6, this behavior was changed, improving the downside but harming the upside. When a request arrives and there's no connection available (including one that's pending), a new one is still immediately spawned, but it's no longer associated with only that request, and that request is no longer associated only with it. If another connection becomes available in the meantime, that newly available connection can service the request. That improves the aforementioned downside. But now we have this pending connection that's been spawned, so when the next request arrives, it'll see there's already a connection pending and will try to use that one rather than spawning another. If this connection never becomes usable for whatever reason, any requests that associate with it will effectively stall as well, such that this one bad connection can now take out multiple requests rather than just one. The discussed ConnectTimeout change is an attempt to minimize the impact of this by lessening the default time that connection might be pending, and thus lessening the number of requests that associate with it, but there are other ways to address this. Note as well that a relatively recent .NET 7 change made it so that when the connection eventually fails, it only fails the request that originally spawned it; if that request has already moved on or completed or been canceled, the connection's failure ends up being silent (other than logging).

For example, we could change the pooling to avoid considering a pending connection as one that can be used for an incoming request: if a new request arrives and there's no immediately usable connection to service it, create a new one. If we're worried about a backlog of pending connections to a bad server building up, we can have some internal limit on how many such pending connections are allowed. (If there is a MaxConnectionsPerServer limit and it's been reached, then we wait just as we always did.) Instead, or in addition, we could also lower the connection timeout once the connection is no longer associated with the original request, since it's not going to fail anyone else anyway. |
The ideas were:
To me it looks like a combination of the following two mitigations would bring back the upsides of the old 5.0 connection and request cancellation behavior without hurting the benefits introduced in 6.0, and without the need to introduce an arbitrary default ConnectTimeout: |
This will not help the MaxConnectionsPerServer=1 case, therefore also:
Note that point 2 alone does not solve the case when an earlier request with a long timeout creates a pending faulty connection that stalls later requests with a short timeout. This is why I'm recommending combining it with more aggressive connection spawning. @stephentoub @dotnet/ncl, I wonder if I'm missing something or if you have concerns regarding this idea? |
The downsides I see to that are:
It might be the right tradeoff, of course. Just pointing out that it's not without possible issues. |
We can make the linking of the request and connection cancellation conditional, and only do it if we cannot spawn new "extra connections" anymore, e.g. because we are approaching the pool limit (meaning it would be the default behavior in constrained-pool cases such as MaxConnectionsPerServer=1). This would mean that we would be able to handle problematic cases and constrained pools without hurting the performance of the general, optimistic path. |