Connection closed when request is attempted #272

Open
salex89 opened this issue May 14, 2024 · 8 comments

@salex89

salex89 commented May 14, 2024

This might be a re-hash of these and other similar issues:
#62
elixir-tesla/tesla#467

However, I do think something strange is happening, and several other people have commented on those issues recently with experiences that mirror mine. The issue is that certain requests fail with an {:error, "socket closed"} error without actually being executed at all, and not due to any real connectivity issue. As a side note, the {:error, "socket closed"} result is just the {:error, %Mint.TransportError{reason: :closed}} returned by Finch, wrapped by the Tesla adapter. But that's beside the point; it only suggests that the adapter is an unlikely culprit.

These issues were much more common before I set pool_max_idle_time to 60_000, down from the default :infinity, roughly ten times more common. Yet the error still happens. I set up telemetry to try to gather more data on the issue. After capturing a few such cases, this is the pattern I usually see (telemetry events forwarded to logs):

16:16:59.565 [info] Finch telemetry: reused_connection.
16:16:59.566 [info] Finch telemetry: http request sent, idle_time: 115862688122, method: DELETE, path: <redacted>
16:16:59.566 [info] Finch telemetry: receive: start, idle_time: 115862688122, method: DELETE path: <redacted>
16:17:00.749 [info] Finch telemetry: pool idle timeout, host: <redacted>, :pool_max_idle_time_exceeded.
16:17:00.750 [info] Finch telemetry: receive: stop, idle_time: 115862688122, method: DELETE path: <redacted>
16:17:00.750 <socket closed error here in my code>
...
<exponential backoff>
<retry succeeds with fresh connection with a sane idle_time>
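The forwarding itself is just a thin telemetry handler, roughly like this (a sketch rather than my exact code; the handler id is arbitrary, and the event names are my reading of the Finch.Telemetry docs, so treat them as assumptions):

```elixir
# Sketch of the telemetry-to-log forwarding; the handler id is arbitrary and the
# event names are taken from the Finch.Telemetry docs as I understand them.
require Logger

:telemetry.attach_many(
  "finch-debug-logger",
  [
    [:finch, :reused_connection],
    [:finch, :send, :stop],
    [:finch, :recv, :start],
    [:finch, :recv, :stop],
    [:finch, :pool_max_idle_time_exceeded]
  ],
  fn event, _measurements, metadata, _config ->
    Logger.info("Finch telemetry: #{inspect(event)} #{inspect(metadata)}")
  end,
  nil
)
```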

Please notice two things. First, the idle_time is always a very large number when this happens; taken at face value it would be the equivalent of 3.7 years, whereas successful requests have an expected idle_time of less than 60 seconds. Second, I believe there is always a Finch telemetry: pool idle timeout event immediately, about a millisecond, before the connection closes. I didn't have telemetry back when pool_max_idle_time was at its default (:infinity), but the number of such errors was greater then. It's hard for me to revert it to :infinity and test now, because this code is deployed to an environment where it's actively used and doing so could increase the error rate downstream. We also need some traffic to make it happen, but if it would help debugging, I could try to make it work.

If the pool shuts down/restarts because of the idle time, I don't think it should have done that: there was more than a second between the request and the pool closing itself, so it should have registered that it was still in use and stayed alive with its connections open. On the flip side, I don't have an explanation for the idle time either, as it was already a large number to begin with.

We see this a few times each day towards different hosts, maybe 2-6 events daily, on a mildly utilised system; we expect more traffic soon. We also used Hackney before migrating to Finch and didn't see similar issues, although looking back at it, I don't think Hackney did any retries by itself. We migrated to Finch because of the more advanced connection pool, but we had to build a lot of defensive logic around this issue that we didn't need before. Which is kind of fine, connections can be broken or unstable and that should be handled, but it still bothers me because these are not real connectivity issues in my opinion :-) . I would rather improve on this and keep Finch than migrate back to Hackney.
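The defensive logic amounts to a retry wrapper along these lines (a simplified sketch, not our production code; the backoff schedule and the request fun are placeholders):

```elixir
# Simplified sketch of the defensive retry we added; the backoff schedule and
# the request function are placeholders, not the real production code.
defmodule RetryOnClosed do
  def request(fun, backoffs \\ [100, 500, 2_000]) do
    case fun.() do
      {:error, "socket closed"} when backoffs != [] ->
        [delay | rest] = backoffs
        Process.sleep(delay)
        request(fun, rest)

      other ->
        other
    end
  end
end

# e.g. RetryOnClosed.request(fn -> Tesla.delete(client, "/some/path") end)
```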

Please advise what additional information I can send. I'll continue looking on my side, but I'm not familiar with Finch's source code yet.

Edit: I've just seen issue #269 and it really resembles my case, and we're indeed using OTP 26.1.2, but the OP there had issues with every connection to the affected host, while mine are haphazard.

@keathley
Collaborator

The idle times you're seeing there are based on System.monotonic_time, which is in native units. If those are converted to seconds you end up with about 115, a much more reasonable value, but still outside the max idle time that's been set.
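For example (assuming the native unit is nanoseconds, which is typical on Linux, so the exact figure can vary by system):

```elixir
# :native is usually nanoseconds on Linux, so the idle_time from the log above
# comes out to roughly 115 seconds rather than years.
System.convert_time_unit(115_862_688_122, :native, :millisecond)
#=> 115862
```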

What you're seeing here makes sense based on the logic of the HTTP/1 pool. A connection is checked out, Finch determines that it's passed the idle time that has been specified, executes the telemetry event, and closes the connection. I suspect you were seeing more of these errors in the past because connections were being closed by the server while Finch believed it still had an open connection when in fact it didn't. The max idle time helps avoid this because it causes Finch to close connections more often. But somewhere in the chain of events, information is being lost or not handled properly. This same thing has plagued people in the past, and as far as I know no one has been able to reliably reproduce it or get a packet capture or similar to see exactly what's going on.

@salex89
Author

salex89 commented May 21, 2024

The idle times you're seeing there are based on System.monotonic_time, which is in native units. If those are converted to seconds you end up with about 115, a much more reasonable value, but still outside the max idle time that's been set.

What you're seeing here makes sense based on the logic of the HTTP/1 pool. A connection is checked out, Finch determines that it's passed the idle time that has been specified, executes the telemetry event, and closes the connection. I suspect you were seeing more of these errors in the past because connections were being closed by the server while Finch believed it still had an open connection when in fact it didn't. The max idle time helps avoid this because it causes Finch to close connections more often. But somewhere in the chain of events, information is being lost or not handled properly. This same thing has plagued people in the past, and as far as I know no one has been able to reliably reproduce it or get a packet capture or similar to see exactly what's going on.

Thank you for your response and for explaining monotonic time. If I'm understanding things correctly, 115 seconds is more than the pool_max_idle_time, but that doesn't make any difference here, nor should it; it's actually less than the conn_max_idle_time, which is :infinity by default. If the pool has been relatively active, say requests were going out through it for several minutes, but that particular connection was not being checked out, its idle time may have grown quite long by the time the connection was asked for, and it was consequently closed. It seems that conn_max_idle_time should also be set to a lower duration, not necessarily equal to pool_max_idle_time (see the sketch below). Does that make sense?

However, the documentation states:

:conn_max_idle_time (timeout/0) - The maximum number of milliseconds an HTTP1 connection is allowed to be idle before being closed during a checkout attempt. The default value is :infinity.

What does that mean? Will Finch also return an error if the idle time is too long? I'll certainly look into the code and try it out; I'm just trying to piece together the desired behaviour. Got it, it's a NimblePool feature: if everything is correct, NimblePool will remove the idle connection and fetch a new one.
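So, if I follow correctly, something along these lines would cap both the per-connection and the per-pool idle time (an illustrative sketch only; the pool name and the numbers are placeholders):

```elixir
# Illustrative only; MyFinch and the timeouts are placeholders.
Finch.start_link(
  name: MyFinch,
  pools: %{
    default: [
      size: 25,
      # close an individual idle connection at checkout time
      conn_max_idle_time: 55_000,
      # terminate the whole pool after it has been unused for this long
      pool_max_idle_time: 60_000
    ]
  }
)
```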

Another question: can Finch actually detect that the connection is closed without attempting an HTTP request, moving to receive, and returning an error? Does it monitor connections, and can it just check out another connection in that case?

Edit: Also, it's still not clear to me why the pool is closed in this case. It obviously wasn't idle, since the connection was checked out more than a second before the pool was closed.

@keathley
Collaborator

Another question: can Finch actually detect that the connection is closed without attempting an HTTP request, moving to receive, and returning an error? Does it monitor connections, and can it just check out another connection in that case?

In theory, yes. When the connection is checked into the pool, the pool monitors incoming messages and TCP events and updates the sockets accordingly. What seems to be happening is something like this:

  • Finch establishes connection A with the server
  • Connection A sits idle
  • The server closes or half-closes its socket for connection A
  • Connection A is checked out of the pool
  • A request is sent through connection A, at which point it's determined that the socket has been closed and an error is returned.

It's never been entirely clear how this can happen. There's a long issue where lots of folks tried to work out the root cause, but it was never really found: #62.
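For illustration, this is roughly how the owner of an idle connection is expected to notice a server-side close when driving Mint directly (a sketch, not the actual Finch pool code):

```elixir
# Sketch using Mint directly, not the actual Finch pool code. While a connection
# sits idle, a server-side close shows up as a :tcp_closed / :ssl_closed message;
# Mint.HTTP.stream/2 turns that into a transport error, at which point the
# connection should be discarded instead of being handed out again.
defmodule IdleConnWatcher do
  def watch(conn) do
    receive do
      message ->
        case Mint.HTTP.stream(conn, message) do
          {:error, _conn, %Mint.TransportError{reason: :closed}, _responses} ->
            :discard

          {:ok, conn, _responses} ->
            watch(conn)

          :unknown ->
            watch(conn)
        end
    end
  end
end
```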

@keathley
Collaborator

Just to add a bit more: @salex89 if there's any way you can get a packet capture of this happening, that would go a long way towards determining the root of the issue.

@salex89
Author

salex89 commented May 23, 2024

Just to add a bit more: @salex89 if there's any way you can get a packet capture of this happening, that would go a long way towards determining the root of the issue.

I talked to our ops team and we should be able to do that, at least in our staging environment, which should be enough since we see these events there as well. Do you have any suggestion on which capture tool to use? Maybe tshark, tcpdump, or something else?

@keathley
Collaborator

@salex89 awesome! Either tshark or tcpdump is what I would use, so I don't think you'll go wrong with either. Thanks.

@adriencohen

adriencohen commented Jun 3, 2024

I had a similar issue. I'm not sure if this is exactly the same thing, but here is what happened:

  1. Our application starts.
  2. We made multiple requests to an external API using Req; everything was fine.
  3. We stopped making requests for quite a while.
  4. When we tried to request the external API again, the first request failed with %Mint.TransportError{reason: :closed}, but the subsequent ones were successful.

What happened was:

  • pool_max_idle_time was :infinity
  • We are on AWS and the NAT configuration is set to drop any idle connection after 350s
  • I guess the first request fails due to the socket being closed; the second one creates a new pool and it works again
  • After switching to a pool_max_idle_time just under the 350s, all requests complete correctly

Looks like it's possible to detect this behavior, since AWS says that an RST packet is sent before closing the connection.
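Concretely, the switch looks something like this (a sketch assuming Req's :finch option to point at a named pool; MyApp.Finch, the URL, and the 300_000 ms value are placeholders, chosen to sit comfortably under the 350s NAT timeout):

```elixir
# Sketch: a Finch pool whose idle timeout stays under the NAT gateway's 350 s
# limit, then pointed at from Req. MyApp.Finch and 300_000 ms are placeholders.
{:ok, _pid} =
  Finch.start_link(
    name: MyApp.Finch,
    pools: %{
      default: [pool_max_idle_time: 300_000]
    }
  )

Req.get!("https://api.example.com/health", finch: MyApp.Finch)
```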

@salex89
Author

salex89 commented Jun 14, 2024

I had a similar issue. I'm not sure if this is exactly the same thing, but here is what happened:

  1. Our application starts.
  2. We made multiple requests to an external API using Req; everything was fine.
  3. We stopped making requests for quite a while.
  4. When we tried to request the external API again, the first request failed with %Mint.TransportError{reason: :closed}, but the subsequent ones were successful.

What happened was:

  • pool_max_idle_time was :infinity
  • We are on AWS and the NAT configuration is set to drop any idle connection after 350s
  • I guess the first request fails due to the socket being closed; the second one creates a new pool and it works again
  • After switching to a pool_max_idle_time just under the 350s, all requests complete correctly

Looks like it's possible to detect this behavior since AWS says that a RST packet is sent before closing the connection.

Our setup is also behind an AWS NAT, but as I said at the start, pool_max_idle_time is not completely fixing the issue for us. I promised to set up a packet capture; I just haven't had the time to do it yet, something came up, but I'll look into it soon.
