3.9.2 server keep-alive timeout can cause disconnect of active connections and hence client server disconnect errors #9138
Comments
I think this is a race condition with HTTP, not something we can fix here. I can probably fix the server-side assert (though it disappears when I add print statements to try to debug it...), but fixing the client-side disconnection is likely impossible without decreasing performance. Essentially, the timeline looks like this (and this happens all the time on dodgy servers which don't support keepalive, but also don't send the Connection: close header):
As long as it's possible for a client to send a request while a server sends a close event in HTTP/TCP, I don't think there's any way to solve this properly. There will always be a window of time where conflicting messages can be sent in both directions. In 3.10 we improve this by auto-retrying requests on idempotent methods, which increases the reliability of the client, and is exactly what the HTTP RFC recommends.
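A client-side sketch of that kind of mitigation, for illustration (hedged: the helper name get_with_retry and the single-retry policy are assumptions, not a description of aiohttp 3.10's built-in behaviour; only idempotent requests should be retried this way):

```python
import aiohttp

async def get_with_retry(
    session: aiohttp.ClientSession, url: str, attempts: int = 2
) -> bytes:
    """Retry an idempotent GET if the keep-alive connection was closed under it."""
    for attempt in range(attempts):
        try:
            async with session.get(url) as resp:
                return await resp.read()
        except aiohttp.ServerDisconnectedError:
            # The server most likely closed an idle keep-alive connection just
            # as the request went out; GET is idempotent, so retrying is safe.
            if attempt == attempts - 1:
                raise
    raise RuntimeError("unreachable")
```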
OK, so server-side when checking the keepalive, it simply checks if … This resolves the server-side error, and should also slightly reduce the window for the client race condition to occur. However, I still see the client error quite easily, and still think that's unsolvable.
If you want to debug further yourself, basically, if data_received() in web_protocol.py is not called between the client sending the data and the keepalive callback happening, then the disconnection will happen. I believe that is called directly from asyncio, so if there are any other code problems that are increasing the race condition window, it's very likely they are in asyncio. But even if the code is perfect and executes instantly, there will always be network latency where this race condition can happen. If it takes 20ms for the client data to get to us over the network, then there's at least a 40ms window where we might close the connection before it arrives (or even before the client sends it).
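To make that window concrete, here is a toy asyncio protocol (an illustration only, not aiohttp's actual web_protocol.py; the port and the 1-second timeout are arbitrary) in which a keep-alive timer races against data_received():

```python
import asyncio

KEEPALIVE = 1.0  # seconds; deliberately short to make the race easy to hit

class ToyServerProtocol(asyncio.Protocol):
    """Closes idle connections on a timer, mimicking a keep-alive timeout."""

    def connection_made(self, transport: asyncio.Transport) -> None:
        self.transport = transport
        self._schedule_keepalive()

    def _schedule_keepalive(self) -> None:
        loop = asyncio.get_running_loop()
        self._ka_handle = loop.call_later(KEEPALIVE, self._keepalive_expired)

    def data_received(self, data: bytes) -> None:
        # If this runs before the timer fires, the connection survives.
        self._ka_handle.cancel()
        self.transport.write(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
        self._schedule_keepalive()

    def _keepalive_expired(self) -> None:
        # A request the client sent a few milliseconds ago may still be in
        # flight on the network; TCP gives us no way to know, so the server
        # closes anyway and the client later sees a "server disconnected" error.
        self.transport.close()

async def main() -> None:
    loop = asyncio.get_running_loop()
    server = await loop.create_server(ToyServerProtocol, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```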
So for us, we can't change the client, because it's the AWS ALB proxying HTTP/2 requests from external clients to aiohttp, which doesn't support HTTP/2. We believe this race condition ends up causing random 502 errors that are very hard to diagnose, as it leaves no logs on the server side. I'll dig some more today. Thanks for the pointers.
502 is caused by the connection between the proxy and the server, right? So, you'd presumably want the proxy to retry the connection in that case, or extend the keepalive timeout to reduce the frequency. Ultimately, there's no such thing as a closing handshake at the HTTP level; closing is done at the TCP level. Because of this, the only way I can see it being 'fixed' is to implement a closing handshake of some sort at the application level (but that would require similar support in the proxy...). Otherwise, you could reduce the frequency of issues by using a longer keepalive (so it happens less often) and reducing latency between the machines (to shrink the window in which the race condition can happen). But obviously these don't fix the problem, just reduce the likelihood. FYI, the relevant parts of the RFC that document this and recommend retrying connections are:
We can't change the ALB; that's an AWS service ;)
I'm not familiar with ALB, but the one thing I've just thought of, and how this probably works for most situations, is that the keepalive should be handled on their end. Then it would be near impossible for this situation to happen, as the keepalive would likely only kill a connection when it wasn't handling a request, so there would be no reason for the server to be sending any data when the close event arrives.
Yes, I think the fundamental problem is that the aiohttp server keepalive timeout needs to be greater than the ALB's configured timeout (after inspecting ours, it is not). I think a better default for aiohttp would probably be to not have a default keepalive timeout; that way the user is forced to pick one if a problem arises with connections not getting GC'd. However, I think it would be great if aiohttp could minimize the chances of the error as much as possible (always a good thing). Another thing I wanted to investigate is whether, instead of aiohttp immediately closing the transport, it could put the transport into a "closing" mode, where it rejects new incoming requests, and then after some time actually close it. I need to investigate other servers to see their behaviors. If it's common for this to happen then I'll just close this request. I'm imagining there must be some sort of standard for this.
Actually, thinking more about this, I think it really is just a configuration issue, as we're realizing. As you say, we can close the socket while a request is being made remotely (IOW, a race between creating the connection and noticing the socket got closed). I'm going to close this unless I discover something else. But please make sure that aiohttp doesn't itself have race conditions between servicing a request and closing the socket. Thanks for taking the time to investigate, appreciate it!
What's the default on ALB? The comment in the code suggests someone copied the default timeout from Nginx. I think it's reasonable to keep a default in case the app is exposed to clients directly. But, it's probably safe to increase it to something a lot higher, so it works better for proxies.
Connections, yes (we already do this in shutdown, potentially allowing for no dropped requests as you redeploy). Requests on an existing connection, no. As I mentioned, there's no such thing as a closing handshake or anything like it at the HTTP level. The connection gets closed at the TCP level, so there's no way at the HTTP level to know whether it's safe to send a request or not.
I'm still preparing a test for the server-side issue above, so will link it to this issue.
So with that you'd run into this issue fairly often. Someone previously set our server-side timeout to 1hr, which presents a race condition; I'm going to fix that on our side, hopefully resolving the issue.
Yeah, I was thinking an hour sounded about right. Maybe we change the default to 3630 or so...
nginx's default is 75s (https://www.baeldung.com/linux/nginx-timeouts). OK, now I want to create a sample nginx server with a 1s timeout to see if it reproduces as well.
OK, confirmed the same thing happens with nginx serving up the same file with a 1s keepalive timeout.
Another option, I suppose, is making keepalive_timeout a required param with lots of docs, so the user can make an informed decision instead of relying on a default that may or may not work for them, potentially creating hard-to-diagnose errors.
@Dreamsorcerer reporting back: updating our timeout to be greater than the ALB timeout fixed the issue. Thanks for taking the time to look at this!
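For anyone landing here with the same setup, a minimal sketch of that fix (hedged: it assumes web.run_app's keepalive_timeout parameter from recent aiohttp 3.x releases, and the 3600-second ALB value is illustrative; use whatever idle timeout your proxy is actually configured with):

```python
from aiohttp import web

# Illustrative values: stay strictly above the proxy's idle timeout so the
# proxy, not aiohttp, is the side that closes idle connections.
ALB_IDLE_TIMEOUT = 3600
SERVER_KEEPALIVE = ALB_IDLE_TIMEOUT + 30

async def handle(request: web.Request) -> web.Response:
    return web.Response(text="ok")

app = web.Application()
app.router.add_get("/", handle)

if __name__ == "__main__":
    # keepalive_timeout controls how long the aiohttp server keeps an idle
    # connection open before closing it.
    web.run_app(app, keepalive_timeout=SERVER_KEEPALIVE)
```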
Describe the bug
As mentioned in #2682, we had an issue with keep-alive timeouts. There appears to be a race condition between how often requests are made and the server keep-alive timeout. If they frequently overlap, the server will cause disconnect errors on the client.
To Reproduce
Running the following code in 3.9.2/3.10.5:
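A reproduction along these lines (a hedged sketch rather than the exact snippet: the 1-second keepalive_timeout, the port, and the request cadence are assumptions chosen so that each request races the server's keep-alive expiry):

```python
import asyncio
from aiohttp import ClientSession, web

async def handle(request: web.Request) -> web.Response:
    return web.Response(text="ok")

async def main() -> None:
    # Server with a very short keep-alive timeout.
    app = web.Application()
    app.router.add_get("/", handle)
    runner = web.AppRunner(app, keepalive_timeout=1.0)
    await runner.setup()
    site = web.TCPSite(runner, "127.0.0.1", 8080)
    await site.start()

    # Client that re-uses the connection on roughly the same interval, so the
    # next request and the server-side close can cross paths.
    async with ClientSession() as session:
        for _ in range(100):
            await asyncio.sleep(1.0)
            async with session.get("http://127.0.0.1:8080/") as resp:
                await resp.read()

    await runner.cleanup()

if __name__ == "__main__":
    asyncio.run(main())
```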
In 3.9.2 you'll get errors like:
Running with 3.10.5:
Note that if you swap to the aiohttp 3.10.5 client, I can't get it to reproduce. I think it's because I saw something in a PR about server disconnects... so I think that's just hiding the underlying problem that the server is erroneously disconnecting the client.
Expected behavior
There should not be any errors in either version.
Logs/tracebacks
Python Version
Python 3.11.9
aiohttp Version
multidict Version
yarl Version
OS
Darwin amohr-TYQNW6P9KQ.fbn.org 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
Related component
Server, Client
Additional context
No response