3.9.2 server keep-alive timeout can cause disconnect of active connections and hence client server disconnect errors #9138
Comments
I think this is a race condition with HTTP, not something we can fix here. I can probably fix the server-side assert (though it disappears when I add print statements to try to debug it...), but fixing the client-side disconnection is likely impossible without decreasing performance. Essentially, the timeline looks like this (and this happens all the time on dodgy servers which don't support keepalive, but also don't send the Connection: close header):
As long as it's possible for a client to send a request while a server sends a close event in HTTP/TCP, I don't think there's any way to solve this properly. There will always be a window of time where conflicting messages can be sent in both directions. In 3.10 we improve this by auto-retrying requests on idempotent methods, which increases the reliability of the client, and is exactly what the HTTP RFC recommends.
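A client-side sketch of that kind of mitigation, for illustration (hedged: the helper name get_with_retry and the single-retry policy are assumptions, not a description of aiohttp 3.10's built-in behaviour; only idempotent requests should be retried this way):

```python
import aiohttp

async def get_with_retry(
    session: aiohttp.ClientSession, url: str, attempts: int = 2
) -> bytes:
    """Retry an idempotent GET if the keep-alive connection was closed under it."""
    for attempt in range(attempts):
        try:
            async with session.get(url) as resp:
                return await resp.read()
        except aiohttp.ServerDisconnectedError:
            # The server most likely closed an idle keep-alive connection just
            # as the request went out; GET is idempotent, so retrying is safe.
            if attempt == attempts - 1:
                raise
    raise RuntimeError("unreachable")
```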
OK, so server-side when checking the keepalive, it simply checks if … This resolves the server-side error, and should also slightly reduce the window for the client race condition to occur. However, I still see the client error quite easily, and still think that's unsolvable.
If you want to debug further yourself, basically, if data_received() in web_protocol.py is not called between the client sending the data and the keepalive callback happening, then the disconnection will happen. I believe that is called directly from asyncio, so if there are any other code problems that are increasing the race condition window, it's very likely they are in asyncio. But even if the code is perfect and executes instantly, there will always be network latency where this race condition can happen. If it takes 20ms for the client data to get to us over the network, then there's at least a 40ms window where we might close the connection before it arrives (or even before the client sends it).
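To make that window concrete, here is a toy asyncio protocol (an illustration only, not aiohttp's actual web_protocol.py; the port and the 1-second timeout are arbitrary) in which a keep-alive timer races against data_received():

```python
import asyncio

KEEPALIVE = 1.0  # seconds; deliberately short to make the race easy to hit

class ToyServerProtocol(asyncio.Protocol):
    """Closes idle connections on a timer, mimicking a keep-alive timeout."""

    def connection_made(self, transport: asyncio.Transport) -> None:
        self.transport = transport
        self._schedule_keepalive()

    def _schedule_keepalive(self) -> None:
        loop = asyncio.get_running_loop()
        self._ka_handle = loop.call_later(KEEPALIVE, self._keepalive_expired)

    def data_received(self, data: bytes) -> None:
        # If this runs before the timer fires, the connection survives.
        self._ka_handle.cancel()
        self.transport.write(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
        self._schedule_keepalive()

    def _keepalive_expired(self) -> None:
        # A request the client sent a few milliseconds ago may still be in
        # flight on the network; TCP gives us no way to know, so the server
        # closes anyway and the client later sees a "server disconnected" error.
        self.transport.close()

async def main() -> None:
    loop = asyncio.get_running_loop()
    server = await loop.create_server(ToyServerProtocol, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```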
So for us, we can't change the client, because it's the AWS ALB proxying HTTP/2 requests from external clients to aiohttp, which doesn't support HTTP/2. We believe this race condition ends up causing random 502 errors that are very hard to diagnose, as it leaves no logs on the server side. I'll dig some more today. Thanks for the pointers.
502 is caused by the connection between the proxy and the server, right? So, you'd presumably want the proxy to retry the connection in that case, or extend the keepalive timeout to reduce the frequency. Ultimately, there's no such thing as a closing handshake at the HTTP level; closing is done at the TCP level. Because of this, the only way I can see it being 'fixed' is to implement a closing handshake of some sort at the application level (but that would require similar support in the proxy...). Otherwise, you could reduce the frequency of issues by using a longer keepalive (so it happens less often) and reducing latency between the machines (to shrink the window in which the race condition can happen). But obviously these don't fix the problem, just reduce the likelihood. FYI, the relevant parts of the RFC that document this and recommend retrying connections are:
We can't change the ALB; that's an AWS service ;)
I'm not familiar with ALB, but the one thing I've just thought of, and how this probably works for most situations, is that the keepalive should be handled on their end. Then it would be near impossible for this situation to happen, as the keepalive would likely only kill a connection when it wasn't handling a request, so there would be no reason for the server to be sending any data when the close event arrives.
Yes, I think the fundamental problem is that the aiohttp server keepalive timeout needs to be greater than the ALB's configured timeout (after inspecting ours, it is not). I think a better default for aiohttp would probably be to not have a default keepalive timeout; that way the user is forced to pick one if a problem arises with connections not getting GC'd. However, I think it would be great if aiohttp could minimize the chances of the error as much as possible (always a good thing). Another thing I wanted to investigate is whether, instead of aiohttp immediately closing the transport, it could put the transport into a "closing" mode, where it rejects new incoming requests, and then after some time actually close it. I need to investigate other servers to see their behaviors. If it's common for this to happen then I'll just close this request. I'm imagining there must be some sort of standard for this.
Actually, thinking more about this, I think it really is just a configuration issue, as we're realizing. As you say, we can close the socket while a request is being made remotely (IOW, a race between creating the connection and noticing the socket got closed). I'm going to close this unless I discover something else. But please make sure that aiohttp doesn't itself have race conditions between servicing a request and closing the socket. Thanks for taking the time to investigate, appreciate it!
What's the default on ALB? The comment in the code suggests someone copied the default timeout from Nginx. I think it's reasonable to keep a default in case the app is exposed to clients directly. But, it's probably safe to increase it to something a lot higher, so it works better for proxies.
Connections, yes (we already do this in shutdown, potentially allowing for no dropped requests as you redeploy). Requests on an existing connection, no. As I mentioned, there's no such thing as a closing handshake or anything like it at the HTTP level. The connection gets closed at the TCP level, so there's no way at the HTTP level to know whether it's safe to send a request or not.
I'm still preparing a test for the server-side issue above, so will link it to this issue.
So with that you'd run into this issue fairly often. Someone previously set our server-side timeout to 1hr, which presents a race condition; I'm going to fix that on our side, hopefully resolving the issue.
Yeah, I was thinking an hour sounded about right. Maybe we change the default to 3630 or so...
nginx's default is 75s (https://www.baeldung.com/linux/nginx-timeouts). OK, now I want to create a sample nginx server with a 1s timeout to see if it reproduces as well.
OK, confirmed the same thing happens with nginx serving up the same file with a 1s keepalive timeout.
Another option, I suppose, is making keepalive_timeout a required param with lots of docs, so the user can make an informed decision instead of relying on a default that may or may not work for them, potentially creating hard-to-diagnose errors.
@Dreamsorcerer reporting back: updating our timeout to be greater than the ALB timeout fixed the issue. Thanks for taking the time to look at this!
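For anyone landing here with the same setup, a minimal sketch of that fix (hedged: it assumes web.run_app's keepalive_timeout parameter from recent aiohttp 3.x releases, and the 3600-second ALB value is illustrative; use whatever idle timeout your proxy is actually configured with):

```python
from aiohttp import web

# Illustrative values: stay strictly above the proxy's idle timeout so the
# proxy, not aiohttp, is the side that closes idle connections.
ALB_IDLE_TIMEOUT = 3600
SERVER_KEEPALIVE = ALB_IDLE_TIMEOUT + 30

async def handle(request: web.Request) -> web.Response:
    return web.Response(text="ok")

app = web.Application()
app.router.add_get("/", handle)

if __name__ == "__main__":
    # keepalive_timeout controls how long the aiohttp server keeps an idle
    # connection open before closing it.
    web.run_app(app, keepalive_timeout=SERVER_KEEPALIVE)
```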
Describe the bug
As mentioned in #2682, we had an issue with keep-alive timeouts. There appears to be a race condition between how often requests are made and the server keep-alive timeout. If they frequently overlap, the server will cause disconnect errors on the client.
To Reproduce
Running the following code in 3.9.2/3.10.5:
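A reproduction along these lines (a hedged sketch rather than the exact snippet: the 1-second keepalive_timeout, the port, and the request cadence are assumptions chosen so that each request races the server's keep-alive expiry):

```python
import asyncio
from aiohttp import ClientSession, web

async def handle(request: web.Request) -> web.Response:
    return web.Response(text="ok")

async def main() -> None:
    # Server with a very short keep-alive timeout.
    app = web.Application()
    app.router.add_get("/", handle)
    runner = web.AppRunner(app, keepalive_timeout=1.0)
    await runner.setup()
    site = web.TCPSite(runner, "127.0.0.1", 8080)
    await site.start()

    # Client that re-uses the connection on roughly the same interval, so the
    # next request and the server-side close can cross paths.
    async with ClientSession() as session:
        for _ in range(100):
            await asyncio.sleep(1.0)
            async with session.get("http://127.0.0.1:8080/") as resp:
                await resp.read()

    await runner.cleanup()

if __name__ == "__main__":
    asyncio.run(main())
```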
In 3.9.2 you'll get errors like:
Running with 3.10.5:
Note that if you swap to the aiohttp 3.10.5 client, I can't get it to reproduce. I think it's because I saw something in a PR about server disconnects... so I think that's just hiding the underlying problem that the server is erroneously disconnecting the client.
Expected behavior
There should not be any errors in either version.
Logs/tracebacks
Python Version
Python 3.11.9
aiohttp Version
multidict Version
yarl Version
OS
Darwin amohr-TYQNW6P9KQ.fbn.org 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000 arm64
Related component
Server, Client
Additional context
No response