AsyncConnectionPool sometimes hangs future connections when a connection gets canceled. #830
Comments
Hi @fusiyuan2010! Can you give a minimal example of how we can reproduce this issue?
@karpetrosyan Unfortunately, it's hard to give a working example that easily reproduces the issue, since the probability of it happening is low. I can describe the scenario in which it can happen: I have a typical web server serving requests. In the request handler, I use a global AsyncConnectionPool object, shared across all requests, to call another service. The current request handler can fail while making the HTTP call to the upstream service, for reasons unrelated to the HTTP call itself, for example:
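A minimal sketch of that scenario (a hypothetical reconstruction; the snippet from the original report was not preserved, and the exact task wiring is an assumption based on the description below):

```python
import asyncio
import httpcore

# Shared across all request handlers, as described above.
pool = httpcore.AsyncConnectionPool()

async def task_http_call() -> None:
    # The upstream call goes through the shared pool.
    await pool.request("GET", "https://upstream.example.com/")

async def main_task() -> None:
    # The request handler runs the upstream call as a child task.
    http_call = asyncio.ensure_future(task_http_call())
    try:
        # ... other request handling that can raise for reasons
        # unrelated to the HTTP call itself ...
        await http_call
    except Exception:
        # The handler's failure cancels the in-flight HTTP call,
        # which can interrupt handle_async_request() mid-flight.
        http_call.cancel()
        raise
```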
If `main_task` throws an exception at the exact moment `handle_async_request` is executing `self._close_expired_connections()`, `task_http_call` is canceled with the current request left in the pool's `_requests` list.
We currently have issues with our async cancellation support; also, catching every await and handling cancellation to avoid breaking the client state does not make sense to me.
Hey @tomchristie, is there any way we could prioritize this? It blocks adoption for us. I'd be happy to fund/sponsor the work, either with funds to Encode or with a contribution from someone on our team.
Looks like we overlooked this problem. I've got a couple of pull requests that fix various situations where the connection pool breaks after cancellation, and I've also started a discussion on this subject (see #844). The problem is really complex; to overcome it, we must either check every await and ensure that it will not break the connection pool state, or we must resolve it in another layer, for example with a shielded scope in the connection pool itself.

I'll go over this problem again, and if we can't handle all of the scenarios, I'll submit a pull request that uses shield everywhere in the connection pool layer while disabling it for network activities.
Yes, absolutely.

A helpful way to resource this would be to help us reliably replicate the issue you're seeing. Are you able to put together a minimal example that demonstrates the case?
Thank you so much, Tom & Kar!

Sadly not – the OpenAI team that saw this hit it on a very low percentage (but high absolute number) of requests, and is unfortunately already porting the relevant code to aiohttp (my hope here is to stem the bleeding). I've asked, and they don't have time right now to try to devise a minimal repro, which is a shame. Folks on our team would be starting from the same place as you (well, without the knowledge of httpcore internals that you have). I really wish I could help here, as I know quite well how devilish it is to debug things like this without a consistent repro script.
I don't think it's possible to reliably reproduce issues where an async cancellation at one specific point breaks the application state; what we can do instead is point to the exact line at which a cancellation must land to trigger it (see the sketch below for one way to force this).
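As an illustration of that idea, a hypothetical test harness that forces a cancellation at a chosen await point by wrapping one of the internal methods named in this thread (the method name and the expected hang are assumptions based on the report, not verified behavior):

```python
import asyncio

def cancel_at(obj, method_name: str):
    # Patch the named coroutine method so the current task is
    # cancelled the next time it suspends, i.e. at this await point.
    original = getattr(obj, method_name)

    async def wrapper(*args, **kwargs):
        asyncio.current_task().cancel()
        # The pending cancellation is raised at the next suspension,
        # which happens inside this await.
        return await original(*args, **kwargs)

    setattr(obj, method_name, wrapper)

# Usage sketch: cancel exactly between the status append and the
# connection acquisition, then observe whether later requests hang.
#   cancel_at(pool, "_attempt_to_acquire_connection")
```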
@rattrayalex It looks clear to me how to resolve this (everywhere we hold the pool lock, we also shield against cancellation). I'll get a branch up with that in place.
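A rough sketch of what that approach could look like (illustrative only, not the actual patch; `pool_lock`, `requests`, and the function name stand in for the pool's internal state):

```python
import anyio

async def remove_request_status(pool_lock: anyio.Lock, requests: list, status) -> None:
    # Shield the critical section: a cancellation arriving while we
    # hold the pool lock can no longer interrupt the bookkeeping.
    with anyio.CancelScope(shield=True):
        async with pool_lock:
            # This always runs to completion, even if the surrounding
            # task has already been cancelled.
            if status in requests:
                requests.remove(status)
```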
That's great to hear, thank you @tomchristie!
In `async def handle_async_request(self, request: Request) -> Response:` there is a window where the coroutine can get canceled after `self._requests.append(status)` but before `await self._attempt_to_acquire_connection(status)`. Since these few lines are neither protected against cancellation nor cleaned up on exception, a canceled request can leave its status in `_requests` forever, blocking every future `connection = await status.wait_for_connection(timeout=timeout)`, which comes a few lines later.
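For context, a simplified paraphrase of the affected section, reconstructed from the identifiers quoted above (not the verbatim httpcore source; `RequestStatus` and `self._pool_lock` are assumptions, and the timeout lookup is elided):

```python
async def handle_async_request(self, request):
    status = RequestStatus(request)  # per-request bookkeeping object

    async with self._pool_lock:
        self._requests.append(status)
        await self._close_expired_connections()
        # A cancellation landing between the append above and this
        # await strands `status` in self._requests.
        await self._attempt_to_acquire_connection(status)

    # Once the bookkeeping is corrupted, later requests can block
    # here indefinitely.
    connection = await status.wait_for_connection(timeout=timeout)
```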