
AsyncConnectionPool hangs future connections sometimes when a connection gets canceled. #830

Closed
fusiyuan2010 opened this issue Oct 24, 2023 · 10 comments · Fixed by #880

@fusiyuan2010

In `async def handle_async_request(self, request: Request) -> Response:` there are these lines:

```python
async with self._pool_lock:
    self._requests.append(status)
    await self._close_expired_connections()
    await self._attempt_to_acquire_connection(status)
```

There is a chance that the coroutine gets canceled after `self._requests.append(status)` but before `await self._attempt_to_acquire_connection(status)` completes. Since these lines are neither protected against cancellation nor cleaned up on exception, a canceled request can leave its status in `_requests` forever, blocking every future `connection = await status.wait_for_connection(timeout=timeout)` a few lines below.
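
To make the failure concrete, here is a self-contained toy (a hypothetical `ToyPool`, not httpcore's real class) in which a cancellation delivered on the await between the two bookkeeping steps strands the queued entry:

```python
import asyncio

class ToyPool:
    def __init__(self):
        self.requests = []  # queued request-status entries

    async def handle(self, status):
        self.requests.append(status)
        await asyncio.sleep(0.1)      # cancellation can land on this await...
        self.requests.remove(status)  # ...so this cleanup never runs

async def main():
    pool = ToyPool()
    task = asyncio.create_task(pool.handle(object()))
    await asyncio.sleep(0.01)  # let the task append its status entry
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    print("stale entries:", len(pool.requests))  # prints: stale entries: 1

asyncio.run(main())
```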

fusiyuan2010 changed the title AsyncConnectionPool hangs future connections sometimes when a connection gets cancelled. → AsyncConnectionPool hangs future connections sometimes when a connection gets canceled. (Oct 24, 2023)
@karpetrosyan (Member)

Hi @fusiyuan2010!

Can you give a minimum example of how we can reproduce this issue?

@fusiyuan2010 (Author) commented Oct 24, 2023

@karpetrosyan Unfortunately, it's hard to give a working example that reproduces the issue easily, since the probability of it happening is low. I can describe the scenario in which it can happen:

I have a typical web server serving requests. In the request handler I use a global AsyncConnectionPool object, shared across all requests, to call another service. The request handler can fail while the HTTP call to the upstream service is in flight, for reasons unrelated to the HTTP call itself, for example:

```python
import asyncio

@app.get("/some_api")
async def main():
    req = make_upstream_request("http://upstream-service")
    # global_client_pool is the shared AsyncConnectionPool.
    task_http_call = asyncio.create_task(global_client_pool.handle_async_request(req))
    main_task = asyncio.create_task(other_tasks())
    await asyncio.gather(task_http_call, main_task)
```

If `main_task` throws an exception at the exact moment `handle_async_request` is inside `self._close_expired_connections()`, `task_http_call` is canceled with the current request left in `self._requests` forever. This leaves the pool in a bad state: all future calls to `_attempt_to_acquire_connection` return False immediately even though there are no connections in the pool.

@karpetrosyan (Member)

We currently have issues with our async cancellation support; also, wrapping every single await to handle cancellation so that it cannot break the client state does not make sense to me.
Perhaps we could shield all connection pool activities to ensure that the connection pool is never broken 🤔.

See also: #785, #757
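
For context, "shielding" here means protecting an operation from cancellation, as in the minimal asyncio sketch below (httpcore itself would do this through its async backends; the dict-based pool is just a stand-in):

```python
import asyncio

async def update_pool_state(pool):
    # Stand-in for pool bookkeeping that must not be interrupted halfway.
    pool["requests"] += 1
    await asyncio.sleep(0.1)
    pool["requests"] -= 1

async def caller(pool):
    # If caller() is cancelled while awaiting, asyncio.shield lets the
    # inner work run to completion anyway; only this await is interrupted.
    await asyncio.shield(update_pool_state(pool))

async def main():
    pool = {"requests": 0}
    task = asyncio.create_task(caller(pool))
    await asyncio.sleep(0.01)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    await asyncio.sleep(0.2)   # let the shielded work finish
    print(pool["requests"])    # 0: the bookkeeping still completed

asyncio.run(main())
```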

@rattrayalex commented Jan 26, 2024

Hey @tomchristie, is there any way we could prioritize this? It blocks adoption of httpx (and thus the openai python package) at OpenAI – their infra team is currently migrating off httpx to aiohttp solely due to this issue.

I'd be happy to fund/sponsor the work, either with funds to Encode or with a contribution from someone on our team.

@karpetrosyan (Member)

Looks like we overlooked this problem.

I've got a couple of pull requests that fix various situations in which the connection pool breaks after cancellation.

See #757 and #785.

I've also started a discussion on this subject (see #844).

The problem is really complex; to overcome it, we must either audit every await and ensure it cannot break the connection pool state, or resolve it at another layer.

We could use a shielded scope in the handle_request methods, but disable it for activities that do not touch the connection pool. If we shield the entire handle_request method, the program may hang even after it is cancelled.

I'll go over this problem again, and if we can't handle all of the scenarios, I'll submit a pull request that uses shielding everywhere in the connection pool layer while disabling it for network activities; a sketch of that shape follows.
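
Roughly (a hedged sketch with hypothetical pool/connection names, using anyio's `CancelScope`, not an actual httpcore diff): the pool-mutating sections are shielded individually, while the network await stays unshielded so the caller's cancellation can still interrupt slow I/O.

```python
import anyio

async def handle_request_sketch(pool, request):
    with anyio.CancelScope(shield=True):
        # Pool bookkeeping: a cancellation arriving here is deferred
        # until the scope exits, so the pool state stays consistent.
        connection = await pool.acquire_connection(request)
    try:
        # Network I/O: unshielded on purpose, so cancelling the caller
        # still interrupts a slow upstream request.
        return await connection.send(request)
    finally:
        with anyio.CancelScope(shield=True):
            # Cleanup is shielded too, so a cancellation delivered
            # during send() cannot corrupt the pool while releasing.
            await pool.release_connection(connection)
```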

@tomchristie (Member)

> is there any way we could prioritize this?

Yes absolutely.

> I'd be happy to fund/sponsor the work.

A helpful way to resource this would be helping us reliably replicate the issue you're seeing. Are you able to get a minimal example together that demos the case?

@rattrayalex

Thank you so much Tom & Kar!

> Are you able to get a minimal example together that demos the case?

Sadly not – the OpenAI team that hit this saw it on a very low percentage (but high absolute number) of requests, and is unfortunately already porting the relevant code to aiohttp (my hope here is to stem the bleeding). I've asked, and they don't have time right now to devise a minimal repro, which is a shame.

Folks on our team would be starting from the same point as you (well, without the knowledge of httpcore internals that you have).

I really wish I could help on this, as I know quite well how devilish it is to try to debug things like this without a consistent repro script.

@karpetrosyan (Member)

> A helpful way to resource this would be helping us reliably replicate the issue you're seeing. Are you able to get a minimal example together that demos the case?

I don't think it's possible to reliably reproduce issues where an async cancellation at a specific point breaks the application state; instead, we can point at the exact line where raising CancelledError reproduces the issue, as in the sketch below.
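
For instance (a hedged, test-only sketch against the private API, version-dependent and not a confirmed reproduction): patch `_close_expired_connections` so the CancelledError lands at exactly the await the original report identifies.

```python
import asyncio
from unittest import mock

import httpcore

async def main():
    async with httpcore.AsyncConnectionPool(max_connections=1) as pool:
        with mock.patch.object(
            httpcore.AsyncConnectionPool,
            "_close_expired_connections",
            side_effect=asyncio.CancelledError,
        ):
            try:
                await pool.request("GET", "https://example.org/")
            except asyncio.CancelledError:
                pass  # the status entry is now stranded in pool._requests
        # If the bug is present, this second request can hang forever,
        # waiting on the slot the stranded entry still occupies.
        await pool.request("GET", "https://example.org/")

asyncio.run(main())
```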

@tomchristie (Member)

@rattrayalex It looks clear to me how to resolve this (everywhere we hold the pool lock, we also shield against cancellation). I'll get a branch up with that in place.
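
A rough shape of that approach (a hedged sketch with hypothetical names, not the actual change):

```python
import anyio

# The pairing: whenever the pool lock is held, the critical section is
# also shielded, so a cancellation can only be delivered between
# critical sections, never in the middle of one.

async def add_request(pool, status):
    with anyio.CancelScope(shield=True):
        async with pool.lock:
            pool.requests.append(status)
            await pool.close_expired_connections()
            await pool.attempt_to_acquire_connection(status)
```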

@rattrayalex

That's great to hear, thank you @tomchristie !
