
Hyper client hang in container #1549

Closed
tierriminator opened this issue Jan 5, 2024 · 1 comment · Fixed by #1550

@tierriminator

I've recently been hit with random client hangs when deploying an application that uses Azure Blob Storage to a Kubernetes cluster. I've tracked it down to hyperium/hyper#2312, which affects reqwest and therefore this repo as well.
The workaround is setting `pool_max_idle_per_host(0)` on the client.
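
For anyone else hitting this, a minimal sketch of that workaround on the reqwest side (an illustration of the builder call, not code from this repo):

```rust
/// Build a reqwest client that keeps no idle connections per host,
/// sidestepping the hyper connection-pool hang (hyperium/hyper#2312).
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .pool_max_idle_per_host(0)
        .build()
}
```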

demoray pushed a commit to demoray/azure-sdk-for-rust that referenced this issue Jan 5, 2024
As indicated in Azure#1549, there is an issue with hyper (the underlying
layer used by reqwest) that hangs in some cases on connection pools.
This PR uses a commonly discussed workaround of setting
`pool_max_idle_per_host` to 0.

Ref: hyperium/hyper#2312
@demoray (Contributor) commented Jan 5, 2024

@tierriminator good catch. I was experiencing the same issue yesterday and had not tracked down why yet. Making this change addressed the issue for me as well.

You can work around it by creating your own `HttpClient` (a sketch follows below), but that isn't the most ergonomic.

I'll submit a PR for this shortly.
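
A minimal sketch of that custom-transport workaround, assuming azure_core's reqwest-backed `HttpClient` implementation and its `TransportOptions` type; the helper name is hypothetical and the exact wiring into a service client depends on the crate version:

```rust
use std::sync::Arc;

use azure_core::{HttpClient, TransportOptions};

/// Illustrative helper (not from this repo): transport options backed by a
/// reqwest client that keeps no idle connections, avoiding the pool hang.
fn no_idle_pool_transport() -> TransportOptions {
    let reqwest_client = reqwest::Client::builder()
        .pool_max_idle_per_host(0)
        .build()
        .expect("failed to build reqwest client");

    // azure_core provides an `HttpClient` impl for `reqwest::Client`, so the
    // client can be wrapped directly as a custom transport.
    let http_client: Arc<dyn HttpClient> = Arc::new(reqwest_client);
    TransportOptions::new(http_client)
}
```

The resulting `TransportOptions` would then be handed to whichever service client builder is in use; the exact setter name varies by crate version, so treat that part as an assumption.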

demoray linked a pull request on Jan 5, 2024 that will close this issue
demoray added a commit that referenced this issue Jan 5, 2024
As indicated in #1549, there is an issue with hyper (the underlying
layer used by reqwest) that hangs in some cases on connection pools.
This PR uses a commonly discussed workaround of setting
`pool_max_idle_per_host` to 0.

Ref: hyperium/hyper#2312
github-merge-queue bot pushed a commit to neondatabase/neon that referenced this issue Nov 22, 2024
## Problem

close #9836

Looking at the Azure SDK, the only related issue I can find is
Azure/azure-sdk-for-rust#1549. The Azure SDK uses reqwest as the backend, so
I assume there is some underlying behavior unknown to us that caused the
hang in #9836.

The observations are:
* We didn't get an explicit out-of-resources HTTP error from Azure.
* The connection simply gets stuck and times out.
* But when we retry after reaching the timeout, it succeeds.

This issue is hard to pin down -- maybe something went wrong on the ABS
side, or maybe on ours. But we know that a retry usually succeeds once we
give up the stuck connection.

Therefore, I propose that we preempt stuck HTTP operations and actively
retry. This mitigates the problem; in the long run, we need to keep an eye
on ABS usage and see whether we can fully resolve it.

The reasoning behind the timeout mechanism: we use a much smaller timeout
than before so that stuck requests are preempted quickly, but a normal
listing operation can legitimately take longer than that initial timeout if
it returns a lot of keys. Therefore, after we terminate a connection, we
double the timeout, so that such requests eventually succeed.

## Summary of changes

* Use exponential growth for the ABS list timeout.
* Rather than using a fixed timeout, use a timeout that starts small and grows.
* Rather than exposing timeouts to the `list_streaming` caller as soon as we see them, only do so after we have retried a few times.

Signed-off-by: Alex Chi Z <chi@neon.tech>
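
A rough sketch of the timeout-doubling retry described in that commit message; the function name, error type, and limits here are placeholders rather than the actual neon code:

```rust
use std::future::Future;
use std::time::Duration;

use tokio::time::timeout;

/// Placeholder error type for the sketch.
#[derive(Debug)]
enum ListError {
    TimedOut,
    Backend(String),
}

/// Retry a listing operation, preempting each attempt with a timeout that
/// starts small and doubles after every timed-out attempt. A timeout is
/// only surfaced to the caller once all attempts are exhausted.
async fn list_with_growing_timeout<T, F, Fut>(
    mut list_op: F,
    initial_timeout: Duration,
    max_attempts: u32,
) -> Result<T, ListError>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, String>>,
{
    let attempts = max_attempts.max(1);
    let mut attempt_timeout = initial_timeout;
    for attempt in 1..=attempts {
        match timeout(attempt_timeout, list_op()).await {
            // The attempt finished, successfully or with a backend error.
            Ok(result) => return result.map_err(ListError::Backend),
            // The attempt got stuck: drop it, double the budget, retry.
            Err(_elapsed) if attempt < attempts => attempt_timeout *= 2,
            Err(_elapsed) => return Err(ListError::TimedOut),
        }
    }
    unreachable!("the final attempt always returns")
}
```

With, for example, an initial timeout of 30 seconds and three attempts, a stuck connection is abandoned after 30 seconds, while a genuinely large listing still gets up to 120 seconds on the final attempt.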