
Hyper client hang in container #1549

Closed
tierriminator opened this issue Jan 5, 2024 · 1 comment · Fixed by #1550

@tierriminator

I've recently been hit with random client hangs when deploying an application that uses Azure Blob Storage to a Kubernetes cluster. I've tracked it down to hyperium/hyper#2312, which affects reqwest and therefore this repo as well.
The workaround is setting `pool_max_idle_per_host(0)` on the client.
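
For anyone else hitting this, a minimal sketch of that workaround on the reqwest side (an illustration of the builder call, not code from this repo):

```rust
/// Build a reqwest client that keeps no idle connections per host,
/// sidestepping the hyper connection-pool hang (hyperium/hyper#2312).
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .pool_max_idle_per_host(0)
        .build()
}
```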

demoray pushed a commit to demoray/azure-sdk-for-rust that referenced this issue Jan 5, 2024
As indicated in Azure#1549, there is an issue with hyper (the underlying
layer used by reqwest) that hangs in some cases on connection pools.
This PR uses a commonly discussed workaround of setting
`pool_max_idle_per_host` to 0.

Ref: hyperium/hyper#2312
@demoray (Contributor) commented Jan 5, 2024

@tierriminator good catch. I was experiencing the same issue yesterday and had not tracked down why yet. Making this change addressed the issue for me as well.

You can work around it by creating your own `HttpClient` (a sketch follows below), but that isn't the most ergonomic.

I'll submit a PR for this shortly.
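
A minimal sketch of that custom-transport workaround, assuming azure_core's reqwest-backed `HttpClient` implementation and its `TransportOptions` type; the helper name is hypothetical and the exact wiring into a service client depends on the crate version:

```rust
use std::sync::Arc;

use azure_core::{HttpClient, TransportOptions};

/// Illustrative helper (not from this repo): transport options backed by a
/// reqwest client that keeps no idle connections, avoiding the pool hang.
fn no_idle_pool_transport() -> TransportOptions {
    let reqwest_client = reqwest::Client::builder()
        .pool_max_idle_per_host(0)
        .build()
        .expect("failed to build reqwest client");

    // azure_core provides an `HttpClient` impl for `reqwest::Client`, so the
    // client can be wrapped directly as a custom transport.
    let http_client: Arc<dyn HttpClient> = Arc::new(reqwest_client);
    TransportOptions::new(http_client)
}
```

The resulting `TransportOptions` would then be handed to whichever service client builder is in use; the exact setter name varies by crate version, so treat that part as an assumption.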

demoray linked a pull request on Jan 5, 2024 that will close this issue
demoray added a commit that referenced this issue Jan 5, 2024
As indicated in #1549, there is an issue with hyper (the underlying
layer used by reqwest) that hangs in some cases on connection pools.
This PR uses a commonly discussed workaround of setting
`pool_max_idle_per_host` to 0.

Ref: hyperium/hyper#2312
github-merge-queue bot pushed a commit to neondatabase/neon that referenced this issue Nov 22, 2024
## Problem

close #9836

Looking at the Azure SDK, the only related issue I can find is
Azure/azure-sdk-for-rust#1549. The Azure SDK uses reqwest as the backend, so
I assume there is some underlying behavior unknown to us that caused the
hang in #9836.

The observations are:
* We didn't get an explicit out-of-resources HTTP error from Azure.
* The connection simply gets stuck and times out.
* But when we retry after reaching the timeout, it succeeds.

This issue is hard to pin down -- maybe something went wrong on the ABS
side, or maybe on ours. But we know that a retry usually succeeds once we
give up the stuck connection.

Therefore, I propose that we preempt stuck HTTP operations and actively
retry. This mitigates the problem; in the long run, we need to keep an eye
on ABS usage and see whether we can fully resolve it.

The reasoning behind the timeout mechanism: we use a much smaller timeout
than before so that stuck requests are preempted quickly, but a normal
listing operation can legitimately take longer than that initial timeout if
it returns a lot of keys. Therefore, after we terminate a connection, we
double the timeout, so that such requests eventually succeed.

## Summary of changes

* Use exponential growth for the ABS list timeout.
* Rather than using a fixed timeout, use a timeout that starts small and grows.
* Rather than exposing timeouts to the `list_streaming` caller as soon as we see them, only do so after we have retried a few times.

Signed-off-by: Alex Chi Z <chi@neon.tech>
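
A rough sketch of the timeout-doubling retry described in that commit message; the function name, error type, and limits here are placeholders rather than the actual neon code:

```rust
use std::future::Future;
use std::time::Duration;

use tokio::time::timeout;

/// Placeholder error type for the sketch.
#[derive(Debug)]
enum ListError {
    TimedOut,
    Backend(String),
}

/// Retry a listing operation, preempting each attempt with a timeout that
/// starts small and doubles after every timed-out attempt. A timeout is
/// only surfaced to the caller once all attempts are exhausted.
async fn list_with_growing_timeout<T, F, Fut>(
    mut list_op: F,
    initial_timeout: Duration,
    max_attempts: u32,
) -> Result<T, ListError>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, String>>,
{
    let attempts = max_attempts.max(1);
    let mut attempt_timeout = initial_timeout;
    for attempt in 1..=attempts {
        match timeout(attempt_timeout, list_op()).await {
            // The attempt finished, successfully or with a backend error.
            Ok(result) => return result.map_err(ListError::Backend),
            // The attempt got stuck: drop it, double the budget, retry.
            Err(_elapsed) if attempt < attempts => attempt_timeout *= 2,
            Err(_elapsed) => return Err(ListError::TimedOut),
        }
    }
    unreachable!("the final attempt always returns")
}
```

With, for example, an initial timeout of 30 seconds and three attempts, a stuck connection is abandoned after 30 seconds, while a genuinely large listing still gets up to 120 seconds on the final attempt.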