remote.Resolver timeout doesn't work as expected #1180

Open
Kern-- opened this issue Mar 27, 2023 · 0 comments

Kern-- commented Mar 27, 2023

While trying to write an integration test for poor network conditions in the soci-snapshotter, we noticed that the timeout logic didn't work the way we expected.

In particular, we found that when a connection timed out, no retries were performed and the request was abandoned after the first failure. We also noticed that no matter what we set the default timeout to, the request would always block for 30 seconds. We expected that a connection timeout would be retried like any other HTTP failure and that the default timeout would apply to each individual request.

Root cause

The root cause appears to be the way that retries were added to the stargz-snapshotter:

    client := rhttp.NewClient()
    client.Logger = nil // disable logging every request
    tr := client.StandardClient()
    if h.RequestTimeoutSec >= 0 {
        if h.RequestTimeoutSec == 0 {
            tr.Timeout = defaultRequestTimeoutSec * time.Second
        } else {
            tr.Timeout = time.Duration(h.RequestTimeoutSec) * time.Second
        }
    } // h.RequestTimeoutSec < 0 means "no timeout"

The structure of tr after this block is the following (ignoring irrelevant fields):

http.Client {
    Timeout: <configured timeout>
    Transport: rhttp.RoundTripper {
        Client: rhttp.Client {
            HTTPClient: http.Client {}
        }
    }
}

What appears to be happening is that once the outer http.Client timeout is reached, the request context is cancelled, which prevents the rhttp.Client from doing any retries.

What also seems to be happening is that the inner http.Client is not selecting over the request context at all because it doesn't have a timeout. The 30s before it exits appears to be a dial timeout configured deep in the default client used by go-retryablehttp: https://github.com/hashicorp/go-cleanhttp/blob/master/cleanhttp.go#L30.

I think the solution here is to move the timeout to the inner http.Client so that it defines the per-request timeout rather than the total request timeout. The total request timeout is then configurable based on the retry policy.
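
To make the difference between the two placements concrete, here is a minimal sketch that uses only the standard library. `retryTransport` is a hypothetical stand-in for rhttp.RoundTripper (it is not the go-retryablehttp implementation), and the test server, durations, and helper names are invented for the demo. With go-retryablehttp itself, the equivalent change would presumably be to set the timeout on `client.HTTPClient` rather than on `tr`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// retryTransport is a hypothetical stand-in for rhttp.RoundTripper: each
// attempt goes through an inner http.Client, and failed attempts are retried.
type retryTransport struct {
	inner    *http.Client
	maxTries int
	backoff  time.Duration
	attempts int // number of attempts actually made
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var lastErr error
	for i := 0; i < t.maxTries; i++ {
		t.attempts++
		resp, err := t.inner.Do(req.Clone(req.Context()))
		if err == nil {
			return resp, nil
		}
		lastErr = err
		// Back off before retrying, but give up once the caller's context is done.
		select {
		case <-req.Context().Done():
			return nil, req.Context().Err()
		case <-time.After(t.backoff):
		}
	}
	return nil, lastErr
}

// run sends one GET to a server whose first response is slow and reports how
// many attempts were made and whether the request eventually succeeded.
func run(outerTimeout, perAttemptTimeout time.Duration) (int, bool) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&hits, 1) == 1 {
			// Only the first request stalls; later ones answer immediately.
			select {
			case <-time.After(2 * time.Second):
			case <-r.Context().Done():
			}
		}
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	rt := &retryTransport{
		inner:    &http.Client{Timeout: perAttemptTimeout}, // per-attempt timeout
		maxTries: 3,
		backoff:  20 * time.Millisecond,
	}
	outer := &http.Client{Transport: rt, Timeout: outerTimeout} // total timeout
	resp, err := outer.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
	}
	return rt.attempts, err == nil
}

// Timeout on the outer client, as in the snippet above.
func outerTimeoutDemo() (int, bool) { return run(200*time.Millisecond, 0) }

// Timeout moved to the inner client, as proposed.
func innerTimeoutDemo() (int, bool) { return run(0, 200*time.Millisecond) }

func main() {
	n, ok := outerTimeoutDemo()
	fmt.Printf("outer timeout: attempts=%d ok=%v\n", n, ok)
	n, ok = innerTimeoutDemo()
	fmt.Printf("inner timeout: attempts=%d ok=%v\n", n, ok)
}
```

In this sketch the outer timeout covers the whole retry loop, so the first slow attempt uses up the budget and no retry happens. With the timeout on the inner client, only the slow attempt is cut short and the retry can succeed.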

Related Issues

There are a couple of related issues that might be worth tracking too:

  1. The default timeout is hardcoded to 30s https://github.com/containerd/stargz-snapshotter/blob/main/service/resolver/registry.go#L30
  2. Registry mirrors have a configurable timeout https://github.com/containerd/stargz-snapshotter/blob/main/service/resolver/registry.go#L52, but the main host itself doesn't https://github.com/containerd/stargz-snapshotter/blob/main/service/resolver/registry.go#L65
  3. remote.httpFetcher follows redirects when it shouldn't https://github.com/containerd/stargz-snapshotter/blob/main/fs/remote/resolver.go#L478. This is because rhttp.RoundTripper.RoundTrip eventually calls the inner http.Client.Do. I don't understand the reasons for not following redirects here, so I'm not sure if this is an issue.
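
The redirect behaviour in item 3 can be reproduced with the standard library alone. The sketch below is an illustration under assumptions: `clientBackedTransport` is a hypothetical stand-in for a RoundTripper that delegates to an inner http.Client (as rhttp.RoundTripper is described above), and the server paths and helper names are made up.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// clientBackedTransport is a hypothetical stand-in for a RoundTripper whose
// RoundTrip delegates to an inner http.Client's Do.
type clientBackedTransport struct{ inner *http.Client }

func (t *clientBackedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return t.inner.Do(req)
}

// statusVia calls rt.RoundTrip directly against a server that redirects
// /blob to /elsewhere, and returns the status code the caller sees.
func statusVia(rt http.RoundTripper) int {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/blob" {
			http.Redirect(w, r, "/elsewhere", http.StatusFound)
			return
		}
		fmt.Fprint(w, "redirect target")
	}))
	defer srv.Close()

	req, err := http.NewRequest(http.MethodGet, srv.URL+"/blob", nil)
	if err != nil {
		return -1
	}
	resp, err := rt.RoundTrip(req)
	if err != nil {
		return -1
	}
	defer resp.Body.Close()
	return resp.StatusCode
}

// A plain transport performs one exchange and hands back the redirect itself.
func directStatus() int { return statusVia(http.DefaultTransport) }

// A client-backed transport lets the inner client follow the redirect.
func wrappedStatus() int {
	return statusVia(&clientBackedTransport{inner: &http.Client{}})
}

func main() {
	fmt.Println("plain transport:", directStatus())
	fmt.Println("client-backed transport:", wrappedStatus())
}
```

Calling RoundTrip on a plain transport returns the 302 to the caller, while the client-backed transport returns the final 200 because the inner client's default redirect policy has already followed it. If not following redirects is intended, one option (an assumption, not something the source confirms) would be to set CheckRedirect on the inner client to return http.ErrUseLastResponse.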