`remote.Resolver` timeout doesn't work as expected #1180

Kern-- · 2023-03-27T21:35:59Z

While trying to write an integration test for poor network conditions in the soci-snapshotter, we noticed that the timeout logic didn't work the way we expected.

In particular, we found that when a connection timed out, no retries were performed and the request was abandoned after the first failure. We also noticed that no matter what we set the default timeout to, the request would always block for 30 seconds. We expected that a connection timeout would be retried like any other HTTP failure and that the default timeout would apply to each individual request.

Root cause

The root cause appears to be the way that retries were added to the stargz-snapshotter:

stargz-snapshotter/service/resolver/registry.go

Lines 67 to 76 in 0c9f876

    
    client := rhttp.NewClient()
    client.Logger = nil // disable logging every request
    tr := client.StandardClient()
    if h.RequestTimeoutSec >= 0 {
    	if h.RequestTimeoutSec == 0 {
    		tr.Timeout = defaultRequestTimeoutSec * time.Second
    	} else {
    		tr.Timeout = time.Duration(h.RequestTimeoutSec) * time.Second
    	}
    } // h.RequestTimeoutSec < 0 means "no timeout"

The structure of tr after this block is the following (ignoring irrelevant fields):

    http.Client {
        Timeout: <configured timeout>
        Transport: rhttp.RoundTripper {
            Client: rhttp.Client {
                HTTPClient: http.Client {}
            }
        }
    }

What appears to be happening is that once the outer http.Client timeout is reached, the request context is cancelled which prevents the rhttp.Client from doing any retries.

What also seems to be happening is that the inner http.Client is not selecting over the request context at all because it doesn't have a timeout. The 30s delay before it exits appears to be the dial timeout configured in the default transport that go-retryablehttp takes from go-cleanhttp: https://github.com/hashicorp/go-cleanhttp/blob/master/cleanhttp.go#L30.

I think the solution here is to move the timeout to the inner http.Client so that it defines the per-request timeout rather than the total request timeout. The total request timeout then follows from the retry policy (per-request timeout, number of retries, and backoff).
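To make the difference concrete, here is a minimal stdlib-only sketch of the two placements. It does not use go-retryablehttp: `retryTransport`, `tryFetch`, the 200ms timeout, the 20ms backoff, and the test server that stalls only on its first request are all hypothetical stand-ins chosen for the demo. The nesting mirrors the structure above (outer http.Client, retrying round tripper, inner http.Client). In go-retryablehttp terms, the second case would roughly correspond to setting `Timeout` on the client's `HTTPClient` field instead of on the client returned by `StandardClient()`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

const timeout = 200 * time.Millisecond // hypothetical value for the demo

// retryTransport is a hypothetical stand-in for rhttp.RoundTripper: every
// attempt goes through an inner http.Client and failed attempts are retried.
type retryTransport struct {
	inner    *http.Client
	maxTries int
	attempts int
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var lastErr error
	for i := 0; i < t.maxTries; i++ {
		if i > 0 {
			// Back off briefly; give up if the caller's context has ended,
			// which is what an expired outer Timeout causes.
			select {
			case <-time.After(20 * time.Millisecond):
			case <-req.Context().Done():
				return nil, req.Context().Err()
			}
		}
		t.attempts++
		resp, err := t.inner.Do(req.Clone(req.Context()))
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

// tryFetch issues one GET against a server whose first response stalls.
// The timeout is placed either on the outer client (total budget) or on
// the inner client (per-attempt budget).
func tryFetch(timeoutOnOuter bool) (attempts int, ok bool) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&hits, 1) == 1 {
			// Only the first request stalls; stop waiting if the client leaves.
			select {
			case <-time.After(2 * time.Second):
			case <-r.Context().Done():
			}
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	inner := &http.Client{}
	rt := &retryTransport{inner: inner, maxTries: 3}
	outer := &http.Client{Transport: rt}
	if timeoutOnOuter {
		outer.Timeout = timeout // bounds the whole request, retries included
	} else {
		inner.Timeout = timeout // bounds each attempt separately
	}

	resp, err := outer.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
	}
	return rt.attempts, err == nil
}

func main() {
	for _, onOuter := range []bool{true, false} {
		n, ok := tryFetch(onOuter)
		fmt.Printf("timeout on outer client=%v: attempts=%d success=%v\n", onOuter, n, ok)
	}
}
```

Under these assumptions, the outer placement should give up after a single failed attempt, because the expired deadline cancels the context the retry loop depends on, while the inner placement should time out the first attempt and succeed on the retry. The total time budget in the second case is then roughly the per-attempt timeout times the number of tries, plus backoff.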

Related Issues

There are a couple of related issues that might be worth tracking too:

The default timeout is hardcoded to 30s https://github.com/containerd/stargz-snapshotter/blob/main/service/resolver/registry.go#L30
Registry mirrors have a configurable timeout https://github.com/containerd/stargz-snapshotter/blob/main/service/resolver/registry.go#L52, but the main host itself doesn't https://github.com/containerd/stargz-snapshotter/blob/main/service/resolver/registry.go#L65
remote.httpFetcher follows redirects when it shouldn't https://github.com/containerd/stargz-snapshotter/blob/main/fs/remote/resolver.go#L478. This is because rhttp.RoundTripper.RoundTrip eventually calls the inner http.Client.Do, which follows redirects by default. I don't understand the reasons for not following redirects here, so I'm not sure if this is an issue.


Kern-- mentioned this issue Mar 27, 2023

Make timeout per-request #1181

Merged
