While trying to write an integration test for poor network conditions in the soci-snapshotter, we noticed that the timeout logic didn't work the way we expected.
In particular, we found that when a connection timed out, no retries were performed and the request was abandoned after the first failure. We also noticed that no matter what we set the default timeout to, the request would always block for 30 seconds. We expected that a connection timeout would be retried like any other HTTP failure and that the default timeout would be the timeout for each request.
Root cause
The root cause appears to be the way that retries were added to the stargz-snapshotter:
stargz-snapshotter/service/resolver/registry.go, lines 67 to 76 in 0c9f876
The structure of tr after this block is the following (ignoring irrelevant fields):
What appears to be happening is that once the outer http.Client timeout is reached, the request context is cancelled, which prevents the rhttp.Client from doing any retries.
What also seems to be happening is that the inner http.Client is not selecting over the request context at all because it doesn't have a timeout. The 30s before it exits appears to be a dial timeout configured deep in the default client used by go-retryablehttp: https://github.com/hashicorp/go-cleanhttp/blob/master/cleanhttp.go#L30.
I think the solution here is to move the timeout to the inner http client so that it defines the per-request timeout rather than the total request timeout. The total request timeout is then configurable based on the retry policy.
Related Issues
There are a couple of related issues that might be worth tracking too:
remote.httpFetcher follows redirects when it shouldn't https://github.com/containerd/stargz-snapshotter/blob/main/fs/remote/resolver.go#L478. This is because the rhttp.RoundTripper.RoundTrip eventually calls the inner http.Client.Do. I don't understand the reasons for not following redirects here, so I'm not sure if this is an issue.