-
Acknowledgements
Describe the bug
I have two versions of an application running on AWS EKS along with Istio sidecar containers. The application often boots up faster than the Istio sidecar, so it usually hits several "connection refused" errors from AWS APIs (and also from other APIs) while the Istio sidecar is initializing. The older version of the application uses older versions of the SDK modules and runs flawlessly, withstanding those "connection refused" errors. The new version of the application uses newer versions of the SDK modules and consistently fails to withstand the "connection refused" errors. It produces lots of errors like this:
Two interesting notes:
1 - The Istio sidecar takes only a couple of seconds to boot (well under the SDK's default 20-second max backoff), so I would not expect the SDK to give up and report "exceeded maximum number of attempts, 3". This led me to suspect the SDK is not retrying properly (with the 20-second backoff) on "connection refused" errors.
2 - The application uses the default retryer; it does NOT perform any customization.
The older application with GOOD results uses these versions:
I first found this issue with a newer version of the application running these versions:
Then I tried to update the newer application to the latest SDK module versions, but the newer versions produce the same effect (choking on transient "connection refused" errors).
Expected Behavior
The new application with newer SDK versions would properly withstand "connection refused" errors by retrying, much like the previous application with previous SDK versions did.
Current Behavior
The new application with the recent SDK fails with AWS APIs at boot, reporting:
Reproduction Steps
I don't know how to properly set up a lab environment where the SDK hits "connection refused" for some seconds before succeeding.
Possible Solution
I suppose the application code could be tweaked to re-create the SDK clients explicitly, retrying from outside the SDK, but that seems like a lot of duplicated effort, since the SDK provides a built-in retryer that should work.
Additional Information/Context
No response
AWS Go SDK V2 Module Versions Used
I attempted to upgrade to these versions but got the same result.
Compiler and Version used
go version go1.22.1 linux/amd64
Operating System and version
Linux 5.10 on amd64 on EC2 on AWS EKS
Replies: 4 comments
-
Hi @udhos,
It's not clear which SDK version you are upgrading from and to.
From the error provided, it seems like your application only retries 3 times:
exceeded maximum number of attempts, 3
but your retryer override specifies 6 retry attempts. This hints that perhaps this retryer is not being used by the constructed client.
You can also enable your SDK request and response logs. That would print the timestamps of the outgoing retry attempts and show whether that custom 40-second backoff is applied or not:
cfg, err := config.LoadDefaultConfig(context.TODO(),
    config.WithRegion("us-east-1"),
    config.WithClientLogMode(aws.LogRequestWithBody|aws.LogResponseWithBody|aws.LogRetries))
Thanks,
-
Hi @RanVaknin, I increased max attempts to 6 and max backoff to 40s right after opening this ticket, in order to persuade the SDK to retry longer. It did work: the new app is now able to withstand the initial "connection refused" errors. I did not try other values because delivering the app to the staging environment even once is a time-consuming process. I am now guessing that the default limit of 3 attempts might leave too tight a time window for this kind of error caused by the booting sidecar.
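For readers who want to try the same workaround, the change described above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the application: it assumes the retryer is set at config level through `config.WithRetryer` and `retry.NewStandard`, the helper name `loadConfigWithLongerRetries` is hypothetical, and the values 6 and 40s are simply the ones mentioned in this thread.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
)

// loadConfigWithLongerRetries is a hypothetical helper that loads the default
// SDK config with a standard retryer allowing more attempts and a larger
// backoff cap than the defaults discussed in this thread.
func loadConfigWithLongerRetries(ctx context.Context) (aws.Config, error) {
	return config.LoadDefaultConfig(ctx,
		config.WithRetryer(func() aws.Retryer {
			return retry.NewStandard(func(o *retry.StandardOptions) {
				o.MaxAttempts = 6               // value used in this thread
				o.MaxBackoff = 40 * time.Second // value used in this thread
			})
		}),
	)
}

func main() {
	cfg, err := loadConfigWithLongerRetries(context.TODO())
	if err != nil {
		log.Fatalf("loading AWS config: %v", err)
	}
	_ = cfg // pass cfg to service clients, e.g. a NewFromConfig constructor
}
```

Clients only pick up the retryer if they are built from this `cfg`; a client constructed from a different config would still use the default of 3 attempts.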
-
All this means is that we tried the configured number of attempts and never succeeded, so it's as you said: it appears to just be a tight window in your situation. Note that the default 20-second max backoff across n max attempts does not mean we deterministically scale up to 20 seconds on the nth (final) attempt. The delay is calculated exponentially, and I don't think in practice that default max will actually be reached with the default 3 attempts. In short, the retry behavior we're seeing here is correct per spec (and hasn't changed recently).
-
Hi @udhos. We recently ran into the same issue. After digging a bit, we found that it was better to configure Istio to wait to start our app until network egress was available, using the |