25 max_retries is really long for a default #1209

Closed
kmosher opened this issue Jul 22, 2017 · 2 comments
Labels
enhancement: Requests to existing resources that expand the functionality or scope.
provider: Pertains to the provider itself, rather than any interaction with AWS.

Comments

kmosher (Contributor) commented Jul 22, 2017

I've been debugging an issue the past few days where operations on lambda functions were hanging forever. The real root cause eventually turned out to be a crappy corpware DNS resolver returning non-RFC compliant replies that the golang net library didn't like. However, this still turned up terraform issues that significantly delayed me in getting to the root of the problem.

  1. A default of 25 for max_retries seems like way too many. If the request itself takes any significant amount of time to fail, a resource can be stuck spinning for over an hour, especially because the SDK's default retry logic is exponential.

The retry delay from aws-sdk-go/aws/client/default_retryer is effectively 2**min(retryCount, 13) * randint(30) + 30, in milliseconds. (Some of the numbers change to scale up faster when the error is AWS telling the client to throttle.)

The start of that function is pretty gentle. Assuming an average of 15 on the rand call, it's not until the 8th attempt that you spend more than a second between calls. But after that it scales quickly, as exponentials are wont to do, and by the 14th retry you've hit the scaling cap and spend an average of about 2 minutes between calls. Summing over the 25 retries, you can expect to wait an average of about 26 minutes on a failing request, plus whatever time it takes all 25 requests themselves to fail (which can be substantial when the failure involves a request timeout).

Full sequence of average wait times between each request (in seconds):

[0.045, 0.06, 0.09, 0.15, 0.27, 
 0.51, 0.99, 1.95, 3.87, 7.71, 
15.39, 30.75, 61.47, 122.91, 122.91, 
122.91, 122.91, 122.91, 122.91, 122.91, 
122.91, 122.91, 122.91, 122.91, 122.91]

A max_retries of 10 (~15s total sleeping) or 12 (~60s total sleeping) seems like a better default, maybe 14 at the most, where we can expect terraform to wait an average of about 4 minutes before bailing.
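For reference, here is a rough back-of-the-envelope sketch in Go (not the SDK's code, just the formula above with the rand call replaced by its assumed average of 15) that reproduces the per-attempt and cumulative waits quoted here:

```go
package main

import "fmt"

// Rough model of the default retryer's average delay:
// 2**min(retryCount, 13) * 15 + 30 milliseconds per attempt.
func main() {
	const maxRetries = 25
	totalMs := 0
	for i := 0; i < maxRetries; i++ {
		capped := i
		if capped > 13 {
			capped = 13 // the retryer caps the exponent at 13
		}
		delayMs := (1<<uint(capped))*15 + 30 // average wait before retry i+1
		totalMs += delayMs
		fmt.Printf("retry %2d: %8.3fs  (cumulative %7.1fs)\n",
			i+1, float64(delayMs)/1000, float64(totalMs)/1000)
	}
	fmt.Printf("expected total sleep over %d retries: ~%.1f minutes\n",
		maxRetries, float64(totalMs)/1000/60)
}
```

Running it puts the expected sleep at roughly 26 minutes for 25 retries, versus about 15s for 10 retries and about 60s for 12.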

  2. The above might be okay, but the provider doesn't configure the SDK to log the reasons it's retrying, so you have to wait out the full timeout before the underlying error is bubbled up.
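For comparison, the SDK itself can be configured to log retried requests and the errors that triggered them. Here is a minimal sketch with plain aws-sdk-go (not something the provider wires up today; the region, service, and retry cap are just illustrative placeholders):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/lambda"
)

func main() {
	// Placeholder region and a lower retry cap than the provider's default of 25.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"),
		MaxRetries: aws.Int(10),
		// Log each retried request and the error that caused the retry,
		// so a failing call is visible long before the retry budget is spent.
		LogLevel: aws.LogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors),
	}))

	svc := lambda.New(sess)
	if _, err := svc.ListFunctions(&lambda.ListFunctionsInput{}); err != nil {
		fmt.Println("request failed after retries:", err)
	}
}
```

With those debug log levels set, each failed attempt is printed as it happens instead of only surfacing once the whole retry budget has been exhausted.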

Terraform Version

0.9.11 and 0.10-1rc1 (w/ aws 1.2)

Example HCL

I used the verbatim HCL from https://www.terraform.io/docs/providers/aws/r/lambda_function.html, but any old HCL will do

Steps to Reproduce

  1. terraform refresh/apply
  2. While terraform is running, do something to induce connection errors. For instance, I set my nameserver to 127.0.0.1.
@grubernaut grubernaut added the bug Addresses a defect in current functionality. label Jul 24, 2017
@radeksimko radeksimko added enhancement Requests to existing resources that expand the functionality or scope. and removed bug Addresses a defect in current functionality. labels Jan 28, 2018
@bflad bflad added the provider Pertains to the provider itself, rather than any interaction with AWS. label Jan 29, 2018
bflad (Contributor) commented Jun 19, 2019

The fix to lower the threshold for DNS resolution errors (#4459) was previously released in version 1.18.0 of the AWS provider and has been available in all releases since. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

In general though, the provider max_retries argument can be tuned as appropriate for your environments. We have found over the years that operators of larger environments appreciate the larger default value and in many scenarios the default AWS Go SDK handling for retries is correct (e.g. retrying throttling errors).

If there are specific cases where the retry logic is not working as expected, please feel free to create a new Bug Report issue filling out the relevant details and we will further triage. Thanks. 👍

@bflad bflad closed this as completed Jun 19, 2019
ghost commented Nov 3, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Nov 3, 2019