25 max_retries is really long for a default #1209

Closed
kmosher opened this issue Jul 22, 2017 · 2 comments
Labels
enhancement: Requests to existing resources that expand the functionality or scope.
provider: Pertains to the provider itself, rather than any interaction with AWS.

Comments

kmosher (Contributor) commented Jul 22, 2017

I've been debugging an issue the past few days where operations on lambda functions were hanging forever. The real root cause eventually turned out to be a crappy corpware DNS resolver returning non-RFC compliant replies that the golang net library didn't like. However, this still turned up terraform issues that significantly delayed me in getting to the root of the problem.

  1. A default of 25 for max_retries seems like way too many. If the request itself takes any significant amount of time to fail, a resource can be stuck spinning for over an hour, especially because the SDK's default retry logic is exponential.

The retry delay from aws-sdk-go/aws/client/default_retryer is effectively 2**min(retryCount, 13) * randint(30) + 30, in milliseconds. (Some of the numbers change to scale up faster when the error is AWS telling the client to throttle.)

The start of that function is pretty gentle. Assuming an average of 15 on the rand call, it's not until the 8th attempt that you spend more than a second between calls. But after that it scales quickly, as exponentials are wont to do, and by the 14th retry you've hit the scaling cap and spend an average of about 2 minutes between calls. Summing over the 25 retries, you can expect to wait an average of about 26 minutes on a failing request, plus whatever time it takes all 25 requests themselves to fail (which can be substantial when the failure involves a request timeout).

Full sequence of average wait times between each request (in seconds):

[0.045, 0.06, 0.09, 0.15, 0.27, 
 0.51, 0.99, 1.95, 3.87, 7.71, 
15.39, 30.75, 61.47, 122.91, 122.91, 
122.91, 122.91, 122.91, 122.91, 122.91, 
122.91, 122.91, 122.91, 122.91, 122.91]

A max_retries of 10 (~15s total sleeping) or 12 (~60s total sleeping) seems like a better default, maybe 14 at the most, where we can expect terraform to wait an average of about 4 minutes before bailing.
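For reference, here is a rough back-of-the-envelope sketch in Go (not the SDK's code, just the formula above with the rand call replaced by its assumed average of 15) that reproduces the per-attempt and cumulative waits quoted here:

```go
package main

import "fmt"

// Rough model of the default retryer's average delay:
// 2**min(retryCount, 13) * 15 + 30 milliseconds per attempt.
func main() {
	const maxRetries = 25
	totalMs := 0
	for i := 0; i < maxRetries; i++ {
		capped := i
		if capped > 13 {
			capped = 13 // the retryer caps the exponent at 13
		}
		delayMs := (1<<uint(capped))*15 + 30 // average wait before retry i+1
		totalMs += delayMs
		fmt.Printf("retry %2d: %8.3fs  (cumulative %7.1fs)\n",
			i+1, float64(delayMs)/1000, float64(totalMs)/1000)
	}
	fmt.Printf("expected total sleep over %d retries: ~%.1f minutes\n",
		maxRetries, float64(totalMs)/1000/60)
}
```

Running it puts the expected sleep at roughly 26 minutes for 25 retries, versus about 15s for 10 retries and about 60s for 12.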

  2. The above might be okay, but the provider doesn't configure the SDK to log the reasons it's retrying, so you have to wait out the full timeout before the underlying error is bubbled up.
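For comparison, the SDK itself can be configured to log retried requests and the errors that triggered them. Here is a minimal sketch with plain aws-sdk-go (not something the provider wires up today; the region, service, and retry cap are just illustrative placeholders):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/lambda"
)

func main() {
	// Placeholder region and a lower retry cap than the provider's default of 25.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"),
		MaxRetries: aws.Int(10),
		// Log each retried request and the error that caused the retry,
		// so a failing call is visible long before the retry budget is spent.
		LogLevel: aws.LogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors),
	}))

	svc := lambda.New(sess)
	if _, err := svc.ListFunctions(&lambda.ListFunctionsInput{}); err != nil {
		fmt.Println("request failed after retries:", err)
	}
}
```

With those debug log levels set, each failed attempt is printed as it happens instead of only surfacing once the whole retry budget has been exhausted.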

Terraform Version

0.9.11 and 0.10-1rc1 (w/ aws 1.2)

Example HCL

I used the verbatim HCL from https://www.terraform.io/docs/providers/aws/r/lambda_function.html, but any old HCL will do

Steps to Reproduce

  1. terraform refresh/apply
  2. While terraform is running, do something to induce connection errors. For instance, I set my nameserver to 127.0.0.1.
@grubernaut grubernaut added the bug Addresses a defect in current functionality. label Jul 24, 2017
@radeksimko radeksimko added enhancement Requests to existing resources that expand the functionality or scope. and removed bug Addresses a defect in current functionality. labels Jan 28, 2018
@bflad bflad added the provider Pertains to the provider itself, rather than any interaction with AWS. label Jan 29, 2018
bflad (Contributor) commented Jun 19, 2019

The fix to lower the threshold for DNS resolution errors (#4459) was previously released in version 1.18.0 of the AWS provider and has been available in all releases since. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

In general though, the provider max_retries argument can be tuned as appropriate for your environments. We have found over the years that operators of larger environments appreciate the larger default value and in many scenarios the default AWS Go SDK handling for retries is correct (e.g. retrying throttling errors).

If there are specific cases where the retry logic is not working as expected, please feel free to create a new Bug Report issue filling out the relevant details and we will further triage. Thanks. 👍

@bflad bflad closed this as completed Jun 19, 2019
ghost commented Nov 3, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Nov 3, 2019