Increase Retry Time on Data Sources #454
Conversation
// listServiceRelationships by calling get dependencies using the serviceDependency.DependentService.ID
retryErr := resource.Retry(5*time.Minute, func() *resource.RetryError {
    if dependencies, _, err := client.ServiceDependencies.GetServiceDependenciesForType(dependency.DependentService.ID, dependency.DependentService.Type); err != nil {
        if isErrCode(err, 404) || isErrCode(err, 500) || isErrCode(err, 429) {
@stmcallister, would it be a good idea to centralize the logic of what counts as a retryable error? I'm not sure it is worth retrying on a 400 at all, for example. I can see the benefit of increasing the retry to 5 minutes for 429 and maybe 503/504 errors, but it might substantially hurt the experience if there is a real 404 or 400.
@drastawi Thanks for the feedback! The tricky part is that in some race-condition situations we do want to retry on a 404 🤔 Good point on centralizing the logic.
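One way to centralize that decision, as discussed above, could be a small helper that classifies status codes in a single place. This is only a sketch, not code from the provider: isRetryableErrCode and its parameters are hypothetical, while isErrCode is the provider's existing status-code check shown in the diff.

// isRetryableErrCode is a hypothetical helper that keeps the decision of
// which status codes are worth retrying in one place, instead of repeating
// isErrCode(...) || isErrCode(...) chains at every call site.
func isRetryableErrCode(err error, retryOn404 bool) bool {
    // Rate limiting and transient server-side failures are always retried.
    for _, code := range []int{429, 500, 503, 504} {
        if isErrCode(err, code) {
            return true
        }
    }
    // 404 is only retried where the caller expects a race condition,
    // e.g. a dependency that was just created may not be readable yet.
    return retryOn404 && isErrCode(err, 404)
}

Call sites inside the retry closures would then collapse to a single isRetryableErrCode(err, true) check before returning resource.RetryableError(err).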
if _, _, err = client.ServiceDependencies.DisassociateServiceDependencies(&input); err != nil {
-   if isErrCode(err, 404) {
+   if isErrCode(err, 404) || isErrCode(err, 429) {
        return resource.RetryableError(err)
@stmcallister I am a bit concerned that end users might have to wait 5 minutes to find out that a resource already exists or has already been deleted manually, which is not that uncommon.
One solution that seems fairly easy to implement at a glance would be to add a max timeout to the retryable error as an extra optional parameter:
Suggested change:
-   return resource.RetryableError(err)
+   if isErrCode(err, 429) {
+       return resource.RetryableError(err)
+   } else if isErrCode(err, 404) {
+       return resource.RetryableErrorWithMaxTimeout(err, 15 * time.Second)
+   }
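Note that RetryableErrorWithMaxTimeout in the suggestion above is a proposed helper, not an existing SDK function. Assuming no such helper is available, a rough sketch of the same idea with the existing resource.Retry API could track a separate deadline for 404s inside the retry closure; notFoundDeadline and the wiring around it here are assumptions, while resource.Retry, resource.RetryableError, resource.NonRetryableError, and the provider's isErrCode and client call are taken from the diff.

// Sketch: give 404s a short budget while 429s use the full retry window.
notFoundDeadline := time.Now().Add(15 * time.Second)

retryErr := resource.Retry(5*time.Minute, func() *resource.RetryError {
    if _, _, err := client.ServiceDependencies.DisassociateServiceDependencies(&input); err != nil {
        switch {
        case isErrCode(err, 429):
            // Rate limited: keep retrying for the full window.
            return resource.RetryableError(err)
        case isErrCode(err, 404) && time.Now().Before(notFoundDeadline):
            // Likely a creation/deletion race: retry only briefly.
            return resource.RetryableError(err)
        default:
            // A real 404 (or anything else) fails fast.
            return resource.NonRetryableError(err)
        }
    }
    return nil
})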
Overall LGTM once merge conflicts are resolved -- can give the 👍 once fixed.
To @drastawi's feedback, there's some room for potential consolidation and adjustment, but I'd be fine with tackling that in a separate pass, possibly including moving values into the Config.
Over here, we're still seeing 429s with PD provider 3.2.0...
We are seeing the TF provider error out when trying to query service dependencies. In our opinion the provider shouldn't crash, and should instead keep retrying with exponential backoff. I've looked through the provider code and noticed that there are 5-minute delays between retries, which seems a bit excessive. We manage thousands of Terraform resources, and our plans often fail after an hour (largely due to retries on querying service dependencies).
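For reference, the exponential-backoff behaviour this comment asks for would look roughly like the sketch below. This is not the provider's current implementation (which wraps calls in resource.Retry with a fixed window); the helper name, attempt count, and delay values are assumptions.

// retryWithBackoff is an illustrative helper, not provider code: it retries
// op with exponentially growing waits, capped at maxDelay, instead of a
// fixed delay between attempts.
func retryWithBackoff(attempts int, baseDelay, maxDelay time.Duration, op func() error) error {
    delay := baseDelay
    var err error
    for i := 0; i < attempts; i++ {
        if err = op(); err == nil {
            return nil
        }
        time.Sleep(delay)
        delay *= 2
        if delay > maxDelay {
            delay = maxDelay
        }
    }
    return err
}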
Some users are still running into 429: Too Many Requests errors from the PagerDuty API, specifically on Data Sources. This change increases retry times on a few resources as well, and adds Retry to a few API calls in the Escalation Policy Resource and the Service Dependency Resource.
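For illustration, the pattern this change applies wraps an API call in resource.Retry and treats a 429 as retryable; a minimal sketch is below. readPagerDutyResource and d are placeholders for whichever client call and schema.ResourceData the data source uses, while resource.Retry, resource.RetryableError, resource.NonRetryableError, and isErrCode come from the SDK and the provider's existing helpers.

// Sketch: retry transient 429s from the PagerDuty API inside the retry
// window instead of failing the read immediately.
return resource.Retry(5*time.Minute, func() *resource.RetryError {
    found, err := readPagerDutyResource() // placeholder for the client call
    if err != nil {
        if isErrCode(err, 429) {
            // Too Many Requests: back off and try again.
            return resource.RetryableError(err)
        }
        return resource.NonRetryableError(err)
    }
    d.SetId(found.ID)
    return nil
})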