Increase Retry Time on Data Sources #454

Merged
3 commits merged into PagerDuty:master on Feb 10, 2022

Conversation

stmcallister (Contributor):

Some users are still running into 429: Too Many Requests errors from the PagerDuty API, specifically on Data Sources. This change increases the retry time on a few resources, and adds Retry around a few API calls in the Escalation Policy resource and the Service Dependency resource.

// listServiceRelationships by calling get dependencies using the serviceDependency.DependentService.ID
retryErr := resource.Retry(5*time.Minute, func() *resource.RetryError {
	if dependencies, _, err := client.ServiceDependencies.GetServiceDependenciesForType(dependency.DependentService.ID, dependency.DependentService.Type); err != nil {
		if isErrCode(err, 404) || isErrCode(err, 500) || isErrCode(err, 429) {
drastawi (Contributor) commented:

@stmcallister, would it be a good idea to centralize the logic of what counts as a retryable error? I'm not sure it's worth retrying on 400 at all, for example. I can see the benefit of increasing the retry to 5 min for 429 and maybe 503/504 errors, but it might substantially hurt the experience if there is a real 404 or 400.

stmcallister (Contributor, Author) replied:

@drastawi Thanks for the feedback! The tricky part is that in some race-condition situations we do want to retry on a 404. 🤔 Good point on centralizing the logic.
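
A minimal sketch of what such a centralized check might look like. The helper name isRetryableForRead and the exact status-code list are illustrative assumptions, not part of this PR; only isErrCode comes from the existing provider code.

// isRetryableForRead reports whether an API error is worth retrying when
// reading a resource: 429 and transient 5xx responses always are, and 404 is
// included only to cover the race condition where a freshly created object is
// briefly not visible yet.
func isRetryableForRead(err error) bool {
	for _, code := range []int{404, 429, 500, 503, 504} {
		if isErrCode(err, code) {
			return true
		}
	}
	return false
}

Call sites would then collapse to a single check, e.g. if isRetryableForRead(err) { return resource.RetryableError(err) }.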

if _, _, err = client.ServiceDependencies.DisassociateServiceDependencies(&input); err != nil {
-	if isErrCode(err, 404) {
+	if isErrCode(err, 404) || isErrCode(err, 429) {
		return resource.RetryableError(err)
drastawi (Contributor) commented:

@stmcallister I am a bit concerned that end users might have to wait 5 minutes to find out a resource already exists or has already been deleted manually, which are not uncommon situations.

One solution that seems fairly easy to implement at a glance would be to add a max timeout to the retryable error as an extra optional parameter:

Suggested change:
-	return resource.RetryableError(err)
+	if isErrCode(err, 429) {
+		return resource.RetryableError(err)
+	} else if isErrCode(err, 404) {
+		return resource.RetryableErrorWithMaxTimeout(err, 15 * time.Second)
+	}
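
Note that resource.RetryableErrorWithMaxTimeout does not exist in the Terraform plugin SDK; it is the reviewer's proposed helper. A rough sketch of the same idea using only the stock helpers, with the shorter 404 window tracked by the caller (the start variable and the 15-second cutoff are assumptions for illustration):

start := time.Now()
retryErr := resource.Retry(5*time.Minute, func() *resource.RetryError {
	if _, _, err := client.ServiceDependencies.DisassociateServiceDependencies(&input); err != nil {
		// Always retry rate limiting; retry 404 only within a short grace
		// window, since a persistent 404 usually means the dependency is gone.
		if isErrCode(err, 429) {
			return resource.RetryableError(err)
		}
		if isErrCode(err, 404) && time.Since(start) < 15*time.Second {
			return resource.RetryableError(err)
		}
		return resource.NonRetryableError(err)
	}
	return nil
})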

@dobs dobs (Contributor) left a comment:

Overall LGTM once merge conflicts are resolved -- can give the 👍 once fixed.

On @drastawi's feedback: there's some room for potential consolidation and adjustment, but I'd be fine with tackling that in a separate pass, possibly including moving values into the Config.

@stmcallister stmcallister merged commit a0d1acf into PagerDuty:master Feb 10, 2022
dnck commented on Mar 4, 2022:

Over here, we're still seeing 429s with PD provider 3.2.0...

viktor-f3 commented:

We are seeing the TF provider error out when trying to query service dependencies. In our opinion the provider shouldn't crash; it should keep retrying with exponential backoff. I've looked through the provider code and noticed that there are 5-minute delays between retries, which seems a bit excessive. We manage thousands of Terraform resources, and our plans often fail after an hour (largely due to retries on querying service dependencies).
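
For reference, the kind of backoff being asked for might look roughly like the sketch below. This is purely illustrative, not provider code; retryWithBackoff is a hypothetical helper, and only isErrCode comes from the provider.

// retryWithBackoff retries a call while it keeps failing with 429, sleeping
// for an exponentially growing interval between attempts instead of polling
// inside one long fixed retry window.
func retryWithBackoff(maxAttempts int, call func() error) error {
	backoff := time.Second
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = call(); err == nil || !isErrCode(err, 429) {
			return err
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}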
