Increase Retry Time on Data Sources #454
Conversation
// listServiceRelationships by calling get dependencies using the serviceDependency.DependentService.ID
retryErr := resource.Retry(5*time.Minute, func() *resource.RetryError {
    if dependencies, _, err := client.ServiceDependencies.GetServiceDependenciesForType(dependency.DependentService.ID, dependency.DependentService.Type); err != nil {
        if isErrCode(err, 404) || isErrCode(err, 500) || isErrCode(err, 429) {
@stmcallister, would it be a good idea to centralize the logic of what counts as a retryable error? I'm not sure it is worth retrying on a 400 at all, for example. I can see the benefit of increasing the retry to 5 minutes for 429 and maybe 503/504 errors, but it might substantially hurt the experience if there is a real 404 or 400.
@drastawi Thanks for the feedback! The tricky part is that in some race-condition situations we do want to retry on a 404 🤔 Good point on centralizing the logic.
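One way to centralize that decision, as discussed above, could be a small helper that classifies status codes in a single place. This is only a sketch, not code from the provider: isRetryableErrCode and its parameters are hypothetical, while isErrCode is the provider's existing status-code check shown in the diff.

// isRetryableErrCode is a hypothetical helper that keeps the decision of
// which status codes are worth retrying in one place, instead of repeating
// isErrCode(...) || isErrCode(...) chains at every call site.
func isRetryableErrCode(err error, retryOn404 bool) bool {
    // Rate limiting and transient server-side failures are always retried.
    for _, code := range []int{429, 500, 503, 504} {
        if isErrCode(err, code) {
            return true
        }
    }
    // 404 is only retried where the caller expects a race condition,
    // e.g. a dependency that was just created may not be readable yet.
    return retryOn404 && isErrCode(err, 404)
}

Call sites inside the retry closures would then collapse to a single isRetryableErrCode(err, true) check before returning resource.RetryableError(err).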
if _, _, err = client.ServiceDependencies.DisassociateServiceDependencies(&input); err != nil {
-   if isErrCode(err, 404) {
+   if isErrCode(err, 404) || isErrCode(err, 429) {
        return resource.RetryableError(err)
@stmcallister I am a bit concerned that end users might have to wait 5 minutes to find out that a resource already exists or has already been deleted manually, which is not that uncommon.
One solution that seems fairly easy to implement at a glance would be to add a max timeout to the retryable error as an extra optional parameter:
Suggested change:
-   return resource.RetryableError(err)
+   if isErrCode(err, 429) {
+       return resource.RetryableError(err)
+   } else if isErrCode(err, 404) {
+       return resource.RetryableErrorWithMaxTimeout(err, 15 * time.Second)
+   }
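Note that RetryableErrorWithMaxTimeout in the suggestion above is a proposed helper, not an existing SDK function. Assuming no such helper is available, a rough sketch of the same idea with the existing resource.Retry API could track a separate deadline for 404s inside the retry closure; notFoundDeadline and the wiring around it here are assumptions, while resource.Retry, resource.RetryableError, resource.NonRetryableError, and the provider's isErrCode and client call are taken from the diff.

// Sketch: give 404s a short budget while 429s use the full retry window.
notFoundDeadline := time.Now().Add(15 * time.Second)

retryErr := resource.Retry(5*time.Minute, func() *resource.RetryError {
    if _, _, err := client.ServiceDependencies.DisassociateServiceDependencies(&input); err != nil {
        switch {
        case isErrCode(err, 429):
            // Rate limited: keep retrying for the full window.
            return resource.RetryableError(err)
        case isErrCode(err, 404) && time.Now().Before(notFoundDeadline):
            // Likely a creation/deletion race: retry only briefly.
            return resource.RetryableError(err)
        default:
            // A real 404 (or anything else) fails fast.
            return resource.NonRetryableError(err)
        }
    }
    return nil
})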
Overall LGTM once merge conflicts are resolved -- can give the 👍 once fixed.
To @drastawi's feedback, there's some room for potential consolidation and adjustment, but I'd be fine with tackling that in a separate pass, possibly including moving values into the Config.
Over here, we're still seeing 429s with PD provider 3.2.0...
We are seeing the TF provider error out when trying to query service dependencies. In our opinion the provider shouldn't crash, and should instead keep retrying with exponential backoff. I've looked through the provider code and noticed that there are 5-minute delays between retries, which seems a bit excessive. We manage thousands of Terraform resources, and our plans often fail after an hour (largely due to retries on querying service dependencies).
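For reference, the exponential-backoff behaviour this comment asks for would look roughly like the sketch below. This is not the provider's current implementation (which wraps calls in resource.Retry with a fixed window); the helper name, attempt count, and delay values are assumptions.

// retryWithBackoff is an illustrative helper, not provider code: it retries
// op with exponentially growing waits, capped at maxDelay, instead of a
// fixed delay between attempts.
func retryWithBackoff(attempts int, baseDelay, maxDelay time.Duration, op func() error) error {
    delay := baseDelay
    var err error
    for i := 0; i < attempts; i++ {
        if err = op(); err == nil {
            return nil
        }
        time.Sleep(delay)
        delay *= 2
        if delay > maxDelay {
            delay = maxDelay
        }
    }
    return err
}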
Some users are still running into 429: Too Many Requests errors from the PagerDuty API, specifically on Data Sources. This change increases retry times on a few resources as well, and adds Retry to a few API calls in the Escalation Policy Resource and the Service Dependency Resource.
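For illustration, the pattern this change applies wraps an API call in resource.Retry and treats a 429 as retryable; a minimal sketch is below. readPagerDutyResource and d are placeholders for whichever client call and schema.ResourceData the data source uses, while resource.Retry, resource.RetryableError, resource.NonRetryableError, and isErrCode come from the SDK and the provider's existing helpers.

// Sketch: retry transient 429s from the PagerDuty API inside the retry
// window instead of failing the read immediately.
return resource.Retry(5*time.Minute, func() *resource.RetryError {
    found, err := readPagerDutyResource() // placeholder for the client call
    if err != nil {
        if isErrCode(err, 429) {
            // Too Many Requests: back off and try again.
            return resource.RetryableError(err)
        }
        return resource.NonRetryableError(err)
    }
    d.SetId(found.ID)
    return nil
})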