iam: separate backoffs, add jitter, and increase for conflicts #10786

Closed
wants to merge 1 commit

Conversation

stbenjam

Within our project, we can have multiple instances of Terraform modifying IAM policies, and in many cases these instances are kicked off at exactly the same time. We're running into errors where we exceed the backoff max (which in reality is 16 seconds, not 30). Also, Google recommends that backoffs contain jitter [1] to prevent clients from retrying all at once in synchronized waves.

This change (1) separates the three distinct backoffs used in the IAM policy read-modify-write cycle, (2) introduces jitter on each retry, and (3) increases the conflict max backoff to 5 minutes.

[1] https://cloud.google.com/iot/docs/how-tos/exponential-backoff
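
As an illustration of the approach described above, here is a minimal sketch of per-phase backoff state with full jitter and a larger cap for conflicts. The type and function names, the 1-second base, and the 30-second caps are assumptions for the example rather than the provider's actual identifiers or values; only the 5-minute conflict cap comes from the description above.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoff holds retry state for one phase of the read-modify-write cycle.
type backoff struct {
	max time.Duration // cap on the un-jittered delay
	cur time.Duration // current un-jittered delay; doubles after each use
}

func newBackoff(base, max time.Duration) *backoff {
	return &backoff{max: max, cur: base}
}

// next returns a "full jitter" delay drawn uniformly from [0, cur), then
// doubles cur up to the cap, following the exponential-backoff guidance in [1].
func (b *backoff) next() time.Duration {
	d := time.Duration(rand.Int63n(int64(b.cur)))
	b.cur *= 2
	if b.cur > b.max {
		b.cur = b.max
	}
	return d
}

func main() {
	// One backoff per distinct failure mode, so retries in one phase do not
	// consume another phase's budget. Only the 5-minute conflict cap comes
	// from the PR description; the other values are placeholders.
	readBackoff := newBackoff(time.Second, 30*time.Second)
	fetchBackoff := newBackoff(time.Second, 30*time.Second)
	conflictBackoff := newBackoff(time.Second, 5*time.Minute)
	_, _ = readBackoff, fetchBackoff

	// Example: the set-policy call keeps returning 409 Conflict; each retry
	// waits a random, growing amount instead of a fixed, synchronized delay.
	for attempt := 1; attempt <= 5; attempt++ {
		wait := conflictBackoff.next()
		fmt.Printf("attempt %d: conflict, would sleep %v before retrying\n", attempt, wait)
		// In real retry code this is where time.Sleep(wait) would go.
	}
}
```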

stbenjam (Author) commented Jan 6, 2022

@rileykarson Any chance you could have a look at this PR? 🙏

rileykarson requested review from slevenick and removed the request for rileykarson on January 7, 2022
rileykarson (Collaborator) commented:

@slevenick do you mind taking a look? I think you're more familiar with the IAM framework than I am.

slevenick (Collaborator) commented:

What symptoms are you seeing that you think jitter will fix?

My understanding is that Terraform already introduces small timing differences when dispatching the processes that handle individual resources, and everything within an individual resource is handled serially.

Bumping the timeout to longer than 16 seconds probably makes sense, but how are you hitting conflict errors ~5 times in a row?

stbenjam (Author) commented Jan 8, 2022

> What symptoms are you seeing that you think jitter will fix?

We tend to kick off a bunch of separate terraform applies at the same time in OpenShift CI (as many as 30, I think). We could certainly solve our problem at a higher abstraction layer if you'd prefer to leave the current code as is.

> Bumping the timeout to longer than 16 seconds probably makes sense, but how are you hitting conflict errors ~5 times in a row?

Even though 16s is ~5 retries, it's in a relatively short window when we have so many concurrently running applies.
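
For a rough sense of the window involved (assuming the delay doubles from a 1 s base up to the 16 s cap mentioned above, which may not match the provider's exact schedule), the cap is reached after about five attempts and the whole budget fits in roughly half a minute:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical schedule: double from 1s up to the 16s cap discussed above.
	const maxDelay = 16 * time.Second
	delay, total := time.Second, time.Duration(0)
	for attempt := 1; delay <= maxDelay; attempt++ {
		total += delay
		fmt.Printf("attempt %d: wait %v (cumulative %v)\n", attempt, delay, total)
		delay *= 2
	}
	// Prints five attempts (1s, 2s, 4s, 8s, 16s) totalling 31s, so ~30
	// simultaneous applies that keep conflicting can easily outlast the budget.
}
```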

slevenick (Collaborator) commented:

> What symptoms are you seeing that you think jitter will fix?
>
> We tend to kick off a bunch of separate terraform applies at the same time in OpenShift CI (as many as 30, I think). We could certainly solve our problem at a higher abstraction layer if you'd prefer to leave the current code as is.
>
> Bumping the timeout to longer than 16 seconds probably makes sense, but how are you hitting conflict errors ~5 times in a row?
>
> Even though 16s is ~5 retries, it's in a relatively short window when we have so many concurrently running applies.

If you're kicking the runs off as separate terraform applies, then adding jitter at this level shouldn't have an impact? Different runs of terraform apply are already running in separate processes, and if it's in CI I'd guess that adds extra jitter automatically as machines connect, start up, read current state, etc.

It seems like the issue comes down to conflicting changes being made during this process. Is there a reason these resources are managed in different Terraform configs and handled by different CI jobs at the same time? It feels like a solution could be to manage all similar resources in a single file, as we have implemented batching for IAM resources for just this sort of situation.

stbenjam closed this Apr 22, 2022
stbenjam deleted the fix-retries branch April 22, 2022 00:40
github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators May 22, 2022