iam: separate backoffs, add jitter, and increase for conflicts #10786

Closed
wants to merge 1 commit

Conversation

stbenjam

Within our project, we can have multiple instances of Terraform modifying IAM policies, and in many cases these instances are kicked off at exactly the same time. We're running into errors where we exceed the backoff max (which in reality is 16 seconds, not 30). Also, Google recommends that backoffs contain jitter [1] to prevent clients from retrying all at once in synchronized waves.

This change (1) separates the three distinct backoffs used in the IAM policy read-modify-write cycle, (2) introduces jitter on each retry, and (3) increases the conflict max backoff to 5 minutes.

[1] https://cloud.google.com/iot/docs/how-tos/exponential-backoff
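
As an illustration of the approach described above, here is a minimal sketch of per-phase backoff state with full jitter and a larger cap for conflicts. The type and function names, the 1-second base, and the 30-second caps are assumptions for the example rather than the provider's actual identifiers or values; only the 5-minute conflict cap comes from the description above.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoff holds retry state for one phase of the read-modify-write cycle.
type backoff struct {
	max time.Duration // cap on the un-jittered delay
	cur time.Duration // current un-jittered delay; doubles after each use
}

func newBackoff(base, max time.Duration) *backoff {
	return &backoff{max: max, cur: base}
}

// next returns a "full jitter" delay drawn uniformly from [0, cur), then
// doubles cur up to the cap, following the exponential-backoff guidance in [1].
func (b *backoff) next() time.Duration {
	d := time.Duration(rand.Int63n(int64(b.cur)))
	b.cur *= 2
	if b.cur > b.max {
		b.cur = b.max
	}
	return d
}

func main() {
	// One backoff per distinct failure mode, so retries in one phase do not
	// consume another phase's budget. Only the 5-minute conflict cap comes
	// from the PR description; the other values are placeholders.
	readBackoff := newBackoff(time.Second, 30*time.Second)
	fetchBackoff := newBackoff(time.Second, 30*time.Second)
	conflictBackoff := newBackoff(time.Second, 5*time.Minute)
	_, _ = readBackoff, fetchBackoff

	// Example: the set-policy call keeps returning 409 Conflict; each retry
	// waits a random, growing amount instead of a fixed, synchronized delay.
	for attempt := 1; attempt <= 5; attempt++ {
		wait := conflictBackoff.next()
		fmt.Printf("attempt %d: conflict, would sleep %v before retrying\n", attempt, wait)
		// In real retry code this is where time.Sleep(wait) would go.
	}
}
```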

stbenjam (Author) commented Jan 6, 2022

@rileykarson Any chance you could have a look at this PR? 🙏

rileykarson requested review from slevenick and removed the request for rileykarson on January 7, 2022
rileykarson (Collaborator) commented:

@slevenick do you mind taking a look? I think you're more familiar with the IAM framework than I am.

slevenick (Collaborator) commented:

What symptoms are you seeing that you think jitter will fix?

My understanding is that Terraform already introduces small timing differences when dispatching the processes that handle individual resources, and everything within an individual resource is handled serially.

Bumping the timeout to longer than 16 seconds probably makes sense, but how are you hitting conflict errors ~5 times in a row?

stbenjam (Author) commented Jan 8, 2022

> What symptoms are you seeing that you think jitter will fix?

We tend to kick off a bunch of separate terraform applies at the same time in OpenShift CI (as many as 30, I think). We could certainly solve our problem at a higher abstraction layer if you'd prefer to leave the current code as is.

> Bumping the timeout to longer than 16 seconds probably makes sense, but how are you hitting conflict errors ~5 times in a row?

Even though 16s is ~5 retries, it's in a relatively short window when we have so many concurrently running applies.
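
For a rough sense of the window involved (assuming the delay doubles from a 1 s base up to the 16 s cap mentioned above, which may not match the provider's exact schedule), the cap is reached after about five attempts and the whole budget fits in roughly half a minute:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical schedule: double from 1s up to the 16s cap discussed above.
	const maxDelay = 16 * time.Second
	delay, total := time.Second, time.Duration(0)
	for attempt := 1; delay <= maxDelay; attempt++ {
		total += delay
		fmt.Printf("attempt %d: wait %v (cumulative %v)\n", attempt, delay, total)
		delay *= 2
	}
	// Prints five attempts (1s, 2s, 4s, 8s, 16s) totalling 31s, so ~30
	// simultaneous applies that keep conflicting can easily outlast the budget.
}
```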

slevenick (Collaborator) commented:

> What symptoms are you seeing that you think jitter will fix?
>
> We tend to kick off a bunch of separate terraform applies at the same time in OpenShift CI (as many as 30, I think). We could certainly solve our problem at a higher abstraction layer if you'd prefer to leave the current code as is.
>
> Bumping the timeout to longer than 16 seconds probably makes sense, but how are you hitting conflict errors ~5 times in a row?
>
> Even though 16s is ~5 retries, it's in a relatively short window when we have so many concurrently running applies.

If you're kicking the runs off as separate terraform applies, then adding jitter at this level shouldn't have an impact? Different runs of terraform apply are already running in separate processes, and if it's in CI I'd guess that adds extra jitter automatically as machines connect, start up, read current state, etc.

It seems like the issue comes down to conflicting changes being made during this process. Is there a reason these resources are managed in different Terraform configs and handled by different CI jobs at the same time? It feels like a solution could be to manage all similar resources in a single file, as we have implemented batching for IAM resources for just this sort of situation.

stbenjam closed this Apr 22, 2022
stbenjam deleted the fix-retries branch April 22, 2022 00:40
github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators May 22, 2022