Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flaps: retry on certain failure modes, such as 504s #3601

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

alichay
Copy link
Contributor

@alichay alichay commented Jun 5, 2024

Change Summary

What and Why: All flaps operations will retry if the error is known to be transient. This should improve reliability - especially in situations like fly deploy where one hiccup can (currently) stop an entire deploy.

How: Adds a wrapper in flapsutil over *flaps.Client that implements retrying on certain failure modes, like 504. NewClientWithOptions now returns the generic FlapsClient so that we can return the wrapper type instead of a raw *flaps.Client.


Documentation

  • Fresh Produce
  • In superfly/docs, or asked for help from docs team
  • n/a

alichay added 11 commits June 5, 2024 15:08
These functions allow retrying machine operations like machine creation, though there's no logic yet for determining which errors are *actually* transient, so behavior is still the same
backoff modifies the Backoff state object while using it, so having one globally makes me uncomfortable
except for ephemeral machines - i'm cooking up a different strategy to do this so this is more of a check-in than anything
Since Flaps has been abstracted to an Interface in flyctl, we can add a retry layer to it by just wrapping the existing client with *few* changes made elsewhere

This is just a drop-in reliability improvement
@rugwirobaker
Copy link
Contributor

Retrying POST/PATCH my yield unexpected results cause subsequent requests might to go to a different flaps and flyd among other issues. It's basically not Idempotent. https://flyio.discourse.team/t/flaps-what-status-codes-can-we-retry/5060/2

@alichay
Copy link
Contributor Author

alichay commented Jun 6, 2024

Retrying POST/PATCH my yield unexpected results cause subsequent requests might to go to a different flaps and flyd among other issues. It's basically not Idempotent. https://flyio.discourse.team/t/flaps-what-status-codes-can-we-retry/5060/2

ah. that's a huge bummer. I'm going to try to see if I can get this working for GET requests then, and maybe afterwards come up with a different strategy for fly deploy machine creation.

I wonder if some operations, like SetMetadata, could be retried anyway? it's setting a named value, so even if it were called multiple times there shouldn't be any side effects...

Going to have to follow up to see whether all of these are safe to retry, but I feel like they *should* be?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants