Better recovery #3709

Merged
billyb2 merged 60 commits into master from better_recovery on Jul 22, 2024

Conversation

billyb2
Member

@billyb2 billyb2 commented Jul 5, 2024

Change Summary

What and Why:
This PR changes how machine updates work. Instead of failing whenever one machine isn't able to update, we try again, making sure not to redo any work that we'd already completed. We do this by keeping track of 'app state', which is basically a list of all the machines we have in our app, along with each machine's state, config, and mounts.
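
As a rough mental model, the 'app state' can be pictured as something like the sketch below. The type and field names here are illustrative only, not necessarily the ones used in plan.go:

    // appState is an illustrative snapshot of everything flyctl manages for an
    // app: one entry per machine, with the state, config, and mounts we expect
    // that machine to have.
    type appState struct {
        Machines []machineSnapshot
    }

    type machineSnapshot struct {
        ID     string
        State  string   // e.g. "started" or "stopped"
        Config any      // the machine config we last applied (placeholder type)
        Mounts []string // volumes mounted into the machine
    }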

Sometimes, certain errors aren't worth retrying. For example, if we aren't able to acquire a lease on your machine because it's already held by someone else, retrying more likely than not won't do any good. In that case, we return an 'unrecoverable error' and choose to completely fail the deploy. In certain cases we could try rolling back to how we were (for example, if health checks fail we could roll back), though I'm saving that implementation for a future PR.
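
One way to picture that distinction (a sketch with made-up names, not necessarily the exact code) is a small error wrapper that the retry logic can check before deciding whether another attempt is worthwhile:

    // unrecoverableError marks failures that a retry won't fix, e.g. a machine
    // lease that's already held by someone else. (Illustrative name.)
    type unrecoverableError struct{ err error }

    func (e unrecoverableError) Error() string { return "unrecoverable: " + e.err.Error() }
    func (e unrecoverableError) Unwrap() error { return e.err }

    // isUnrecoverable reports whether err, or anything it wraps, was marked
    // unrecoverable. Uses errors.As from the standard library.
    func isUnrecoverable(err error) bool {
        var u unrecoverableError
        return errors.As(err, &u)
    }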

How:
We will keep trying to complete the deploy, over a few attempts, until either:
a. the deploy completes successfully, meaning we transitioned from the old app state to the new app state
b. we've exhausted a certain number of attempts to complete a deploy without success
c. we encountered an unrecoverable error, meaning that flyctl doesn't think it's worth attempting a retry

In the first case, flyctl becomes much more resilient to intermittent platform errors, which happen pretty often. In the second case, it's likely that there's something wrong with either the user's environment or the platform itself. If it's the former, then @rugwirobaker's work to move orchestration logic into a new fly machine will help with those cases in the future. If it's the latter, we're working on setting up alerting to learn about these cases sooner. In the third case, the goal is to eventually add suggestions that help the user recover from these sorts of issues.
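
Putting the three outcomes together, the attempt loop looks roughly like the sketch below. The function and type names are the hypothetical ones from the snippets above; the real logic lives in internal/command/deploy/plan.go:

    // updateMachinesWithRetries keeps driving the app from oldState toward
    // newState until it succeeds, hits an unrecoverable error, or runs out of
    // attempts. maxAttempts would come from the deploy-retries flag.
    func updateMachinesWithRetries(ctx context.Context, oldState, newState *appState, maxAttempts int) error {
        var lastErr error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            err := updateMachines(ctx, oldState, newState) // placeholder for the real update step
            if err == nil {
                return nil // (a) we transitioned to the new app state
            }
            if isUnrecoverable(err) {
                return err // (c) retrying won't help, so fail the deploy
            }
            lastErr = err // recoverable: try again, skipping work already done
        }
        return fmt.Errorf("deploy failed after %d attempts: %w", maxAttempts, lastErr) // (b)
    }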

The bulk of the code is in internal/command/deploy/plan.go

Related to:
https://flyio.discourse.team/t/deployments-roadmap-discussion/6326
https://flyio.discourse.team/t/deployments-roadmap-redux/6451
https://flyio.discourse.team/t/deployment-recoverability/6441

Documentation

  • Fresh Produce
  • In superfly/docs, or asked for help from docs team
  • n/a

@billyb2 billyb2 force-pushed the better_recovery branch 3 times, most recently from 85ff796 to 59f8c2a on July 11, 2024 at 19:26
@billyb2 billyb2 marked this pull request as ready for review July 11, 2024 19:26
@billyb2 billyb2 force-pushed the better_recovery branch 2 times, most recently from 5409794 to 9a38fa1 on July 14, 2024 at 17:32
@billyb2
Member Author

billyb2 commented Jul 15, 2024

^ btw I'm working on adding unit testing to plan.go now; I'll make sure to have it up and reviewed before I merge

@billyb2 billyb2 force-pushed the better_recovery branch 5 times, most recently from f774d3e to 37a16f3 on July 16, 2024 at 23:21
Member

@dangra dangra left a comment


@benbjohnson @michaeldwan this PR touches the hot path of Fly deployments; a small screw-up here can make a big wave. It deserves as many eyes as it can get on it 👀 🙏

@benbjohnson benbjohnson self-requested a review July 17, 2024 15:57
The goal of this is to make recovering from individual machine update failures easier, so that the entire deployment can succeed. The really cool part about these changes is that they didn't actually take much change to how machine updates work: all I really do is call updateMachines, but with our original state instead of the state that we initially wanted to go to.
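
In other words, recovery and rollback reuse the same machinery as the forward update; conceptually it amounts to something like this (a sketch, reusing the hypothetical names from the snippets above):

    // rollback drives the machines back toward the state we started from by
    // calling the same update path with the states swapped. Sketch only.
    func rollback(ctx context.Context, originalState, currentState *appState) error {
        return updateMachines(ctx, currentState, originalState)
    }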

I made a bug fix to lease clearing, since in some edge cases we weren't correctly clearing them.

I made a bug fix to machine waits, since we were sometimes causing an infinite loop by not giving the Wait function time to set the waitErr, whoops.
This was dumb to not do before, since obviously the state wouldn't be current
We should only attempt the rollback functionality when we initially try
to update machines, not on every rollback after that obviously
We need to avoid deleting unmanaged machines
I wasn't testing that path correctly. Also, I added a quick optimization to waitForMachineState to avoid an unnecessary API call (just checking the machine state right then and there).
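
The optimization mentioned here is just a fast path: check the state we already have in hand before going back to the API. A sketch of the idea, with made-up helper names:

    // waitForMachineState (sketch) returns immediately if the machine is
    // already in the desired state, instead of issuing another API call and
    // polling for a transition that has already happened.
    func waitForMachineState(ctx context.Context, m *machineSnapshot, desired string) error {
        if m.State == desired {
            return nil // fast path: no API call needed
        }
        return pollMachineState(ctx, m.ID, desired) // placeholder for the real polling call
    }
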
I forgot that lm isn't a pointer, so we need to check entry.leasableMachine for newly created machines.
This covers most of the major functions and the major places there could
be issues
Also make sure to print to stderr.
Also add some tests for updateOrCreateMachine.
md.warnAboutListenAddress requires this function.
Also refresh the lease in the background
We can't start machines if we have a lease acquired.
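
On the background lease refresh mentioned above: refreshing keeps the lease from expiring while a long update is still in progress. A sketch of the general pattern (the function name, signature, and interval are all assumptions):

    // keepLeaseAlive periodically refreshes a machine lease until ctx is
    // cancelled, so a long-running update doesn't lose the lease midway.
    func keepLeaseAlive(ctx context.Context, machineID, nonce string, every time.Duration) {
        ticker := time.NewTicker(every)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                // refreshLease stands in for the real API call that extends
                // the lease identified by nonce.
                if err := refreshLease(ctx, machineID, nonce); err != nil {
                    return // the sketch gives up quietly; real code would log
                }
            }
        }
    }
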
I also added back the original deployment code, and we use that if deploy-retries is set to 0. That way, we can roll this out more slowly without risking breaking user apps if there's some terrible bug. Users can still set deploy-retries to whatever value they'd like, however.
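
The rollout switch described here is a simple branch on the flag value, roughly like this (a sketch with assumed names; the deploy-retries flag itself is the one mentioned above):

    // deploy (sketch) picks between the legacy single-pass deploy and the new
    // retrying path based on the deploy-retries value.
    func deploy(ctx context.Context, oldState, newState *appState, retries int) error {
        if retries == 0 {
            // Original behavior: a single attempt, failing the deploy on the first error.
            return updateMachines(ctx, oldState, newState)
        }
        return updateMachinesWithRetries(ctx, oldState, newState, retries)
    }
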
@billyb2 billyb2 merged commit 1bdffec into master Jul 22, 2024
34 checks passed
@billyb2 billyb2 deleted the better_recovery branch July 22, 2024 18:52
Comment on lines +322 to +324
if machine.LeaseNonce == "" {
sl.LogStatus(statuslogger.StatusRunning, fmt.Sprintf("Waiting for job %s", machine.ID))
}
Member


Do you remember why we show Waiting for job in this case? It seems like we should still say Acquiring lease for if the machine isn't currently leased, right?

Member Author


That's a great question; this seems like a bug. I feel like we should check if LeaseNonce != "", and if that's the case, just return (since we already have an acquired lease).

Member


The lease nonce could be set if someone else has it too though, in which case we still want to wait. I'm going to remove this LogStatus though.
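
To illustrate the distinction being made in this thread (purely a sketch, not the merged code): a non-empty LeaseNonce only tells us that someone holds a lease, and what to do next depends on whether that someone is us.

    // leaseAction says what the deploy should do for a machine, given the
    // lease nonce currently set on it and the nonce we hold (empty if none).
    func leaseAction(machineNonce, ourNonce string) string {
        switch {
        case machineNonce == "":
            return "acquire" // no lease held; go acquire one
        case machineNonce == ourNonce:
            return "proceed" // we already hold the lease
        default:
            return "wait" // someone else holds it, so keep waiting
        }
    }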
