Better recovery #3709

Merged
billyb2 merged 60 commits into master from better_recovery on Jul 22, 2024

Conversation

billyb2
Member

@billyb2 billyb2 commented Jul 5, 2024

Change Summary

What and Why:
This PR changes how machine updates work. Instead of failing whenever one machine isn't able to update, we try again, making sure not to redo any work that we'd already completed. We do this by keeping track of 'app state', which is basically a list of all the machines we have in our app, along with each machine's state, config, and mounts.
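
As a rough mental model, the 'app state' can be pictured as something like the sketch below. The type and field names here are illustrative only, not necessarily the ones used in plan.go:

    // appState is an illustrative snapshot of everything flyctl manages for an
    // app: one entry per machine, with the state, config, and mounts we expect
    // that machine to have.
    type appState struct {
        Machines []machineSnapshot
    }

    type machineSnapshot struct {
        ID     string
        State  string   // e.g. "started" or "stopped"
        Config any      // the machine config we last applied (placeholder type)
        Mounts []string // volumes mounted into the machine
    }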

Sometimes, certain errors aren't worth retrying. For example, if we aren't able to acquire a lease on your machine because it's already held by someone else, retrying more likely than not won't do any good. In that case, we return an 'unrecoverable error' and choose to completely fail the deploy. In certain cases we could try rolling back to how we were (for example, if health checks fail we could roll back), though I'm saving that implementation for a future PR.
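
One way to picture that distinction (a sketch with made-up names, not necessarily the exact code) is a small error wrapper that the retry logic can check before deciding whether another attempt is worthwhile:

    // unrecoverableError marks failures that a retry won't fix, e.g. a machine
    // lease that's already held by someone else. (Illustrative name.)
    type unrecoverableError struct{ err error }

    func (e unrecoverableError) Error() string { return "unrecoverable: " + e.err.Error() }
    func (e unrecoverableError) Unwrap() error { return e.err }

    // isUnrecoverable reports whether err, or anything it wraps, was marked
    // unrecoverable. Uses errors.As from the standard library.
    func isUnrecoverable(err error) bool {
        var u unrecoverableError
        return errors.As(err, &u)
    }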

How:
We will keep trying to complete the deploy, over a few attempts, until either:
a. the deploy completes successfully, meaning we transitioned from the old app state to the new app state
b. we've exhausted a certain number of attempts to complete a deploy without success
c. we encountered an unrecoverable error, meaning that flyctl doesn't think it's worth attempting a retry

In the first case, flyctl becomes much more resilient to intermittent platform errors, which happen pretty often. In the second case, it's likely that there's something wrong with either the user's environment or the platform itself. If it's the former, then @rugwirobaker's work to move orchestration logic into a new fly machine will help with those cases in the future. If it's the latter, we're working on setting up alerting to learn about these cases sooner. In the third case, the goal is to eventually add suggestions that help the user recover from these sorts of issues.
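
Putting the three outcomes together, the attempt loop looks roughly like the sketch below. The function and type names are the hypothetical ones from the snippets above; the real logic lives in internal/command/deploy/plan.go:

    // updateMachinesWithRetries keeps driving the app from oldState toward
    // newState until it succeeds, hits an unrecoverable error, or runs out of
    // attempts. maxAttempts would come from the deploy-retries flag.
    func updateMachinesWithRetries(ctx context.Context, oldState, newState *appState, maxAttempts int) error {
        var lastErr error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            err := updateMachines(ctx, oldState, newState) // placeholder for the real update step
            if err == nil {
                return nil // (a) we transitioned to the new app state
            }
            if isUnrecoverable(err) {
                return err // (c) retrying won't help, so fail the deploy
            }
            lastErr = err // recoverable: try again, skipping work already done
        }
        return fmt.Errorf("deploy failed after %d attempts: %w", maxAttempts, lastErr) // (b)
    }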

The bulk of the code is in internal/command/deploy/plan.go

Related to:
https://flyio.discourse.team/t/deployments-roadmap-discussion/6326
https://flyio.discourse.team/t/deployments-roadmap-redux/6451
https://flyio.discourse.team/t/deployment-recoverability/6441

Documentation

  • Fresh Produce
  • In superfly/docs, or asked for help from docs team
  • n/a

@billyb2 billyb2 force-pushed the better_recovery branch 3 times, most recently from 85ff796 to 59f8c2a on July 11, 2024 at 19:26
@billyb2 billyb2 marked this pull request as ready for review July 11, 2024 19:26
@billyb2 billyb2 force-pushed the better_recovery branch 2 times, most recently from 5409794 to 9a38fa1 on July 14, 2024 at 17:32
@billyb2
Member Author

billyb2 commented Jul 15, 2024

^ btw I'm working on adding unit testing to plan.go now; I'll make sure to have it up and reviewed before I merge

@billyb2 billyb2 force-pushed the better_recovery branch 5 times, most recently from f774d3e to 37a16f3 on July 16, 2024 at 23:21
Member

@dangra dangra left a comment


@benbjohnson @michaeldwan this PR touches the hot path of Fly deployments; a small screw-up here can make a big wave. It deserves as many eyes as it can get on it 👀 🙏

@benbjohnson benbjohnson self-requested a review July 17, 2024 15:57
The goal of this is to make recovering from individual machine update failures easier, so that the entire deployment can succeed. The really cool part about these changes is that they didn't actually take much change to how machine updates work: all I really do is call updateMachines, but with our original state instead of the state that we initially wanted to go to.
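
In other words, recovery and rollback reuse the same machinery as the forward update; conceptually it amounts to something like this (a sketch, reusing the hypothetical names from the snippets above):

    // rollback drives the machines back toward the state we started from by
    // calling the same update path with the states swapped. Sketch only.
    func rollback(ctx context.Context, originalState, currentState *appState) error {
        return updateMachines(ctx, currentState, originalState)
    }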

I made a bug fix to lease clearing, since in some edge cases we weren't correctly clearing them.

I made a bug fix to machine waits, since we were sometimes causing an infinite loop by not giving the Wait function time to set the waitErr, whoops.
This was dumb to not do before, since obviously the state wouldn't be current
We should only attempt the rollback functionality when we initially try
to update machines, not on every rollback after that obviously
We need to avoid deleting unmanaged machines
I wasn't testing that path correctly. Also, I added a quick optimization to waitForMachineState to avoid an unnecessary API call (just checking the machine state right then and there).
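
The optimization mentioned here is just a fast path: check the state we already have in hand before going back to the API. A sketch of the idea, with made-up helper names:

    // waitForMachineState (sketch) returns immediately if the machine is
    // already in the desired state, instead of issuing another API call and
    // polling for a transition that has already happened.
    func waitForMachineState(ctx context.Context, m *machineSnapshot, desired string) error {
        if m.State == desired {
            return nil // fast path: no API call needed
        }
        return pollMachineState(ctx, m.ID, desired) // placeholder for the real polling call
    }
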
I forgot that lm isn't a pointer, so we need to check entry.leasableMachine for newly created machines.
This covers most of the major functions and the major places there could
be issues
Also make sure to print to stderr.
Also add some tests for updateOrCreateMachine.
md.warnAboutListenAddress requires this function.
Also refresh the lease in the background
We can't start machines if we have a lease acquired.
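
On the background lease refresh mentioned above: refreshing keeps the lease from expiring while a long update is still in progress. A sketch of the general pattern (the function name, signature, and interval are all assumptions):

    // keepLeaseAlive periodically refreshes a machine lease until ctx is
    // cancelled, so a long-running update doesn't lose the lease midway.
    func keepLeaseAlive(ctx context.Context, machineID, nonce string, every time.Duration) {
        ticker := time.NewTicker(every)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                // refreshLease stands in for the real API call that extends
                // the lease identified by nonce.
                if err := refreshLease(ctx, machineID, nonce); err != nil {
                    return // the sketch gives up quietly; real code would log
                }
            }
        }
    }
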
I also added back the original deployment code, and we use that if deploy-retries is set to 0. That way, we can roll this out more slowly without risking breaking user apps if there's some terrible bug. Users can still set deploy-retries to whatever value they'd like, however.
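
The rollout switch described here is a simple branch on the flag value, roughly like this (a sketch with assumed names; the deploy-retries flag itself is the one mentioned above):

    // deploy (sketch) picks between the legacy single-pass deploy and the new
    // retrying path based on the deploy-retries value.
    func deploy(ctx context.Context, oldState, newState *appState, retries int) error {
        if retries == 0 {
            // Original behavior: a single attempt, failing the deploy on the first error.
            return updateMachines(ctx, oldState, newState)
        }
        return updateMachinesWithRetries(ctx, oldState, newState, retries)
    }
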
@billyb2 billyb2 merged commit 1bdffec into master Jul 22, 2024
34 checks passed
@billyb2 billyb2 deleted the better_recovery branch July 22, 2024 18:52
Comment on lines +322 to +324
if machine.LeaseNonce == "" {
sl.LogStatus(statuslogger.StatusRunning, fmt.Sprintf("Waiting for job %s", machine.ID))
}
Member


Do you remember why we show Waiting for job in this case? It seems like we should still say Acquiring lease for if the machine isn't currently leased, right?

Member Author


That's a great question; this seems like a bug. I feel like we should check if LeaseNonce != "", and if that's the case, just return (since we already have an acquired lease).

Member


The lease nonce could be set if someone else has it too though, in which case we still want to wait. I'm going to remove this LogStatus though.
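
To illustrate the distinction being made in this thread (purely a sketch, not the merged code): a non-empty LeaseNonce only tells us that someone holds a lease, and what to do next depends on whether that someone is us.

    // leaseAction says what the deploy should do for a machine, given the
    // lease nonce currently set on it and the nonce we hold (empty if none).
    func leaseAction(machineNonce, ourNonce string) string {
        switch {
        case machineNonce == "":
            return "acquire" // no lease held; go acquire one
        case machineNonce == ourNonce:
            return "proceed" // we already hold the lease
        default:
            return "wait" // someone else holds it, so keep waiting
        }
    }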
