Handling of committed inconsistent/corrupt state #8097
Hey @notnoop, I'm seeing similar errors after upgrading from Nomad 0.11.3 to 0.12.0. Shortly after upgrading, we started seeing sporadic errors. The jobs were already registered, and no changes were made to them before or after the cluster upgrade. Eventually retrying
Dropping some additional information/context around when I was seeing this issue. We started seeing the errors after a restart. In addition, we saw a "ghost allocation" that did not get stopped by any new deployments since the restart and had to be stopped by hand.
Thank you very much for reporting more data. I'll need to dig into this further and will follow up with some clarifying questions. I'm very surprised that the error occurred for a non-parameterized/periodic job!
Nomad's FSM handling is strict in places: it ensures that certain invariants always hold and fails early when it notices an inconsistency or invalid state.
While the intention is good, the state can still become corrupt due to random bugs, and that strictness then makes recovery hard.
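To illustrate the trade-off, here is a minimal Go sketch contrasting a strict delete (fail if the invariant "the job exists" doesn't hold) with an idempotent one. The `stateStore` type and method names are hypothetical stand-ins for illustration, not Nomad's actual state-store API.

```go
package main

import (
	"errors"
	"fmt"
)

// stateStore is a hypothetical in-memory stand-in for a Raft-backed state store.
type stateStore struct {
	jobs map[string]bool
}

var errNotFound = errors.New("job not found")

// strictDeleteJob mirrors the strict style: deleting a missing job is a hard
// error. If corrupt state is already committed, replaying this log entry
// fails on every apply and the cluster cannot make progress.
func (s *stateStore) strictDeleteJob(id string) error {
	if !s.jobs[id] {
		return fmt.Errorf("deleting job %q: %w", id, errNotFound)
	}
	delete(s.jobs, id)
	return nil
}

// lenientDeleteJob is idempotent: deleting an already-deleted job is a no-op,
// so an apply against inconsistent state still converges to the intended result.
func (s *stateStore) lenientDeleteJob(id string) error {
	delete(s.jobs, id)
	return nil
}

func main() {
	s := &stateStore{jobs: map[string]bool{"web": true}}
	fmt.Println(s.strictDeleteJob("web") == nil) // first delete succeeds
	fmt.Println(s.strictDeleteJob("web") == nil) // second delete is a hard error
	fmt.Println(s.lenientDeleteJob("web") == nil) // idempotent delete never fails
}
```

The lenient variant trades an early error signal for the ability to recover once bad state has already been committed.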
We studied a cluster running 0.8 that was upgraded to 0.10. The cluster ended up with some corrupt state, possibly due to #4299 and job summaries being out of sync.
This had cascading effects in a few places:
This was reported as well in #5939 .
In both of these cases, strict enforcement of invariants exacerbated the situation and made cluster recovery harder. We should consider automated recovery processes (e.g. if a job summary is invalid, recompute it; deletion should be idempotent, and deleting an already-deleted job shouldn't result in an error).
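The "recompute an invalid summary" idea can be sketched as follows: instead of rejecting a log entry because the stored summary disagrees with reality, rebuild the summary from the allocations themselves. The `alloc` and `summary` types below are hypothetical simplifications of Nomad's allocation and `JobSummary` structures, used only to show the shape of the recovery path.

```go
package main

import "fmt"

// alloc is a hypothetical, minimal allocation record.
type alloc struct {
	jobID  string
	status string // e.g. "running", "failed"
}

// summary is a hypothetical stand-in for a per-job summary.
type summary struct {
	running, failed int
}

// recomputeSummary rebuilds a job's summary from the source-of-truth
// allocations. This is the self-healing alternative to failing the FSM
// apply when the stored summary has drifted out of sync.
func recomputeSummary(jobID string, allocs []alloc) summary {
	var s summary
	for _, a := range allocs {
		if a.jobID != jobID {
			continue
		}
		switch a.status {
		case "running":
			s.running++
		case "failed":
			s.failed++
		}
	}
	return s
}

func main() {
	allocs := []alloc{
		{"web", "running"},
		{"web", "failed"},
		{"api", "running"},
	}
	fmt.Println(recomputeSummary("web", allocs)) // counts only "web" allocations
}
```

Because the summary is derived data, recomputing it is always safe; the cost is a scan of the job's allocations rather than a wedged cluster.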
In the upgrade scenario above, it's unclear to me how the invalid state came to be. My guess is that it was due to bugs in 0.8 (like the ones linked above) but the upgrade to 0.10 exacerbated the situation.
We should audit the FSM/planner checks and ensure that we can recover once invalid state has already been committed to the cluster.