Handling of committed inconsistent/corrupt state #8097

Open
notnoop opened this issue Jun 2, 2020 · 4 comments

Comments

@notnoop
Contributor

notnoop commented Jun 2, 2020

Nomad's FSM handling of log entries is sometimes strict: it insists that certain invariants always hold and fails early if it notices inconsistencies or invalid state.

While this is well-intentioned, the state does occasionally become corrupt due to bugs, and the strictness then makes recovery hard.
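
To make the failure mode concrete, here is a minimal, hypothetical sketch (the types and handler are stand-ins, not Nomad's actual FSM code) of a strict apply path: once the stored job summary has drifted, the invariant check rejects every subsequent log entry touching that job, which is roughly the shape of the errors shown below.

```go
// Hypothetical sketch only, not Nomad's code: a strict FSM apply handler
// that refuses to apply a log entry when a stored invariant no longer holds.
package main

import (
	"errors"
	"fmt"
)

// jobSummary is a stand-in for the persisted per-job summary.
type jobSummary struct {
	TaskGroups map[string]int // task group name -> allocation count
}

// applyAllocUpdate enforces the invariant that every task group referenced
// by an update already exists in the summary. Once the summary is out of
// sync, every later update to the job fails here and the entry is rejected.
func applyAllocUpdate(summary *jobSummary, taskGroup string) error {
	if _, ok := summary.TaskGroups[taskGroup]; !ok {
		return errors.New("unable to find task group in the job summary: " + taskGroup)
	}
	summary.TaskGroups[taskGroup]++
	return nil
}

func main() {
	// The stored summary has drifted and lost the "app" task group.
	corrupt := &jobSummary{TaskGroups: map[string]int{}}
	fmt.Println(applyAllocUpdate(corrupt, "app")) // error: blocks the log entry
}
```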

We studied a cluster running 0.8 that was upgraded to 0.10. The cluster ended up with some corrupt state, possibly due to #4299, with job summaries being out of sync.

This had cascading effects in a few places:

  • The job summary being out of sync meant the scheduler could not process updates to that job; failure lines included something like the following:
2020-05-20T23:33:18.760Z [ERROR] nomad.fsm: ApplyPlan failed: error="error updating job summary: unable to find task group in the job summary: xxx"
2020-05-20T23:33:18.020Z [ERROR] worker: failed to submit plan for evaluation: eval_id=49941cf9-2126-f25b-f718-e6d7ad2db28e error="rpc error: error updating job summary: unable to find task group in the job summary: xxx"
2020-05-20T23:32:34.300Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: error updating job summary: unable to find task group in the job summary: xxx"

This was also reported in #5939.

  • If a batch job deregister included a job that had already been removed, the deregistration of that job would fail and the entire batch would fail to be removed. This happened in this cluster with the following log lines:
2020-05-21T15:21:59.197Z [ERROR] nomad.fsm: deregistering job failed: job="<ns: "default", id: "REDACTED/periodic-1590018900">" error="job not found"
2020-05-21T15:21:59.198Z [ERROR] nomad.fsm: DeleteJob failed: error="job not found"
2020-05-21T15:21:59.062Z [ERROR] nomad.fsm: DeleteDeployment failed: error="deployment not found"

In both of these cases, strict enforcement of invariants exacerbated the situation and made cluster recovery harder. We could consider automated recovery processes (e.g. if a job summary is invalid, recompute it; deletion should be idempotent, so deleting an already-deleted job shouldn't result in an error).
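
As an illustration of the idempotent-deletion idea, here is a minimal sketch with a hypothetical in-memory store standing in for the state store (again, not Nomad's actual FSM code): a "job not found" error during a batch deregister is treated as already-done instead of failing the whole batch.

```go
// Sketch of idempotent job deletion with hypothetical types; not Nomad's code.
package main

import (
	"errors"
	"fmt"
)

// errJobNotFound stands in for the state store's "job not found" error.
var errJobNotFound = errors.New("job not found")

// store is a toy stand-in for the state store, keyed by "namespace/job ID".
type store map[string]bool

func (s store) deleteJob(ns, id string) error {
	key := ns + "/" + id
	if !s[key] {
		return errJobNotFound
	}
	delete(s, key)
	return nil
}

// applyBatchDeregister deletes every job in the batch, treating an
// already-deleted job as a no-op instead of aborting the whole batch.
func applyBatchDeregister(s store, jobs [][2]string) error {
	for _, j := range jobs {
		if err := s.deleteJob(j[0], j[1]); err != nil {
			if errors.Is(err, errJobNotFound) {
				continue // already gone: the desired end state is reached
			}
			return fmt.Errorf("deregistering job %q failed: %w", j[1], err)
		}
	}
	return nil
}

func main() {
	s := store{"default/alive": true}
	jobs := [][2]string{{"default", "alive"}, {"default", "already-deleted"}}
	fmt.Println(applyBatchDeregister(s, jobs)) // <nil>: the batch still succeeds
}
```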

In the upgrade scenario above, it's unclear to me how the invalid state came to be. My guess is that it was due to bugs in 0.8 (like the ones linked above) but the upgrade to 0.10 exacerbated the situation.

We should scan the FSM/planner checks and ensure that we can recover once an invalid state has already been committed to the cluster.
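
For the job-summary case, here is a simplified sketch of the recompute-on-corruption idea, again with hypothetical types rather than Nomad's real state-store structures: the per-task-group summary is rebuilt from the job's allocations instead of erroring out when a task group is missing.

```go
// Sketch of rebuilding a job summary from allocations; hypothetical types,
// not Nomad's state-store code.
package main

import "fmt"

type alloc struct {
	TaskGroup    string
	ClientStatus string // e.g. "running", "failed", "complete", "lost"
}

type taskGroupSummary struct {
	Running, Failed, Complete, Lost int
}

// recomputeSummary rebuilds the per-task-group counts from scratch so a
// stale or corrupt summary can be repaired instead of blocking the scheduler.
func recomputeSummary(allocs []alloc) map[string]taskGroupSummary {
	summary := make(map[string]taskGroupSummary)
	for _, a := range allocs {
		tg := summary[a.TaskGroup] // zero value if the group was missing
		switch a.ClientStatus {
		case "running":
			tg.Running++
		case "failed":
			tg.Failed++
		case "complete":
			tg.Complete++
		case "lost":
			tg.Lost++
		}
		summary[a.TaskGroup] = tg
	}
	return summary
}

func main() {
	allocs := []alloc{
		{TaskGroup: "app", ClientStatus: "running"},
		{TaskGroup: "app", ClientStatus: "complete"},
	}
	fmt.Printf("%+v\n", recomputeSummary(allocs)) // map[app:{Running:1 Failed:0 Complete:1 Lost:0}]
}
```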

@evandam

evandam commented Jul 13, 2020

Hey @notnoop, I'm seeing similar errors after upgrading from Nomad 0.11.3 to 0.12.0.

Shortly after upgrading, we started seeing sporadic error updating job summary: unable to find task group in the job summary: job errors when planning periodic jobs, but with no meaningful logs found on the Nomad servers at the time.

The jobs were already registered and no changes were made to them before or after the cluster upgrade.

Eventually retrying nomad job plan ... works, but the error still seems to happen periodically.

@IrinaBron

IrinaBron commented Jul 14, 2020

Hey @notnoop, I'm running into the same issue as @evandam.

@evandam

evandam commented Jul 15, 2020

Dropping some additional information/context around when I was seeing this issue.

We started seeing the error updating job summary errors with job plans after restarting the Nomad service in our environment, before we actually upgraded from Nomad 0.11.3 to 0.12.0.

In addition, we saw a "ghost allocation" that was not stopped by any new deployments since the restart and had to be stopped by hand.

nomad job status prod-platform-core
ID            = prod-platform-core
Name          = prod-platform-core
Submit Date   = 2020-07-15T07:47:30-07:00
Type          = service
Priority      = 50
Datacenters   = prod-usw2-prod1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
app         0       5         13       0       1061      16

Latest Deployment
ID          = 807dcd4f
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
app         true         true      12       5         12      12       0          2020-07-15T15:00:45Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
4dfef693  e09a4e06  app         1418     run      running   35m7s ago   33m56s ago
e6af8836  096b7d22  app         1418     run      running   35m10s ago  34m8s ago
44e53e09  e09a4e06  app         1418     run      running   36m11s ago  35m6s ago
d697c3ad  80f9f3ec  app         1418     run      running   36m11s ago  35m8s ago
a6834b5a  37ba0b54  app         1418     run      running   36m11s ago  35m12s ago
2082765e  37ba0b54  app         1418     run      running   36m11s ago  35m3s ago
06ce06c0  80f9f3ec  app         1418     run      running   36m11s ago  35m7s ago
236000f6  f339639f  app         1418     run      running   37m11s ago  36m15s ago
c8b46c76  e09a4e06  app         1418     run      running   37m11s ago  36m11s ago
66baa990  37ba0b54  app         1418     run      running   37m11s ago  36m16s ago
8080c584  096b7d22  app         1418     run      running   37m11s ago  36m16s ago
eedb181f  80f9f3ec  app         1418     run      running   37m11s ago  36m12s ago
37546e28  e09a4e06  app         1417     stop     complete  14h16m ago  34m56s ago
e6cbe54b  096b7d22  app         1417     stop     complete  14h16m ago  35m4s ago
a7208d5d  37ba0b54  app         1417     stop     complete  14h17m ago  36m ago
3ce508d3  f339639f  app         1417     stop     complete  14h17m ago  36m5s ago
97ea4707  80f9f3ec  app         1417     stop     complete  14h17m ago  36m5s ago
4a2c4e9f  e09a4e06  app         1417     stop     complete  14h17m ago  36m5s ago
5c920064  80f9f3ec  app         1417     stop     complete  14h17m ago  36m2s ago
91729b2d  80f9f3ec  app         1417     stop     complete  14h18m ago  36m5s ago
84faf102  37ba0b54  app         1417     stop     complete  14h18m ago  36m5s ago
0d5acd27  e09a4e06  app         1417     stop     complete  14h18m ago  36m ago
58eb0325  37ba0b54  app         1417     stop     complete  14h18m ago  36m5s ago
686828bd  096b7d22  app         1417     stop     complete  14h18m ago  36m5s ago
e7c0e967  f339639f  app         1385     run      running   5d16h ago   5d16h ago

Might be better off as a separate issue, but worth mentioning!

@notnoop
Contributor Author

notnoop commented Jul 15, 2020

Thank you very much for reporting more data. I'll need to dig into this further and will follow up with some clarifying questions. I'm very surprised that the error occurred for a non-parameterized, non-periodic job!
