
Nomad tries to call RecoverTask on an allocation it has GC'd minutes ago; restarts tasks it shouldn't #10901

Closed
benbuzbee opened this issue Jul 15, 2021 · 7 comments · Fixed by #10907

Comments

@benbuzbee
Contributor

Nomad version

1.0.4

Issue

We have a custom task driver and noticed the following behavior in our logs. All of these log lines are annotated with alloc_id=190deedc-2a2e-9687-f70c-d8fbfead9da2. Our driver's logs are marked with a + and Nomad's own logs with a ~

+ 2021-07-13T20:29:21.468Z Nomad requested to stop task "190deedc-2a2e-9687-f70c-d8fbfead9da2/browser/16d4fde1"
+ 2021-07-13T20:29:23.475Z	RPC call "StopTask" succeeded in 2.006652261s
~ 2021-07-13T20:29:24.403Z	reattached plugin process exited
~ 2021-07-13T20:29:24.588Z	plugin exited
~ 2021-07-13T20:29:25.169Z	marking allocation for GC
~ 2021-07-13T20:29:25.507Z	task run loop exiting

...
~ 2021-07-13T20:45:35.458Z	Starting Nomad agent...
+ 2021-07-13T20:45:35.486Z	RPC call "RecoverTask" failed in 635.538µs: error creating controller:
~ 2021-07-13T20:45:35.496Z	marking allocation for GC

As you can see, Nomad stops the task and marks the allocation for GC at 20:29:25, yet when the agent starts up at 20:45:35 it still tries to recover the task, and then marks the allocation for GC again.

It seems like strange behavior that it tries to recover something it has already stopped and destroyed.
I also think this problem means a job with restart/reschedule attempts set to 0 gets restarted anyway; see the reproduction steps below for an example.

Reproduction steps

  1. Use the example job from nomad job init, but set the restart policy to 0 attempts and reschedule { attempts = 0, unlimited = false } (see the HCL sketch after this list)
$ ./bin/nomad job run example.nomad 
==> Monitoring evaluation "5eea32f3"
    Evaluation triggered by job "example"
==> Monitoring evaluation "5eea32f3"
    Evaluation within deployment: "76699497"
    Allocation "efd1d8b9" created: node "a6367934", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "5eea32f3" finished with status "complete"
  2. Make the job fail
./bin/nomad alloc signal efd
2021-07-14T23:53:38Z  Not Restarting  Policy allows no restarts
2021-07-14T23:53:38Z  Terminated      Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2021-07-14T23:53:38Z  Signaling       Task being sent a signal
2021-07-14T23:53:21Z  Started         Task started by client
2021-07-14T23:53:20Z  Task Setup      Building Task Directory
2021-07-14T23:53:20Z  Received        Task received by client
  3. Stop the Nomad client/server process
    Note from the logs:
2021-07-14T23:53:38.787Z [INFO]  client.gc: marking allocation for GC: alloc_id=efd1d8b9-c51d-469d-e862-bbd642ba71c5
  4. Start the Nomad client/server process with the same data dir as before
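
For reference, the changes from step 1 look roughly like this in the group stanza of the job generated by nomad job init (a sketch only; the rest of example.nomad is unchanged):

    group "cache" {
      # No local restarts on the client...
      restart {
        attempts = 0
        mode     = "fail"
      }

      # ...and no rescheduling onto another node either.
      reschedule {
        attempts  = 0
        unlimited = false
      }

      # tasks, network, etc. from the generated example.nomad stay as-is
    }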

The job is running even though it already failed and has 0/0 set for its restart/reschedule policies:

vagrant@browser-devbox:~/core/src/third_party/cloudchamber-nomad-driver$ ./bin/nomad status 
ID       Type     Priority  Status   Submit Date
example  service  50        running  2021-07-14T23:53:20Z

Apparently it has forgotten about the times it was killed:

Recent Events:
Time                  Type        Description
2021-07-14T23:54:29Z  Started     Task started by client
2021-07-14T23:54:28Z  Task Setup  Building Task Directory
2021-07-14T23:54:28Z  Received    Task received by client
2021-07-14T23:53:20Z  Task Setup  Building Task Directory
2021-07-14T23:53:20Z  Received    Task received by client

It force-GC'd the old allocation for some reason, then restarted it?

2021-07-14T23:54:26.661Z [INFO]  client.gc: garbage collecting allocation: alloc_id=efd1d8b9-c51d-469d-e862-bbd642ba71c5 reason="forced collection"
2021-07-14T23:54:28.059Z [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=efd1d8b9-c51d-469d-e862-bbd642ba71c5 task=redis

You can't see this from Nomad's logs alone, but I can see from our custom driver's logs that, before all of this, the second Nomad instance called RecoverTask.

Expected Result

  1. It doesn't call RecoverTask after it has destroyed and GC'd the allocation
  2. It DEFINITELY doesn't restart it

Actual Result

  1. It does
  2. It definitely does :)

Attachments

first-client-run.log
second-client-run.log
example.nomad.txt

@notnoop
Contributor

notnoop commented Jul 15, 2021

I'm afraid that I haven't been able to reproduce this issue or discern the triggering condition so far. Can you try to reproduce it again and send a tarball of the data dirs from when the client/server is stopped the first time and from after the alloc got restarted?

I'll continue inspecting the code for clues as well.

@notnoop notnoop moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jul 15, 2021
@benbuzbee
Contributor Author

Here's a tar file of a repro
https://drive.google.com/drive/folders/18j5sSVTb9ZYYw-3T61jCXLLwCQzSENwA?usp=sharing

I don't know if this is the full repro of RecoverTask being called, but the shortest way with that tar to reproduce something that I think is broken is:

./nomad agent -config config.hcl
./nomad job run example.nomad
./nomad alloc signal <ALLOC>
./nomad status <ALLOC> (verify it has stopped and will not restart)
ctrl+c (see data-dir-first for /tmp/nomad output at this step, see first.log for log output )
./nomad agent -config config.hcl
./nomad status <ALLOC> (should be dead but will be running again)
ctrl+c (see data-dir-second for /tmp/nomad output at this step, see second.log for log output)

This is using the latest Nomad release (also packaged in the tar).

@notnoop
Contributor

notnoop commented Jul 17, 2021

Thank you @benbuzbee for the details. Also, I was able to reproduce it with a Linux single server/client cluster! And I plan to have a fix soon!

The problem is a variant of #5911 that disproportionately affects single server/client clusters. I'd be curious to know whether it happened in production for you.

The issue is that the client pulled an outdated/stale state view while the server was still applying raft logs. The client fetches its allocations with AllowStale=true, recognizing the always-present delay in propagating info and assuming that the index is monotonically increasing. However, if the server is just starting up and replaying raft logs, the client may see an outdated view and start tasks that should not start. That also explains why the client "forgot" that it had killed the task.
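
To illustrate the invariant the client relies on (a standalone Go sketch, not Nomad's actual client code or the eventual fix): an allocation pull should only be acted on if its raft index is at least as new as the last index already applied, otherwise a replaying server can feed the client a view from before the failure.

    package main

    import "fmt"

    // allocUpdateTracker illustrates the monotonic-index assumption described
    // above: only act on allocation pulls whose raft index is at least as new
    // as the last index already applied.
    type allocUpdateTracker struct {
        lastIndex uint64 // highest raft index already applied
    }

    // shouldApply reports whether a pull at the given raft index may be applied,
    // and records it as the new high-water mark if so.
    func (t *allocUpdateTracker) shouldApply(index uint64) bool {
        if index < t.lastIndex {
            // Stale view, e.g. the server is still replaying raft logs after a
            // restart; applying it could resurrect allocs that were already
            // stopped and GC'd.
            return false
        }
        t.lastIndex = index
        return true
    }

    func main() {
        t := &allocUpdateTracker{}
        // In-order pulls at 16, 20, 25 are applied; a later pull claiming
        // index 16 again (a stale view) is skipped.
        for _, idx := range []uint64{16, 20, 25, 16} {
            fmt.Printf("index=%d apply=%v\n", idx, t.shouldApply(idx))
        }
    }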

In the latest repro logs, I noticed the following:

  1. The 143c2382 allocation was updated to pending at raft index 18, running at index 19, and failed at index 25.
  2. When the client starts up the second time, RecoverTask is invoked for the task; the client ignores it and properly marks it for GC.
  3. Later on, the client gets a stale view of the allocs at indexes 16 and 20, from before the alloc failed, so the client starts it. The client then overwrites the alloc events. When the server catches up, the client ignores the failed update because JobModifyIndex hasn't changed (leaving this detail out for now).

The following are the relevant log lines. Note particularly the leader-election events and the client's "updated allocations" log lines with their indexes.

    2021-07-15T20:53:56.456Z [INFO]  client.gc: marking allocation for GC: alloc_id=143c2382-942c-aabf-0368-0c4eaee95d20
    2021-07-15T20:53:56.457Z [DEBUG] client: updated allocations: index=1 total=0 pulled=0 filtered=0
[...]
    2021-07-15T20:53:56.457Z [INFO]  client.gc: garbage collecting allocation: alloc_id=143c2382-942c-aabf-0368-0c4eaee95d20 reason="forced collection"
    2021-07-15T20:53:56.490Z [DEBUG] client.gc: alloc garbage collected: alloc_id=143c2382-942c-aabf-0368-0c4eaee95d20
    2021-07-15T20:53:57.442Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
    2021-07-15T20:53:57.442Z [INFO]  nomad.raft: entering candidate state: node="Node at 192.168.1.35:4647 [Candidate]" term=4
    2021-07-15T20:53:57.499Z [DEBUG] nomad.raft: votes: needed=1
    2021-07-15T20:53:57.499Z [DEBUG] nomad.raft: vote granted: from=192.168.1.35:4647 term=4 tally=1
    2021-07-15T20:53:57.499Z [INFO]  nomad.raft: election won: tally=1
    2021-07-15T20:53:57.499Z [INFO]  nomad.raft: entering leader state: leader="Node at 192.168.1.35:4647 [Leader]"
    2021-07-15T20:53:57.499Z [INFO]  nomad: cluster leadership acquired
    2021-07-15T20:53:57.519Z [DEBUG] client: updated allocations: index=16 total=1 pulled=1 filtered=0
    2021-07-15T20:53:57.519Z [DEBUG] client: allocation updates: added=1 removed=0 updated=0 ignored=0
    2021-07-15T20:53:57.540Z [DEBUG] client: allocation updates applied: added=1 removed=0 updated=0 ignored=0 errors=0
    2021-07-15T20:53:57.540Z [DEBUG] client: updated allocations: index=20 total=1 pulled=0 filtered=1
    2021-07-15T20:53:57.540Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
    2021-07-15T20:53:57.540Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
    2021-07-15T20:53:57.540Z [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=143c2382-942c-aabf-0368-0c4eaee95d20 task=redis
    2021-07-15T20:53:57.540Z [DEBUG] client: updated allocations: index=25 total=1 pulled=0 filtered=1
    2021-07-15T20:53:57.541Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1

@benbuzbee
Contributor Author

benbuzbee commented Jul 20, 2021

I'd be curious to know if it happened in production for you.

From the logs in my original post we at least saw RecoverTask being called for a previously Destroy'd task, and one that has restarts set to 0 to boot. I am not very certain it is the exact same root cause, but I'm happy with the plan of merging your fix for this and then re-evaluating whether I still see it :)

Appreciate the quick and deep look! Excited to get these new improvements into our cluster!

@notnoop
Contributor

notnoop commented Jul 20, 2021

From the logs in my original post we at least saw RecoverTask being called for a previously Destroy'd task, and one that has restarts set to 0 to boot. I am not very certain it is the exact same root cause

This RecoverTask invocation is not surprising to me. The client keeps local state for an alloc until the server has GC'd it. RecoverTask aims to load local state from the data dir and reattach to any existing running processes; and while arguably we should handle dead tasks a bit differently, we haven't seen the need for that optimization/special-casing yet.

Let us know how your experience goes!

Nomad - Community Issues Triage automation moved this from In Progress to Done Jul 20, 2021
@benbuzbee
Contributor Author

OK, so I understand from that that it's expected for RecoverTask to get called and for it to just return an error since the task is dead. That's not necessarily a problem; it just makes it hard for our custom driver to tell the difference between a real error in RecoverTask and an inevitable error because Nomad tried to Recover something it already asked us to Destroy. I'm not sure of the best way to filter that noise out of error monitoring.
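
One option might be for the driver itself to treat "already gone" as benign. A rough Go sketch of what I mean, not our actual driver code (it assumes the TaskHandle shape from Nomad's plugins/drivers package; lookupController/attachController and the sentinel error are hypothetical stand-ins):

    package exampledriver // sketch only

    import (
        "errors"
        "fmt"

        hclog "github.com/hashicorp/go-hclog"
        "github.com/hashicorp/nomad/plugins/drivers"
    )

    // errControllerNotFound is a hypothetical sentinel the driver could return
    // when the backing controller/process for a task no longer exists.
    var errControllerNotFound = errors.New("controller not found")

    // Driver stands in for the custom task driver discussed in this issue.
    type Driver struct {
        logger hclog.Logger
    }

    // RecoverTask treats "the task is already gone" as benign instead of an
    // error, so only genuine recovery failures reach error monitoring.
    func (d *Driver) RecoverTask(handle *drivers.TaskHandle) error {
        if handle == nil {
            return fmt.Errorf("RecoverTask called with nil handle")
        }

        // The handle says the task already exited (e.g. we stopped it and the
        // alloc was later GC'd): nothing to reattach to, so don't report an error.
        if handle.State == drivers.TaskStateExited {
            d.logger.Debug("RecoverTask: task already exited, nothing to recover",
                "task_id", handle.Config.ID)
            return nil
        }

        ctrl, err := d.lookupController(handle.Config.ID)
        if err != nil {
            if errors.Is(err, errControllerNotFound) {
                // Expected after a Destroy + GC; log it but don't alert on it.
                d.logger.Debug("RecoverTask: no controller for task, treating as dead",
                    "task_id", handle.Config.ID)
                return nil
            }
            return fmt.Errorf("error creating controller: %w", err)
        }
        return d.attachController(ctrl, handle)
    }

    // lookupController and attachController are placeholders for the
    // driver-specific reattach logic.
    func (d *Driver) lookupController(taskID string) (interface{}, error) {
        return nil, errControllerNotFound
    }

    func (d *Driver) attachController(ctrl interface{}, h *drivers.TaskHandle) error {
        return nil
    }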

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 17, 2022