Failed node's allocations not rescheduled #2958

Closed · ebdb opened this issue Aug 3, 2017 · 13 comments

ebdb commented Aug 3, 2017

Nomad version

Nomad v0.6.0

Operating system and Environment details

Ubuntu 16.04
2 nodes running Consul/Nomad Server (this is a dev environment and we want to easily test loss of quorum scenarios, hence the even number)
4 nodes running Consul/Nomad Client

Issue

Allocations on failed clients are not rescheduled on other clients. When a failed node reconnects to the cluster, the Jobs that were running on it are stopped and only system Jobs are restarted there; those last two behaviors are expected. In v0.5.6, however, allocations on failed clients were rescheduled on other clients.

Reproduction steps

  1. Run a service Job (a minimal example job spec is sketched after these steps).
  2. Disconnect one of the nodes the Job was allocated on from the network.
  3. Monitor Nomad for new allocations and Consul for new registrations.
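
For reference, a minimal service Job of the kind used in step 1 could look roughly like this (job name, group, image, and values are illustrative, not the reporter's actual job file):

job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "cache" {
    count = 3

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 100
        memory = 64
        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }
}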

Other Notes

Initially, this environment was upgraded from 0.5.6 to 0.6.0. To rule out an upgrade issue, I stopped all of the Nomad services, cleaned out their data folders and started a new cluster. The issue persists on the new cluster.

dadgar (Contributor) commented Aug 3, 2017

Hey @ebdb,

Would you mind sharing the server logs from when this happens and, if you can, the output of nomad status <job> after the node is killed and Nomad detects it as down?

dadgar (Contributor) commented Aug 3, 2017

For example, when I kill nodes, I see the server detect the expired TTLs and replace the nodes' allocations elsewhere (I killed 3 out of 6 nodes and see 3 TTL expirations):

2017/08/03 18:19:17.097147 [WARN] nomad.heartbeat: node 'f78619f4-d83e-a379-db1f-cda5fe59edd4' TTL expired <<<<<<<<<<<<<<<<,
    2017/08/03 18:19:17.098803 [DEBUG] worker: dequeued evaluation b9c82172-a0c4-ed66-06c8-0b52b350f06b
    2017/08/03 18:19:17.099600 [DEBUG] sched: <Eval 'b9c82172-a0c4-ed66-06c8-0b52b350f06b' JobID: 'example'>: Total changes: (place 2) (destructive 0) (inplace 0) (stop 2)
Desired Changes for "cache": (place 2) (inplace 0) (destructive 0) (stop 2) (migrate 0) (ignore 8) (canary 0)
    2017/08/03 18:19:17.101926 [DEBUG] worker: submitted plan for evaluation b9c82172-a0c4-ed66-06c8-0b52b350f06b
    2017/08/03 18:19:17.102754 [DEBUG] sched: <Eval 'b9c82172-a0c4-ed66-06c8-0b52b350f06b' JobID: 'example'>: setting status to complete
    2017/08/03 18:19:17.103277 [DEBUG] worker: updated evaluation <Eval 'b9c82172-a0c4-ed66-06c8-0b52b350f06b' JobID: 'example'>
    2017/08/03 18:19:17.103556 [DEBUG] worker: ack for evaluation b9c82172-a0c4-ed66-06c8-0b52b350f06b
    2017/08/03 18:19:21.752237 [WARN] nomad.heartbeat: node 'cc9ae5db-d01e-682f-94ce-c3478286dee0' TTL expired <<<<<<<<<<<<
    2017/08/03 18:19:21.753612 [DEBUG] worker: dequeued evaluation 0880ddf6-b009-7828-7576-1444ded1c0ae
    2017/08/03 18:19:21.754071 [DEBUG] sched: <Eval '0880ddf6-b009-7828-7576-1444ded1c0ae' JobID: 'example'>: Total changes: (place 2) (destructive 0) (inplace 0) (stop 2)
Desired Changes for "cache": (place 2) (inplace 0) (destructive 0) (stop 2) (migrate 0) (ignore 8) (canary 0)
    2017/08/03 18:19:21.756427 [DEBUG] worker: submitted plan for evaluation 0880ddf6-b009-7828-7576-1444ded1c0ae
    2017/08/03 18:19:21.756473 [DEBUG] sched: <Eval '0880ddf6-b009-7828-7576-1444ded1c0ae' JobID: 'example'>: setting status to complete
    2017/08/03 18:19:21.756679 [DEBUG] worker: updated evaluation <Eval '0880ddf6-b009-7828-7576-1444ded1c0ae' JobID: 'example'>
    2017/08/03 18:19:21.756793 [DEBUG] worker: ack for evaluation 0880ddf6-b009-7828-7576-1444ded1c0ae
    2017/08/03 18:19:24.148579 [WARN] nomad.heartbeat: node '720ba291-a064-6747-69af-808817371fa5' TTL expired <<<<<<<<<<<<<<<<

Further, nomad status marks them as lost, which shows that it handled the nodes going down:

$ nomad status e
ID            = example
Name          = example
Submit Date   = 08/03/17 18:17:49 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         10       0       0         6

Latest Deployment
ID          = 490d6243
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       10       16      0        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
6d225fd0  67ceb1bb  cache       0        run      running  08/03/17 18:19:24 UTC
31ff74f0  559b6fc1  cache       0        run      running  08/03/17 18:19:24 UTC
48d1a318  e01201a2  cache       0        run      running  08/03/17 18:19:21 UTC
22fbd8a4  720ba291  cache       0        stop     lost     08/03/17 18:19:21 UTC
0dcb69e2  cc9ae5db  cache       0        stop     lost     08/03/17 18:19:17 UTC
3a44961b  03d001fc  cache       0        run      running  08/03/17 18:19:17 UTC
0d5e8719  67ceb1bb  cache       0        run      running  08/03/17 18:17:49 UTC
494d497a  67ceb1bb  cache       0        run      running  08/03/17 18:17:49 UTC
53849776  e01201a2  cache       0        run      running  08/03/17 18:17:49 UTC
198a509e  559b6fc1  cache       0        run      running  08/03/17 18:17:49 UTC
91b5be6d  f78619f4  cache       0        stop     lost     08/03/17 18:17:49 UTC
957d4d86  03d001fc  cache       0        run      running  08/03/17 18:17:49 UTC
a4c359b0  f78619f4  cache       0        stop     lost     08/03/17 18:17:49 UTC
b40127c3  720ba291  cache       0        stop     lost     08/03/17 18:17:49 UTC
f60d0a62  559b6fc1  cache       0        run      running  08/03/17 18:17:49 UTC
f70da716  cc9ae5db  cache       0        stop     lost     08/03/17 18:17:49 UTC

I wonder if you just do not have enough resources left on the remaining clients for Nomad to be able to reschedule the work.

ebdb (Author) commented Aug 3, 2017

The status for the service job:

ID            = localization
Name          = localization
Submit Date   = 08/03/17 11:21:08 PDT
Type          = service
Priority      = 50
Datacenters   = us-corp-1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
query       0       0         2        0       0         0
update      0       0         1        0       0         1

Latest Deployment
ID          = 51c59e39
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
query       2        2       0        2
update      2        2       0        2

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
447dc733  bda07614  update      0        stop     lost     08/03/17 11:21:08 PDT
4808a9de  e5333bf8  query       0        run      running  08/03/17 11:21:08 PDT
86f4d140  b84ac250  update      0        run      running  08/03/17 11:21:08 PDT
c06935bc  a2049b3b  query       0        run      running  08/03/17 11:21:08 PDT

And the logs from the cluster leader:

2017/08/03 11:28:52.227492 [WARN] nomad.heartbeat: node 'bda07614-2c53-e3d6-11bd-80a72a3af1fd' TTL expired
2017/08/03 11:28:52.236659 [DEBUG] worker: dequeued evaluation 89320c92-fd5c-fe1b-3fe7-a448c1408311
2017/08/03 11:28:52.236909 [DEBUG] sched: <Eval '89320c92-fd5c-fe1b-3fe7-a448c1408311' JobID: 'fabio'>: allocs: (place 4) (update 0) (migrate 0) (stop 0) (ignore 6) (lost 2)
2017/08/03 11:28:52.243230 [DEBUG] worker: submitted plan for evaluation 89320c92-fd5c-fe1b-3fe7-a448c1408311
2017/08/03 11:28:52.243470 [DEBUG] sched: <Eval '89320c92-fd5c-fe1b-3fe7-a448c1408311' JobID: 'fabio'>: setting status to complete
2017/08/03 11:28:52.251200 [DEBUG] worker: updated evaluation <Eval '89320c92-fd5c-fe1b-3fe7-a448c1408311' JobID: 'fabio'>
2017/08/03 11:28:52.251417 [DEBUG] worker: ack for evaluation 89320c92-fd5c-fe1b-3fe7-a448c1408311
2017/08/03 11:28:52.251574 [DEBUG] worker: dequeued evaluation def3fa52-a8fd-1930-27c1-d60a5a0d240a
2017/08/03 11:28:52.251738 [DEBUG] sched: <Eval 'def3fa52-a8fd-1930-27c1-d60a5a0d240a' JobID: 'authentication'>: allocs: (place 2) (update 0) (migrate 0) (stop 0) (ignore 3) (lost 1)
2017/08/03 11:28:52.284653 [DEBUG] worker: submitted plan for evaluation def3fa52-a8fd-1930-27c1-d60a5a0d240a
2017/08/03 11:28:52.285901 [DEBUG] sched: <Eval 'def3fa52-a8fd-1930-27c1-d60a5a0d240a' JobID: 'authentication'>: setting status to complete
2017/08/03 11:28:52.289766 [DEBUG] worker: updated evaluation <Eval 'def3fa52-a8fd-1930-27c1-d60a5a0d240a' JobID: 'authentication'>
2017/08/03 11:28:52.289948 [DEBUG] worker: ack for evaluation def3fa52-a8fd-1930-27c1-d60a5a0d240a

The scheduler logs information about the two system jobs (fabio, authentication), but does not even mention the service job.

As far as resources are concerned, all of the hosts have capacity to spare. For example, this node:

CPU            Memory          Disk            IOPS
2000/6602 MHz  1.0 GiB/15 GiB  2.9 GiB/48 GiB  0/0

Or this node:

CPU            Memory          Disk            IOPS
2100/6598 MHz  1.1 GiB/15 GiB  3.2 GiB/47 GiB  0/0

The task is requesting only a small fraction of that free capacity:

resources {
	cpu = 100
	memory = 64
	network {
		mbits = 1
		port "app" {}
	}
}

dadgar (Contributor) commented Aug 3, 2017

@ebdb Interesting. Thanks for following up. Does this happen reliably for you, as in every time a node running a service goes down?

dadgar (Contributor) commented Aug 3, 2017

@ebdb If this happens again, please paste the output of curl http://127.0.0.1:4646/v1/job/<job_id>/evaluations

dadgar (Contributor) commented Aug 3, 2017

@ebdb Is it possible to get more of the logs, and the output of nomad inspect <job> as well?

dadgar (Contributor) commented Aug 3, 2017

@ebdb Feel free to ignore all of those requests! Figured it out.

dadgar added a commit that referenced this issue Aug 4, 2017
This PR allows the scheduler to replace lost allocations even if the job
has a failed or paused deployment. The prior behavior was confusing to
users.

Fixes #2958
dadgar (Contributor) commented Aug 4, 2017

Hey, just to update this issue: Nomad was handling the node failures for service jobs, but it was not replacing allocations when the job's most recent deployment was "Failed".

ebdb (Author) commented Aug 4, 2017

I noticed something that seems strange to me. When I have a job with all of its allocations running and passing Consul health checks, I see this as the deployment's status:

ID          = f0e70f2f
Job ID      = localization
Job Version = 0
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
query       2        2       0        2
update      2        2       0        2

Does that seem right to you? Should I open a new issue for that?

dadgar (Contributor) commented Aug 4, 2017

@ebdb The update stanza defines a min_healthy_time and a healthy_deadline. The Consul checks have to be passing within that window. Further, if the tasks don't all start, or if they are restarting, the alloc is considered unhealthy. I would like to make it easier to debug why an alloc went unhealthy in an upcoming release, but this is likely not a bug.
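
For reference, those parameters live in the job's update stanza; a sketch of such a stanza (values illustrative, not taken from the reporter's job) looks like:

update {
  max_parallel     = 1
  health_check     = "checks"   # judge allocation health by its Consul checks
  min_healthy_time = "10s"      # checks must stay passing for at least this long
  healthy_deadline = "5m"       # the allocation must become healthy within this window
}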

ebdb (Author) commented Aug 4, 2017

The health checks are passing on their first run and stay that way, so I am not sure why the deployment wouldn't pass right away. Switching health_check to task_states allows the deployment to succeed, but that isn't the most desirable situation. I'll keep digging to see if there is something in our environment or job files that might be causing the issue.
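
For clarity, that workaround amounts to a change along these lines in the update stanza (illustrative):

update {
  health_check = "task_states"   # judge allocation health by task state instead of Consul checks
}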

Just to rule it out, is there a minimum Consul version this works with?

dadgar (Contributor) commented Aug 4, 2017

@ebdb In the meantime, you can set your log level to trace and look at the client logs to see why it is failing. Any even remotely recent version of Consul should be fine!
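
For reference, assuming the standard log_level option in the client's agent configuration, that would be a one-line change (followed by an agent restart):

log_level = "TRACE"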

github-actions bot locked this issue as resolved and limited conversation to collaborators on Dec 11, 2022, because it had been closed for 120 days.