Failed node's allocations not rescheduled #2958

Closed · ebdb opened this issue Aug 3, 2017 · 13 comments

ebdb commented Aug 3, 2017

Nomad version

Nomad v0.6.0

Operating system and Environment details

Ubuntu 16.04
2 nodes running Consul/Nomad Server (this is a dev environment and we want to easily test loss of quorum scenarios, hence the even number)
4 nodes running Consul/Nomad Client

Issue

Allocations on failed clients are not rescheduled on other clients. When a failed node reconnects to the cluster, the Jobs that were running on it are stopped and only system Jobs are restarted there; those last two behaviors are expected. In v0.5.6, however, allocations on failed clients were rescheduled on other clients.

Reproduction steps

  1. Run a service Job (a minimal example job spec is sketched after these steps).
  2. Disconnect one of the nodes the Job was allocated on from the network.
  3. Monitor Nomad for new allocations and Consul for new registrations.
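
For reference, a minimal service Job of the kind used in step 1 could look roughly like this (job name, group, image, and values are illustrative, not the reporter's actual job file):

job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "cache" {
    count = 3

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 100
        memory = 64
        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }
}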

Other Notes

Initially, this environment was upgraded from 0.5.6 to 0.6.0. To rule out an upgrade issue, I stopped all of the Nomad services, cleaned out their data folders and started a new cluster. The issue persists on the new cluster.

dadgar (Contributor) commented Aug 3, 2017

Hey @ebdb,

Would you mind sharing the server logs from when this happens and, if you can, the output of nomad status <job> after the node is killed and Nomad detects it as down?

dadgar (Contributor) commented Aug 3, 2017

For example, when I kill nodes, I see the server detect the expired TTLs and replace the nodes' allocations elsewhere (I killed 3 out of 6 nodes and see 3 TTL expirations):

2017/08/03 18:19:17.097147 [WARN] nomad.heartbeat: node 'f78619f4-d83e-a379-db1f-cda5fe59edd4' TTL expired <<<<<<<<<<<<<<<<,
    2017/08/03 18:19:17.098803 [DEBUG] worker: dequeued evaluation b9c82172-a0c4-ed66-06c8-0b52b350f06b
    2017/08/03 18:19:17.099600 [DEBUG] sched: <Eval 'b9c82172-a0c4-ed66-06c8-0b52b350f06b' JobID: 'example'>: Total changes: (place 2) (destructive 0) (inplace 0) (stop 2)
Desired Changes for "cache": (place 2) (inplace 0) (destructive 0) (stop 2) (migrate 0) (ignore 8) (canary 0)
    2017/08/03 18:19:17.101926 [DEBUG] worker: submitted plan for evaluation b9c82172-a0c4-ed66-06c8-0b52b350f06b
    2017/08/03 18:19:17.102754 [DEBUG] sched: <Eval 'b9c82172-a0c4-ed66-06c8-0b52b350f06b' JobID: 'example'>: setting status to complete
    2017/08/03 18:19:17.103277 [DEBUG] worker: updated evaluation <Eval 'b9c82172-a0c4-ed66-06c8-0b52b350f06b' JobID: 'example'>
    2017/08/03 18:19:17.103556 [DEBUG] worker: ack for evaluation b9c82172-a0c4-ed66-06c8-0b52b350f06b
    2017/08/03 18:19:21.752237 [WARN] nomad.heartbeat: node 'cc9ae5db-d01e-682f-94ce-c3478286dee0' TTL expired <<<<<<<<<<<<
    2017/08/03 18:19:21.753612 [DEBUG] worker: dequeued evaluation 0880ddf6-b009-7828-7576-1444ded1c0ae
    2017/08/03 18:19:21.754071 [DEBUG] sched: <Eval '0880ddf6-b009-7828-7576-1444ded1c0ae' JobID: 'example'>: Total changes: (place 2) (destructive 0) (inplace 0) (stop 2)
Desired Changes for "cache": (place 2) (inplace 0) (destructive 0) (stop 2) (migrate 0) (ignore 8) (canary 0)
    2017/08/03 18:19:21.756427 [DEBUG] worker: submitted plan for evaluation 0880ddf6-b009-7828-7576-1444ded1c0ae
    2017/08/03 18:19:21.756473 [DEBUG] sched: <Eval '0880ddf6-b009-7828-7576-1444ded1c0ae' JobID: 'example'>: setting status to complete
    2017/08/03 18:19:21.756679 [DEBUG] worker: updated evaluation <Eval '0880ddf6-b009-7828-7576-1444ded1c0ae' JobID: 'example'>
    2017/08/03 18:19:21.756793 [DEBUG] worker: ack for evaluation 0880ddf6-b009-7828-7576-1444ded1c0ae
    2017/08/03 18:19:24.148579 [WARN] nomad.heartbeat: node '720ba291-a064-6747-69af-808817371fa5' TTL expired <<<<<<<<<<<<<<<<

Further, nomad status marks them as lost, which shows that it handled the nodes going down:

$ nomad status e
ID            = example
Name          = example
Submit Date   = 08/03/17 18:17:49 UTC
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         10       0       0         6

Latest Deployment
ID          = 490d6243
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       10       16      0        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
6d225fd0  67ceb1bb  cache       0        run      running  08/03/17 18:19:24 UTC
31ff74f0  559b6fc1  cache       0        run      running  08/03/17 18:19:24 UTC
48d1a318  e01201a2  cache       0        run      running  08/03/17 18:19:21 UTC
22fbd8a4  720ba291  cache       0        stop     lost     08/03/17 18:19:21 UTC
0dcb69e2  cc9ae5db  cache       0        stop     lost     08/03/17 18:19:17 UTC
3a44961b  03d001fc  cache       0        run      running  08/03/17 18:19:17 UTC
0d5e8719  67ceb1bb  cache       0        run      running  08/03/17 18:17:49 UTC
494d497a  67ceb1bb  cache       0        run      running  08/03/17 18:17:49 UTC
53849776  e01201a2  cache       0        run      running  08/03/17 18:17:49 UTC
198a509e  559b6fc1  cache       0        run      running  08/03/17 18:17:49 UTC
91b5be6d  f78619f4  cache       0        stop     lost     08/03/17 18:17:49 UTC
957d4d86  03d001fc  cache       0        run      running  08/03/17 18:17:49 UTC
a4c359b0  f78619f4  cache       0        stop     lost     08/03/17 18:17:49 UTC
b40127c3  720ba291  cache       0        stop     lost     08/03/17 18:17:49 UTC
f60d0a62  559b6fc1  cache       0        run      running  08/03/17 18:17:49 UTC
f70da716  cc9ae5db  cache       0        stop     lost     08/03/17 18:17:49 UTC

I wonder if you just do not have enough resources left on the remaining clients for Nomad to be able to reschedule the work.

ebdb (Author) commented Aug 3, 2017

The status for the service job:

ID            = localization
Name          = localization
Submit Date   = 08/03/17 11:21:08 PDT
Type          = service
Priority      = 50
Datacenters   = us-corp-1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
query       0       0         2        0       0         0
update      0       0         1        0       0         1

Latest Deployment
ID          = 51c59e39
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
query       2        2       0        2
update      2        2       0        2

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created At
447dc733  bda07614  update      0        stop     lost     08/03/17 11:21:08 PDT
4808a9de  e5333bf8  query       0        run      running  08/03/17 11:21:08 PDT
86f4d140  b84ac250  update      0        run      running  08/03/17 11:21:08 PDT
c06935bc  a2049b3b  query       0        run      running  08/03/17 11:21:08 PDT

And the logs from the cluster leader:

2017/08/03 11:28:52.227492 [WARN] nomad.heartbeat: node 'bda07614-2c53-e3d6-11bd-80a72a3af1fd' TTL expired
2017/08/03 11:28:52.236659 [DEBUG] worker: dequeued evaluation 89320c92-fd5c-fe1b-3fe7-a448c1408311
2017/08/03 11:28:52.236909 [DEBUG] sched: <Eval '89320c92-fd5c-fe1b-3fe7-a448c1408311' JobID: 'fabio'>: allocs: (place 4) (update 0) (migrate 0) (stop 0) (ignore 6) (lost 2)
2017/08/03 11:28:52.243230 [DEBUG] worker: submitted plan for evaluation 89320c92-fd5c-fe1b-3fe7-a448c1408311
2017/08/03 11:28:52.243470 [DEBUG] sched: <Eval '89320c92-fd5c-fe1b-3fe7-a448c1408311' JobID: 'fabio'>: setting status to complete
2017/08/03 11:28:52.251200 [DEBUG] worker: updated evaluation <Eval '89320c92-fd5c-fe1b-3fe7-a448c1408311' JobID: 'fabio'>
2017/08/03 11:28:52.251417 [DEBUG] worker: ack for evaluation 89320c92-fd5c-fe1b-3fe7-a448c1408311
2017/08/03 11:28:52.251574 [DEBUG] worker: dequeued evaluation def3fa52-a8fd-1930-27c1-d60a5a0d240a
2017/08/03 11:28:52.251738 [DEBUG] sched: <Eval 'def3fa52-a8fd-1930-27c1-d60a5a0d240a' JobID: 'authentication'>: allocs: (place 2) (update 0) (migrate 0) (stop 0) (ignore 3) (lost 1)
2017/08/03 11:28:52.284653 [DEBUG] worker: submitted plan for evaluation def3fa52-a8fd-1930-27c1-d60a5a0d240a
2017/08/03 11:28:52.285901 [DEBUG] sched: <Eval 'def3fa52-a8fd-1930-27c1-d60a5a0d240a' JobID: 'authentication'>: setting status to complete
2017/08/03 11:28:52.289766 [DEBUG] worker: updated evaluation <Eval 'def3fa52-a8fd-1930-27c1-d60a5a0d240a' JobID: 'authentication'>
2017/08/03 11:28:52.289948 [DEBUG] worker: ack for evaluation def3fa52-a8fd-1930-27c1-d60a5a0d240a

The scheduler logs information about the two system jobs (fabio, authentication), but does not even mention the service job.

As far as resources are concerned, all of the hosts have capacity to spare. For example, this node:

CPU            Memory          Disk            IOPS
2000/6602 MHz  1.0 GiB/15 GiB  2.9 GiB/48 GiB  0/0

Or this node:

CPU            Memory          Disk            IOPS
2100/6598 MHz  1.1 GiB/15 GiB  3.2 GiB/47 GiB  0/0

The task is requesting only a small fraction of that free capacity:

resources {
	cpu = 100
	memory = 64
	network {
		mbits = 1
		port "app" {}
	}
}

dadgar (Contributor) commented Aug 3, 2017

@ebdb Interesting. Thanks for following up. Does this happen reliably for you, as in every time a node running a service goes down?

dadgar (Contributor) commented Aug 3, 2017

@ebdb If this happens again, please paste the output of curl http://127.0.0.1:4646/v1/job/<job_id>/evaluations

dadgar (Contributor) commented Aug 3, 2017

@ebdb Is it possible to get more of the logs, and the output of nomad inspect <job> as well?

dadgar (Contributor) commented Aug 3, 2017

@ebdb Feel free to ignore all of those requests! Figured it out.

dadgar added a commit that referenced this issue Aug 4, 2017
This PR allows the scheduler to replace lost allocations even if the job
has a failed or paused deployment. The prior behavior was confusing to
users.

Fixes #2958
dadgar (Contributor) commented Aug 4, 2017

Hey, just to update this issue: Nomad was handling the node failures for service jobs, but it was not replacing allocations when the job's most recent deployment was "Failed".

ebdb (Author) commented Aug 4, 2017

I noticed something that seems strange to me. When I have a job with all of its allocations running and passing Consul health checks, I see this as the deployment's status:

ID          = f0e70f2f
Job ID      = localization
Job Version = 0
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
query       2        2       0        2
update      2        2       0        2

Does that seem right to you? Should I open a new issue for that?

dadgar (Contributor) commented Aug 4, 2017

@ebdb The update stanza defines a min_healthy_time and a healthy_deadline. The Consul checks have to be passing within that window. Further, if the tasks don't all start, or if they are restarting, the alloc is considered unhealthy. I would like to make it easier to debug why an alloc went unhealthy in an upcoming release, but this is likely not a bug.
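
For reference, those parameters live in the job's update stanza; a sketch of such a stanza (values illustrative, not taken from the reporter's job) looks like:

update {
  max_parallel     = 1
  health_check     = "checks"   # judge allocation health by its Consul checks
  min_healthy_time = "10s"      # checks must stay passing for at least this long
  healthy_deadline = "5m"       # the allocation must become healthy within this window
}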

ebdb (Author) commented Aug 4, 2017

The health checks are passing on their first run and stay that way, so I am not sure why the deployment wouldn't pass right away. Switching health_check to task_states allows the deployment to succeed, but that isn't the most desirable situation. I'll keep digging to see if there is something in our environment or job files that might be causing the issue.
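
For clarity, that workaround amounts to a change along these lines in the update stanza (illustrative):

update {
  health_check = "task_states"   # judge allocation health by task state instead of Consul checks
}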

Just to rule it out, is there a minimum Consul version this works with?

dadgar (Contributor) commented Aug 4, 2017

@ebdb In the meantime, you can set your log level to trace and look at the client logs to see why it is failing. Any even remotely recent version of Consul should be fine!
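
For reference, assuming the standard log_level option in the client's agent configuration, that would be a one-line change (followed by an agent restart):

log_level = "TRACE"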

github-actions bot locked this issue as resolved and limited conversation to collaborators on Dec 11, 2022, because it had been closed for 120 days.