
Task should not be marked as complete when it's killed by node draining? #3691

Closed

dukeland9 opened this issue Dec 27, 2017 · 5 comments

@dukeland9

Nomad version

0.7.1

Operating system and Environment details

Ubuntu 14.04 & 16.04

Issue

I'm using Nomad to run distributed batch jobs on a ~30-machine cluster.
When I drain a node, all running allocations on that node are killed and marked complete, which means the tasks in those allocations are never rescheduled.
Shouldn't the correct behavior be that the allocations on the draining node are killed, marked as failed, and then rescheduled on another node?


@jippi
Contributor

jippi commented Dec 28, 2017

What job type are you using? Batch jobs currently won't be retried, AFAIK.

@dukeland9
Author

@jippi The job type is raw_exec.

I set the retry policy of my batch job to 2 retries within 24h. From my observation, transient task/alloc failures did recover, either on the same machine or on another one.
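For reference, below is a minimal sketch of the kind of job being described: a raw_exec batch job with 2 restart attempts in a 24h window. The job/task names and the command are placeholders, since the actual job file wasn't posted.

job "example-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "workers" {
    # Retry policy as described above: 2 attempts within 24 hours.
    restart {
      attempts = 2
      interval = "24h"
      delay    = "15s"
      mode     = "fail"
    }

    task "worker" {
      driver = "raw_exec"

      config {
        # Placeholder command; the real batch workload wasn't shared.
        command = "/usr/local/bin/run-batch-task"
      }

      resources {
        cpu    = 100
        memory = 300
      }
    }
  }
}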

@dukeland9
Author

Any updates on this?

@schmichael
Member

Hi @dukeland9,

Sorry you hit this! This was fixed in #3717 and is included in the binary from #3698.

It should only occur when there aren't enough cluster resources to immediately replace a drained allocation, so adding more capacity before draining should work around the issue.

For example, using Nomad 0.7.1 and the demo/vagrant/ server and client configurations as a base, I ran a 2-client cluster with this job and these configs: https://gist.github.com/schmichael/ac478447c67e5c396d080fc209bcc218

When I drained the node the batch job was running on, it exited and was rescheduled on the other node.
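For context, a minimal sketch of what a sleeper batch job like this might look like is below. The actual job and client/server configs are in the linked gist; the driver and sleep command here are assumptions, with resources matching the allocation output that follows.

job "sleeper" {
  datacenters = ["dc1"]
  type        = "batch"

  group "sleeper" {
    task "sleeper" {
      # Assumed driver and command; see the gist above for the exact job used.
      driver = "exec"

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }

      resources {
        cpu    = 100  # MHz, matching the 0/100 MHz shown in the status output below
        memory = 300  # MiB
      }
    }
  }
}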

vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad run batch.nomad
==> Monitoring evaluation "7200a490"
    Evaluation triggered by job "sleeper"
    Allocation "db9dbf7a" created: node "83b23692", group "sleeper"
    Allocation "db9dbf7a" status changed: "pending" -> "running"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7200a490" finished with status "complete"

vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status db
ID                  = db9dbf7a
Eval ID             = 7200a490
Name                = sleeper.sleeper[0]
Node ID             = 83b23692
Job ID              = sleeper
Job Version         = 0
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 8s ago
Modified            = 8s ago

Task "sleeper" is "running"
Task Resources
CPU        Memory          Disk     IOPS  Addresses
0/100 MHz  23 MiB/300 MiB  300 MiB  0

Task Events:
Started At     = 01/10/18 19:53:55 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type        Description
01/10/18 19:53:55 UTC  Started     Task started by client
01/10/18 19:53:55 UTC  Task Setup  Building Task Directory
01/10/18 19:53:55 UTC  Received    Task received by client

vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad node-drain -enable 83b23692
Are you sure you want to enable drain mode for node "83b23692-afca-3199-f449-b32c380f0b9f"? [y/N] y

vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status sleeper
ID            = sleeper
...

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
7f7f8ed9  b64a527b  sleeper     0        run      running   27s ago  27s ago
db9dbf7a  83b23692  sleeper     0        stop     complete  54s ago  27s ago

Sorry for the hassle! I'm closing this since it's fixed on master, but please reopen if you find that's not the case!

@github-actions

github-actions bot commented Dec 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 4, 2022