
Task should not be marked as complete when it's killed by node draining? #3691

Closed

dukeland9 opened this issue Dec 27, 2017 · 5 comments

@dukeland9

Nomad version

0.7.1

Operating system and Environment details

Ubuntu 14.04 & 16.04

Issue

I'm using Nomad to run distributed batch jobs on a ~30-machine cluster.
When I drain a node, all running allocations on that node are killed and marked complete, which means the tasks in those allocations are never rescheduled.
Shouldn't the correct behavior be that the allocations on the draining node are killed, marked as failed, and then rescheduled on another node?


@jippi
Contributor

jippi commented Dec 28, 2017

What job type are you using? Batch jobs currently won't be retried, AFAIK.

@dukeland9
Author

@jippi The job type is raw_exec.

I set the retry policy of my batch job to 2 retries within 24h. From my observation, transient task/alloc failures did recover, either on the same machine or on another one.
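For reference, below is a minimal sketch of the kind of job being described: a raw_exec batch job with 2 restart attempts in a 24h window. The job/task names and the command are placeholders, since the actual job file wasn't posted.

job "example-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "workers" {
    # Retry policy as described above: 2 attempts within 24 hours.
    restart {
      attempts = 2
      interval = "24h"
      delay    = "15s"
      mode     = "fail"
    }

    task "worker" {
      driver = "raw_exec"

      config {
        # Placeholder command; the real batch workload wasn't shared.
        command = "/usr/local/bin/run-batch-task"
      }

      resources {
        cpu    = 100
        memory = 300
      }
    }
  }
}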

@dukeland9
Author

Any updates on this?

@schmichael
Member

Hi @dukeland9,

Sorry you hit this! This was fixed in #3717 and is included in the binary from #3698.

It should only occur when there aren't enough cluster resources to immediately replace a drained allocation, so adding more capacity before draining should work around the issue.

For example, using Nomad 0.7.1 and the demo/vagrant/ server and client configurations as a base, I ran a 2-client cluster with this job and these configs: https://gist.github.com/schmichael/ac478447c67e5c396d080fc209bcc218

When I drained the node the batch job was running on, it exited and was rescheduled on the other node.
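For context, a minimal sketch of what a sleeper batch job like this might look like is below. The actual job and client/server configs are in the linked gist; the driver and sleep command here are assumptions, with resources matching the allocation output that follows.

job "sleeper" {
  datacenters = ["dc1"]
  type        = "batch"

  group "sleeper" {
    task "sleeper" {
      # Assumed driver and command; see the gist above for the exact job used.
      driver = "exec"

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }

      resources {
        cpu    = 100  # MHz, matching the 0/100 MHz shown in the status output below
        memory = 300  # MiB
      }
    }
  }
}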

vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad run batch.nomad
==> Monitoring evaluation "7200a490"
    Evaluation triggered by job "sleeper"
    Allocation "db9dbf7a" created: node "83b23692", group "sleeper"
    Allocation "db9dbf7a" status changed: "pending" -> "running"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7200a490" finished with status "complete"

vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status db
ID                  = db9dbf7a
Eval ID             = 7200a490
Name                = sleeper.sleeper[0]
Node ID             = 83b23692
Job ID              = sleeper
Job Version         = 0
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 8s ago
Modified            = 8s ago

Task "sleeper" is "running"
Task Resources
CPU        Memory          Disk     IOPS  Addresses
0/100 MHz  23 MiB/300 MiB  300 MiB  0

Task Events:
Started At     = 01/10/18 19:53:55 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type        Description
01/10/18 19:53:55 UTC  Started     Task started by client
01/10/18 19:53:55 UTC  Task Setup  Building Task Directory
01/10/18 19:53:55 UTC  Received    Task received by client

vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad node-drain -enable 83b23692
Are you sure you want to enable drain mode for node "83b23692-afca-3199-f449-b32c380f0b9f"? [y/N] y

vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status sleeper
ID            = sleeper
...

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
7f7f8ed9  b64a527b  sleeper     0        run      running   27s ago  27s ago
db9dbf7a  83b23692  sleeper     0        stop     complete  54s ago  27s ago

Sorry for the hassle! I'm closing this since it's fixed on master, but please reopen if you find that's not the case!

@github-actions

github-actions bot commented Dec 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 4, 2022