
Server-side restarts of tasks failed on clients #1461

Closed
dadgar opened this issue Jul 22, 2016 · 15 comments


@dadgar
Contributor

dadgar commented Jul 22, 2016

If a task has failed on a client and it could potentially be recoverable on another, the server should replace the task group onto a new node.

@camerondavison
Contributor

Has there been any thought to putting this on the roadmap, or mitigating it some other way? Currently, if I shut down a Nomad client, say to upgrade Nomad or the OS, there is a good chance that some of the tasks randomly fail when trying to start up on another node. The random failure usually comes from a Docker race condition, but be that as it may, I would prefer not to have to resubmit my job just because one machine failed in order to get all of its tasks running again.

@dadgar
Contributor Author

dadgar commented Feb 27, 2017

@a86c6f7964 It is something we are hoping to tackle in 0.6.0

@dadgar dadgar added this to the v0.6.0 milestone Feb 27, 2017
@dadgar dadgar removed this from the v0.6.0 milestone May 11, 2017
@burdandrei
Contributor

so not 0.7, 0.8?

@jovandeginste

I usually drain nodes for maintenance. This obviously doesn't work when a node randomly fails, but for upgrades it has worked pretty well so far. Then again, I don't usually have multi-task task groups.

@SoMuchToGrok

SoMuchToGrok commented Nov 9, 2017

Occasionally when I drain a node, some jobs won't be re-allocated and will remain in a dead state with "alloc not needed as node is tainted". It doesn't happen often, but when it does it quickly becomes a major issue (it's not easy to have visibility into these failures without building monitoring and alerting around everything).

It's hard to say definitively that I'm running into this exact problem, but it certainly feels that way. Is there any update for this on the roadmap? This feels like a critical issue for me.

Only relevant logs I could find, may or may not be helpful:
https://pastebin.com/raw/G9vdxYEG

@burdandrei
Contributor

My experience was even worse: I had a job with multiple groups. After a node failed, some groups were relocated and some remained dead.

@samart

samart commented Dec 14, 2017

Per job, it would be nice to set:

task_unreachable_timeout
task_gone_timeout

If Nomad cannot see that a service task is running (unreachable), or sees that the task is gone, it should try to schedule it on another host.

@SoMuchToGrok

SoMuchToGrok commented Feb 7, 2018

Any update on this from a roadmap perspective?

I've experienced node failures in AWS EC2, and this has bitten me a few times now. It has also happened during downscale events with an EC2 ASG: AWS only waits so long when terminating an instance before it forcefully kills everything. Given that a downscale first requires a drain (which can sometimes take minutes), AWS almost always ends up forcefully killing our Nomad clients. The desired behavior here is more or less a requirement for a scheduler, since node failures are a guarantee in the cloud.

Is there anything the community can do to help push this along? I'd take a stab at it myself, but it would take some time for me to get up to speed on the internals (but willing to do so, if needed).

@preetapan
Contributor

@SoMuchToGrok This is landing in Nomad 0.8. You can follow this branch for details if interested.

@shantanugadgil
Contributor

I hit similar issues when I upgrade my compute fleet serially, as @a86c6f7964 described. Sometimes the Docker daemon fails to respond after the OS packages are upgraded.

I hadn't thought of draining the node before shutting down Nomad, Consul, and Docker and updating the OS packages.
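For reference, the drain-before-maintenance workflow mentioned above can be sketched with the `nomad node drain` CLI (syntax as of Nomad 0.8; the node ID is a placeholder you would look up with `nomad node status`):

```shell
# Mark the node ineligible for scheduling and migrate its
# allocations to other nodes before stopping Nomad/Docker
# for the OS upgrade. -yes skips the interactive confirmation.
nomad node drain -enable -yes <node-id>

# After the upgrade, make the node schedulable again.
nomad node drain -disable -yes <node-id>
```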

@dkua

dkua commented Apr 10, 2018

@preetapan the branch mentioned doesn't seem to exist anymore, and the 0.8 CHANGELOG doesn't seem to include it: https://github.com/hashicorp/nomad/blob/b-canary-auto/CHANGELOG.md. Is this no longer on the roadmap or being worked on? I don't see any mention of it anywhere.

@preetapan
Contributor

@dkua the changelog mentions it as follows

core: Failed tasks are automatically rescheduled according to user specified criteria. For more information on configuration, see the Reschedule Stanza [GH-3981]

Docs for rescheduling
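As a sketch of what the 0.8 `reschedule` stanza looks like in a job file (field names as documented for Nomad 0.8; the job, group, and task names here are made up for illustration):

```hcl
job "example" {
  datacenters = ["dc1"]

  group "web" {
    # Reschedule failed allocations onto other eligible nodes
    # instead of leaving them dead on a failed client.
    reschedule {
      attempts       = 3             # up to 3 reschedule attempts...
      interval       = "30m"         # ...within any 30-minute window
      delay          = "30s"         # wait before the first attempt
      delay_function = "exponential" # back off between attempts
      max_delay      = "1h"          # cap on the exponential backoff
      unlimited      = false         # stop once attempts are exhausted
    }

    task "app" {
      driver = "docker"
      config {
        image = "nginx:1.15"
      }
    }
  }
}
```

Setting `unlimited = true` (and omitting `attempts`/`interval`) instead tells Nomad to keep rescheduling indefinitely with the configured backoff.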

@dkua

dkua commented Apr 10, 2018

@preetapan Ah okay, thank you, that's great to know; I'll let my team know. I didn't notice it at first since the branch-to-follow 404s and #3981 doesn't reference this issue.

@preetapan
Contributor

This was addressed with rescheduling in 0.8.

@github-actions

github-actions bot commented Dec 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 1, 2022