
Server-side restarts of tasks failed on clients #1461

Closed
dadgar opened this issue Jul 22, 2016 · 15 comments


@dadgar
Contributor

dadgar commented Jul 22, 2016

If a task has failed on a client and it could potentially be recoverable on another, the server should replace the task group onto a new node.

@camerondavison
Contributor

Has there been any thought to putting this on the roadmap, or mitigating it some other way? Currently, if I shut down a Nomad client, say to upgrade Nomad or the OS, there is a good chance that some of the tasks randomly fail when trying to start up on another node. The random failure usually comes from a Docker race condition, but be that as it may, I would prefer not to have to resubmit my job just because one machine failed in order to get all of its tasks running again.

@dadgar
Contributor Author

dadgar commented Feb 27, 2017

@a86c6f7964 It is something we are hoping to tackle in 0.6.0

@dadgar dadgar added this to the v0.6.0 milestone Feb 27, 2017
@dadgar dadgar removed this from the v0.6.0 milestone May 11, 2017
@burdandrei
Contributor

so not 0.7, 0.8?

@jovandeginste

I usually drain nodes for maintenance. This obviously doesn't work when a node randomly fails, but for upgrades it has worked pretty well so far. Then again, I don't usually have multi-task task groups.

@SoMuchToGrok

SoMuchToGrok commented Nov 9, 2017

Occasionally when I drain a node, some jobs won't be re-allocated and will remain in a dead state with "alloc not needed as node is tainted". It doesn't happen often, but when it does it quickly becomes a major issue (it's not easy to have visibility into these failures without building monitoring and alerting around everything).

It's hard to say definitively that I'm running into this exact problem, but it certainly feels that way. Is there any update for this on the roadmap? This feels like a critical issue for me.

Only relevant logs I could find, may or may not be helpful:
https://pastebin.com/raw/G9vdxYEG

@burdandrei
Contributor

My experience was even worse: I had a job with multiple groups. After a node failed, some groups were relocated and some remained dead.

@samart

samart commented Dec 14, 2017

Per job, it would be nice to set:

task_unreachable_timeout
task_gone_timeout

If Nomad cannot see that a service task is running (unreachable), or sees that the task is gone, it should try to schedule it on another host.

@SoMuchToGrok

SoMuchToGrok commented Feb 7, 2018

Any update on this from a roadmap perspective?

I've experienced node failures in AWS EC2, and this has bitten me a few times now. It has also happened during downscale events with an EC2 ASG: AWS only waits so long when terminating an instance before it forcefully kills everything. Given that a downscale first requires a drain (which can sometimes take minutes), AWS almost always ends up forcefully killing our Nomad clients. The desired behavior here is more or less a requirement for a scheduler, since node failures are a guarantee in the cloud.

Is there anything the community can do to help push this along? I'd take a stab at it myself, but it would take some time for me to get up to speed on the internals (but willing to do so, if needed).

@preetapan
Contributor

@SoMuchToGrok This is landing in Nomad 0.8. You can follow this branch for details if interested.

@shantanugadgil
Contributor

I hit similar issues when I upgrade my compute fleet serially, as @a86c6f7964 described. Sometimes the Docker daemon fails to respond after the OS packages are upgraded.

I hadn't thought of draining the node before shutting down Nomad, Consul, and Docker and updating the OS packages.
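For reference, the drain-before-maintenance workflow mentioned above can be sketched with the `nomad node drain` CLI (syntax as of Nomad 0.8; the node ID is a placeholder you would look up with `nomad node status`):

```shell
# Mark the node ineligible for scheduling and migrate its
# allocations to other nodes before stopping Nomad/Docker
# for the OS upgrade. -yes skips the interactive confirmation.
nomad node drain -enable -yes <node-id>

# After the upgrade, make the node schedulable again.
nomad node drain -disable -yes <node-id>
```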

@dkua

dkua commented Apr 10, 2018

@preetapan the branch mentioned doesn't seem to exist anymore, and the 0.8 CHANGELOG doesn't seem to include it: https://github.com/hashicorp/nomad/blob/b-canary-auto/CHANGELOG.md. Is this no longer on the roadmap or being worked on? I don't see any mention of it anywhere.

@preetapan
Contributor

@dkua the changelog mentions it as follows

core: Failed tasks are automatically rescheduled according to user specified criteria. For more information on configuration, see the Reschedule Stanza [GH-3981]

Docs for rescheduling
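As a sketch of what the 0.8 `reschedule` stanza looks like in a job file (field names as documented for Nomad 0.8; the job, group, and task names here are made up for illustration):

```hcl
job "example" {
  datacenters = ["dc1"]

  group "web" {
    # Reschedule failed allocations onto other eligible nodes
    # instead of leaving them dead on a failed client.
    reschedule {
      attempts       = 3             # up to 3 reschedule attempts...
      interval       = "30m"         # ...within any 30-minute window
      delay          = "30s"         # wait before the first attempt
      delay_function = "exponential" # back off between attempts
      max_delay      = "1h"          # cap on the exponential backoff
      unlimited      = false         # stop once attempts are exhausted
    }

    task "app" {
      driver = "docker"
      config {
        image = "nginx:1.15"
      }
    }
  }
}
```

Setting `unlimited = true` (and omitting `attempts`/`interval`) instead tells Nomad to keep rescheduling indefinitely with the configured backoff.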

@dkua

dkua commented Apr 10, 2018

@preetapan Ah okay, thank you, that's great to know; I'll let my team know. I didn't notice it at first since the branch-to-follow 404s and #3981 doesn't reference this issue.

@preetapan
Contributor

This was addressed with rescheduling in 0.8.

@github-actions

github-actions bot commented Dec 1, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 1, 2022