
Nomad should try and schedule enough healthy allocations to meet the count requirement without being bound by broken unhealthy allocations that can't be removed. #4862

Closed
hvindin opened this issue Nov 10, 2018 · 6 comments

Comments


hvindin commented Nov 10, 2018

Nomad version

Nomad v0.8.6+ent (306d85c+CHANGES)
(also version 0.8.5)

Operating system and Environment details

RHEL 7.3 & RHEL 7.4
Docker EE 18.03.3 & Docker EE 18.03.4

Issue

When a container allocation becomes unhealthy and Nomad tries to reschedule it, we have observed instances where Docker fails to remove the container (which is definitely a Docker problem; their RHEL support is a little lacking) and Nomad, having failed to stop the running container, just hangs and doesn't attempt to schedule a new one on another host.

I know that I might technically only be declaring "count = 2" in my job file, but it seems logical that this should imply that I actually want 2 healthy allocations. In that case it shouldn't matter if Nomad and Docker tie themselves in knots trying to clean up the unhealthy allocation; I would want Nomad to get working on creating a new healthy allocation, possibly on another node, if it detects a failure removing the unhealthy allocation. (See the sketch below.)
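For reference, a minimal sketch of the kind of job spec being described; the names and values here are illustrative, not the actual job:

```hcl
job "example" {
  datacenters = ["dc1"]

  group "api" {
    # The expectation described above: count = 2 should mean
    # "two healthy allocations", not merely two placed allocations.
    count = 2

    task "api" {
      driver = "docker"

      config {
        image = "example/api:latest" # illustrative image name
      }

      service {
        name = "api"
        port = "http"

        check {
          type     = "http"
          path     = "/healthcheck"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 500
        memory = 256

        network {
          port "http" {}
        }
      }
    }
  }
}
```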

Reproduction steps

This is tricky and I'll update here when I figure out how to reproduce it reliably, but for now it may be worth considering this a question of Nomad's intended behavior vs. what it actually does when an allocation can't be removed for some reason.

preetapan (Contributor) commented

@hvindin Can you share the output of nomad alloc status <> for the allocation that's in this state? In general, if its ClientStatus is failed, it will get rescheduled onto another node.
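For context, the rescheduling behavior preetapan refers to is controlled by the job's reschedule stanza (and local restarts by the restart stanza) in Nomad 0.8+. A minimal sketch with illustrative values, not a recommendation for this particular job:

```hcl
group "api" {
  count = 2

  # Local restarts on the same node happen first; with mode = "fail",
  # exhausting the attempts marks the allocation as failed.
  restart {
    attempts = 2
    interval = "5m"
    delay    = "15s"
    mode     = "fail"
  }

  # Failed allocations are then eligible to be rescheduled,
  # potentially onto a different node.
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }
}
```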

stevenscg commented

@hvindin I think I have similar problems on a similar setup to yours: Nomad 0.8.6, CentOS 7.5, Docker CE 18.06.1-ce (overlay2 on xfs).

This condition often results in service outages for us and multiple total node failures throughout the cluster.

Do you see something like this when looking at an alloc status?

Recent Events:
Time                  Type              Description
2018-11-19T16:19:57Z  Not Restarting    Error was unrecoverable
2018-11-19T16:19:57Z  Driver Failure    failed to start task "api" for alloc "2a77ecf1-bba3-2e77-854e-522a5fa70b72": Failed to create container: Failed to purge container 20cec463f38e8f8de766ab7d53550afb187dddaf284c59d6abfd9dcca97f47dc: API error (409): {"message":"removal of container 20cec463f38e8f8de766ab7d53550afb187dddaf284c59d6abfd9dcca97f47dc is already in progress"}
2018-11-19T16:19:57Z  Restarting        Task restarting in 0s
2018-11-19T16:14:57Z  Killed            Task successfully killed
2018-11-19T16:14:42Z  Killing           Sent interrupt. Waiting 30s before force killing
2018-11-19T16:14:42Z  Restart Signaled  Template: template with change_mode restart re-rendered
2018-11-19T16:14:20Z  Signaling         Task being sent signal user defined signal 2: Template: template re-rendered
2018-11-19T12:31:39Z  Signaling         Task being sent signal user defined signal 2: Template: template re-rendered
2018-11-19T12:14:54Z  Signaling         Task being sent signal user defined signal 2: Template: template re-rendered
2018-11-18T14:10:09Z  Started           Task started by client


hvindin commented Nov 21, 2018

@preetapan finally found time to get some more details :)

ID                  = 4a50ffc0
Eval ID             = 1e2b31ff
Name                = ptr-openbnkng-acctinf-v1.ptr-openbnkng-acctinf-v1[1]
Node ID             = b3f61758
Job ID              = ptr-openbnkng-acctinf-v1
Job Version         = 0
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 18d16h ago
Modified            = 11d15h ago
Deployment ID       = 36d8e9c0
Deployment Health   = healthy

Task "ptr-openbnkng-acctinf-v1-exec" is "running"
Task Resources
CPU           Memory           Disk     IOPS  Addresses
262/1000 MHz  271 MiB/1.0 GiB  300 MiB  0     ssl: 10.41.130.181:27191

Task Events:
Started At     = 2018-11-09T09:26:16Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2018-11-09T20:22:14+11:00

Recent Events:
Time                       Type                   Description
2018-11-09T20:31:28+11:00  Killing                Sent interrupt. Waiting 5s before force killing
2018-11-09T20:31:28+11:00  Restart Signaled       healthcheck: check "API HTTP Health Check" unhealthy
2018-11-09T20:26:16+11:00  Started                Task started by client
2018-11-09T20:22:14+11:00  Restarting             Task restarting in 11.252425842s
2018-11-09T20:22:08+11:00  Killed                 Task successfully killed
2018-11-09T20:20:17+11:00  Killing                Sent interrupt. Waiting 5s before force killing
2018-11-09T20:20:17+11:00  Restart Signaled       healthcheck: check "API HTTP Health Check" unhealthy
2018-11-02T20:02:43+11:00  Started                Task started by client
2018-11-02T20:01:03+11:00  Driver                 Downloading image docker.wdp.local/ptr/docker-ptr-openbnkng-acctinf-v1:209
2018-11-02T20:01:02+11:00  Downloading Artifacts  Client is downloading artifacts

And the timestamp of actually running the command was:
2018-11-21T12:10:11+11:00

Looking at the consul logs:

connection failed: dial tcp 10.41.130.181:27191: connect: connection refused
2018/11/21 12:12:47 [WARN] agent: Check "19e1fdc84c31d88e83aac133e79dbb63978aac04" HTTP request failed: Get https://10.41.130.181:27191/ptr/openbnkng/acctinf/v1/healthcheck: dial tcp 10.41.130.181:27191: connect: connection refused
2018/11/21 12:12:50 [WARN] agent: Check "a795ad56f8a67c570720f83c510e8bedc1df2d52" socket connection failed: dial tcp 10.41.130.181:27191: connect: connection refused
2018/11/21 12:12:52 [WARN] agent: Check "19e1fdc84c31d88e83aac133e79dbb63978aac04" HTTP request failed: Get https://10.41.130.181:27191/ptr/openbnkng/acctinf/v1/healthcheck: dial tcp 10.41.130.181:27191: connect: connection refused
2018/11/21 12:12:55 [WARN] agent: Check "a795ad56f8a67c570720f83c510e8bedc1df2d52" socket connection failed: dial tcp 10.41.130.181:27191: connect: connection refused
[... the same two check warnings repeat every 2-3 seconds through 12:14:00 ...]
2018/11/21 12:14:01 [INFO] agent: Synced service "_nomad-task-aqlukbltl5vqyj4y3eki32zms476j547"
2018/11/21 12:14:01 [INFO] agent: Synced service "_nomad-task-woldlynwx4uqkisyp6kbhrg5z7vo766b"
2018/11/21 12:14:01 [INFO] agent: Synced service "_nomad-client-hqqiak7hc2bs7ki43iqobdolwgfejwrt"
2018/11/21 12:14:01 [INFO] agent: Synced service "_nomad-task-4eukvz5otx6lwyrzyxtwpgijh6pl2aoy"
2018/11/21 12:14:01 [INFO] agent: Synced service "_nomad-task-b46w62xskmdsivqqqcspqpteq6vue4j4"
2018/11/21 12:14:01 [INFO] agent: Synced service "_nomad-task-3dzzqstyizl5yqrkf4glinmrado3cfe5"
2018/11/21 12:14:02 [WARN] agent: Check "19e1fdc84c31d88e83aac133e79dbb63978aac04" HTTP request failed: Get https://10.41.130.181:27191/ptr/openbnkng/acctinf/v1/healthcheck: dial tcp 10.41.130.181:27191: connect: connection refused
2018/11/21 12:14:05 [WARN] agent: Check "a795ad56f8a67c570720f83c510e8bedc1df2d52" socket connection failed: dial tcp 10.41.130.181:27191: connect: connection refused
[... the same two check warnings continue repeating every 2-3 seconds through 12:14:50 ...]

These warnings just scroll back all the way to 2018-11-09T20:31:28+11:00.

While the Docker logs have since rolled, I recall seeing something about an error with the healthcheck that was built into the Docker container, which makes sense given that the container's status is "Up 11 days (unhealthy)".

So the container is unhealthy according to both Consul and Docker.

I suspect that the key problem here is that nomad, for whatever reason, still has the alloc marked as "Healthy" despite all the healthchecks we have defined failing.

@stevenscg I think this might be a different issue to the one you are seeing. The allocation comes up fine and then sits there looking healthy to Nomad, while both Consul and Docker mark it as unhealthy (which conveniently does mean it is taken out of the cluster of routable nodes). It is a bit frustrating, though, because it requires manual intervention to fix: whatever causes the container to become stuck also means we can't kill the running container without either a SIGKILL or bouncing the server.
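For context, the "Restart Signaled healthcheck: check ... unhealthy" events in the alloc status above come from Nomad's check_restart mechanism, which restarts a task when its Consul check stays unhealthy; combined with restart mode = "fail", repeated check failures can eventually fail the allocation and make it eligible for rescheduling. A sketch reconstructed from the check names and paths in the output above; the exact values are assumptions, not the job's real configuration:

```hcl
service {
  name = "ptr-openbnkng-acctinf-v1"
  port = "ssl"

  check {
    name     = "API HTTP Health Check"
    type     = "http"
    protocol = "https"
    path     = "/ptr/openbnkng/acctinf/v1/healthcheck"
    interval = "5s"
    timeout  = "2s"

    # Restart the task after 3 consecutive unhealthy results,
    # allowing a 30s grace period after the task (re)starts.
    check_restart {
      limit           = 3
      grace           = "30s"
      ignore_warnings = false
    }
  }
}
```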


hvindin commented Feb 9, 2019

Looking into this further, the core issue is definitely a Docker problem, in that a job can end up in an unresponsive state because the Docker daemon locks up.

Ignoring all the little things that went wrong along the way to leave jobs in this state, we have worked out that the Docker healthcheck is essentially killing off our containers when a machine gets too busy (it doesn't help that, we've since discovered, the CPU on the underlying host was overallocated to 600%+, so lots of weirdness starts happening at that point...).

So, ideally, it would be nice if Nomad were able to figure out when this sort of thing happens and make sure that there are always $count healthy allocations available, even if other unhealthy allocations can't be removed.

It would also be nice if Docker were just that little bit more stable (seriously, search for "healthcheck" in the moby issues... apparently this has been happening for ages, and every release since 1.11.1 has "fixed" it...).

But in the short term, it seems that if we kill off the Docker healthchecks, and also talk angrily at our "cloud" provider about the absurdity of ESX hosts showing CPU commit in the high triple digits, this happens so infrequently that we haven't been able to leave it hung long enough to investigate; we can't really justify doing so.

That being said, one thing that would be nice is for Nomad to let us turn off native healthchecks at runtime, so we can run the same container versions but redeploy them with the healthcheck set to NONE. I've created #5310 about that.


stale bot commented May 10, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

stale bot closed this as completed May 10, 2019
github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022