Nomad should try to schedule enough healthy allocations to meet the count requirement without being blocked by broken, unhealthy allocations that can't be removed. #4862
Comments
@hvindin Can you share the output of
@hvindin I think I have similar problems on a similar setup to yours: Nomad 0.8.6, CentOS 7.5, Docker CE 18.06.1-ce (overlay2 on xfs). This condition often results in service outages for us and multiple total node failures throughout the cluster. Do you see something like this when looking at an alloc status?
@preetapan finally found time to get some more details :)
And the timestamp of actually running the command was:

Looking at the consul logs:
Which just scroll back all the way to 2018-11-09T20:31:28+11:00. While the docker logs have since rolled, I recall seeing something about there being an error with the healthcheck that was built into the docker container, which makes sense given the container's reported status. So the container is unhealthy according to both Consul and Docker. I suspect that the key problem here is that Nomad, for whatever reason, still has the alloc marked as "Healthy" despite all the healthchecks we have defined failing.

@stevenscg I think this might be a different issue to the one you are seeing, as the allocation comes up fine and then sits there looking healthy to Nomad, while both Consul and Docker mark it as unhealthy (which conveniently does mean that it is taken out of the pool of routable nodes). It is a bit frustrating, though, because it requires manual intervention to fix: whatever causes the container to become stuck also means that we can't kill the running container without either a SIGKILL or bouncing the server.
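As an aside, Nomad 0.8 does ship a `check_restart` stanza that asks the client to restart a task whose Consul checks keep failing, though whether it would fire while the Docker daemon is hung is another question. A minimal sketch, with the check details (name, path, intervals) being hypothetical:

```hcl
service {
  name = "app"
  port = "http"

  check {
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"

    # Nomad 0.8+: restart the task after 3 consecutive failed checks,
    # ignoring failures during the first 90s after the task starts.
    check_restart {
      limit = 3
      grace = "90s"
    }
  }
}
```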
In looking into this further, the core issue is definitely a Docker problem, in that a job can end up in an unresponsive state because the docker daemon locks up. Ignoring all the little things that went wrong to leave jobs in this state, we have worked out that the docker healthcheck is essentially killing off our containers when a machine gets too busy (it doesn't help that the CPU on the underlying host, we've since discovered, was overallocated to 600%+, and lots of weirdness starts happening at that point...).

So, ideally it would be nice if Nomad were able to detect when this sort of thing happened and make sure that there were always enough healthy allocations to meet the requested count. It would also be nice if Docker were just that little bit more stable (seriously, search for "healthcheck" in the moby issues... apparently this has been happening for ages, and every release since 1.11.1 has "fixed" it...).

In the short term, it seems that if we kill off the docker healthchecks, and also talk angrily at our "cloud" provider about the absurdity of seeing stats about CPU commit on ESX hosts in the high triple digits, then this happens so infrequently that we haven't been able to leave a container hung long enough to investigate, because we can't really justify the outage.

That being said, one thing that would be nice is for Nomad to let us turn off a container's native healthchecks at run time, so we can run the same container versions but redeploy them with those built-in healthchecks disabled.
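For what it's worth, Docker itself can already disable an image's built-in healthcheck at run time, which is roughly the knob being asked of Nomad here. A minimal sketch outside Nomad (the image name is hypothetical):

```sh
# Run a container with its baked-in Dockerfile HEALTHCHECK disabled;
# the build-time equivalent is the "HEALTHCHECK NONE" directive.
docker run --no-healthcheck example/app:latest
```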
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.8.6+ent (306d85c+CHANGES)
(also version 0.8.5)
Operating system and Environment details
RHEL 7.3 & RHEL 7.4
Docker EE 18.03.3 & Docker EE 18.03.4
Issue
When a container allocation becomes unhealthy and Nomad tries to reschedule it, we have observed instances where docker fails to remove the container (which is definitely a docker problem; their RHEL support is a little lacking) and Nomad, having failed to stop the running container, just hangs and doesn't attempt to schedule a new one on another host.
I know that technically my job file only declares "count = 2", but it seems logical that this implies I actually want 2 healthy allocations. In that case it shouldn't matter if Nomad and Docker tie themselves in knots trying to clean up the unhealthy allocation; I would want Nomad to get to work on creating a new healthy allocation, possibly on another node, if it detects a failure removing the unhealthy one. A sketch of the kind of job spec involved is below.
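For illustration, a minimal sketch of such a job, with hypothetical names and images, including the `reschedule` stanza that Nomad 0.8 added for placing replacement allocations on other nodes:

```hcl
job "example" {
  datacenters = ["dc1"]

  group "app" {
    # The intent: 2 *healthy* allocations, not merely 2 scheduled ones.
    count = 2

    # Nomad 0.8+: place replacement allocations (potentially on other
    # nodes) when an allocation fails, backing off between attempts.
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "10m"
      unlimited      = true
    }

    task "web" {
      driver = "docker"

      config {
        image = "example/web:latest"
        port_map {
          http = 8080
        }
      }

      resources {
        network {
          mbits = 10
          port "http" {}
        }
      }
    }
  }
}
```

As the report above suggests, though, rescheduling only helps once Nomad actually marks the stuck allocation as failed, which appears not to happen while Docker can't kill the container.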
Reproduction steps
This is tricky and I'll update here when I figure out how to reproduce it reliably, but for now it may be worth considering this a question of Nomad's intended behavior versus what it actually does when an allocation can't be removed for some reason.