
[Feature Request] Add option to nullify docker container healthchecks #5310

Closed
hvindin opened this issue Feb 9, 2019 · 3 comments · Fixed by #14089

Comments

hvindin commented Feb 9, 2019

Description

When building Docker containers, the option exists to declare a HEALTHCHECK during the build process. Nomad doesn't use this information at all; it declares its own checks to monitor container health.

Essentially, this means that unless you are running more than one scheduler to manage your containers, there is no reason to have these health checks running.

Thus, it would be nice to be able to disable the native healthchecks at runtime when the containers being scheduled have a useless but resource-consuming HEALTHCHECK defined.

Use case

A long time ago, possibly before we even started using Nomad to manage our clusters, I made the extremely naive decision to throw in a

HEALTHCHECK --interval=2s --timeout=5s --retries=5 CMD curl -Ssi http://127.0.0.1:8080/healthcheck | grep -q 200 

In some of our Dev and Test environments we have up to 70 jobs running per node, some of which contain code that takes a long time to start up. Looking at the processes running on a node when we have just drained one of these giant nodes and have 60+ jobs starting at once reveals literally thousands of curl commands building up before things stabilise. Furthermore, after some time, when a couple of jobs on a node have become unhealthy, the Docker daemon seems to lock up and Nomad is unable to remove the now-unresponsive container.

We didn't actually notice that this was an issue previously because our nodes were a more sensible size, so the cumulative impact went unnoticed.

Given this issue, our current way forward is likely going to be to just throw another layer onto the currently running versions of our containers, with

HEALTHCHECK NONE

as the only change. However, one of the fun things about working in a large organisation is that I'm sure some of the technical owners are going to want to manually regression test the change, because it's a change to their container that they don't understand.
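
For reference, a minimal sketch of that wrapper layer (the image name is illustrative); per Docker's documentation, HEALTHCHECK NONE in a child image disables any healthcheck inherited from the base image:

FROM registry.example.internal/myapp:current
HEALTHCHECK NONE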

Examples

The Docker CLI provides a --no-healthcheck option at run time, and the API allows NONE to be passed as the healthcheck test to disable any predefined health checks. From memory, Kubernetes has disabled Docker health checks unconditionally since roughly mid-2017.
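
At the CLI level this is a single flag (the image name is again illustrative):

docker run --no-healthcheck registry.example.internal/myapp:current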

So this seems like it should be incredibly simple to do at the point where the container configuration is being put together, just before the container starts.

My assumption would be that it is better to default to leaving Docker health checks as they are, on the off chance that the results are being used by someone for something, but to provide an option in the task config to disable the native health checks if desired.
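
As a rough sketch of what that opt-out could look like in the docker driver's task config (the block and attribute names here are purely illustrative, not an existing Nomad option):

task "app" {
  driver = "docker"

  config {
    image = "registry.example.internal/myapp:current"

    # Hypothetical opt-out: ask the driver to create the container with the
    # image's HEALTHCHECK disabled (equivalent to docker run --no-healthcheck).
    healthchecks {
      disable = true
    }
  }
}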

rgl commented Dec 15, 2020

Also please consider the other way around: using the status of the HEALTHCHECK CMD defined in the container as a valid check and propagating that to Nomad/Consul. That is, when the container has a HEALTHCHECK CMD, use it (maybe in addition to the checks that are defined in the Nomad job).

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@Amier3 Amier3 added the help-wanted We encourage community PRs for these issues! label Apr 1, 2022
@SrMouraSilva

I agree with @rgl.
I think that, now that Nomad (1.3.0) has native service discovery, it could also report the container status to the Services API. That way, Traefik or another reverse proxy could take the Docker container's health state into account when deciding whether to expose it, or perhaps prevent deploying an unhealthy canary.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2022