Script health checks failing in v0.10.0 and newer #7185

Closed
far-blue opened this issue Feb 19, 2020 · 5 comments

@far-blue

Overview of the Issue

When I update from v0.9.x to v0.10.x, script-based health checks start failing.

The last version of Nomad that works for me is 0.9.7, and the first failing version is 0.10.0.

I'm seeing the same problem regardless of the version of Consul in use. Most of my testing was with the latest version of Consul (1.7.0).

Reproduction Steps

I defined a Nomad job with a single group and a Docker task. That task has a service section with an associated check. The service definition looks like this:

    service {
        address_mode = "driver"
        name         = "${NOMAD_GROUP_NAME}-${NOMAD_TASK_NAME}-${NOMAD_ALLOC_INDEX}"
        check {
            address_mode = "driver"
            name         = "service ${NOMAD_GROUP_NAME}-${NOMAD_TASK_NAME}-${NOMAD_ALLOC_INDEX} check"
            type         = "script"
            command      = "/usr/local/bin/php-fpm-healthcheck"
            args         = ["--verbose"]
            interval     = "10s"
            timeout      = "2s"
        }
    }

What I'm seeing is that the script check is running (I can see it in the logs for the container) but Consul is reporting the check as failing.

In the Nomad Client logs I'm seeing lines like this when the job is run:

client.alloc_runner.task_runner.task_hook.script_checks: updating check failed: alloc_id=b5b35c01-a02c-9280-cc5c-7e9d82cad360 task=php error="Unexpected response code: 500 (Unknown check "_nomad-check-0a572b0bc970ff486b41cf281b4e0b688b7ef8e9")"

Notice the check ID.

Similarly, I can see the following in the Consul logs every time the check is run:

[ERROR] agent.http: Request error: method=PUT url=/v1/agent/check/update/_nomad-check-0a572b0bc970ff486b41cf281b4e0b688b7ef8e9 from=172.26.174.34:52084 error="Unknown check "_nomad-check-0a572b0bc970ff486b41cf281b4e0b688b7ef8e9""
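
To compare the two sides, it can help to pull the check ID out of each log line mechanically rather than by eye. The following is a minimal sketch, assuming the log formats shown above; `extract_check_ids` is a hypothetical helper, not part of Nomad or Consul, and the ID pattern is inferred only from the samples in this report.

```python
import re

# The check IDs in the logs above look like "_nomad-check-" followed by
# 40 hex characters. That shape is an assumption drawn from these samples only.
CHECK_ID_RE = re.compile(r"_nomad-check-[0-9a-f]{40}")


def extract_check_ids(log_line: str) -> list:
    """Return the distinct check IDs found in a log line, in order of appearance."""
    seen = []
    for match in CHECK_ID_RE.findall(log_line):
        if match not in seen:
            seen.append(match)
    return seen


# The Consul agent log line from this report.
consul_line = (
    '[ERROR] agent.http: Request error: method=PUT '
    'url=/v1/agent/check/update/_nomad-check-0a572b0bc970ff486b41cf281b4e0b688b7ef8e9 '
    'from=172.26.174.34:52084 '
    'error="Unknown check "_nomad-check-0a572b0bc970ff486b41cf281b4e0b688b7ef8e9""'
)

print(extract_check_ids(consul_line))
```

The same helper works on the Nomad client log line, since it carries the same ID inside the quoted error.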

Asking the local Consul agent about its health checks shows that it has a very different idea of the relevant check:

    "_nomad-check-36dc0ce0a6444e63e75419c3f2f5598b006df776": {
        "Node": "argosy1",
        "CheckID": "_nomad-check-36dc0ce0a6444e63e75419c3f2f5598b006df776",
        "Name": "service arya-api-php-0 check",
        "Status": "critical",
        "Notes": "",
        "Output": "TTL expired",
        "ServiceID": "_nomad-task-b5b35c01-a02c-9280-cc5c-7e9d82cad360-php-arya-api-php-0-",
        "ServiceName": "arya-api-php-0",
        "ServiceTags": [
        ],
        "Type": "ttl",
        "Definition": {},
        "CreateIndex": 0,
        "ModifyIndex": 0
    },

Notice the ID is different.
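
The mismatch can also be confirmed programmatically by looking up the ID from the logs in the agent's registered checks. This is a minimal sketch, assuming the JSON above came from the agent's `/v1/agent/checks` endpoint (the report doesn't say how it was fetched) and is a map keyed by check ID. `diagnose_check` is a hypothetical helper, and the sample data is trimmed from the output above.

```python
import json


def diagnose_check(agent_checks: dict, wanted_id: str, alloc_id: str) -> dict:
    """Report whether wanted_id is registered and which checks belong to the allocation.

    agent_checks is assumed to be a map of check ID -> check object, in the
    shape shown above. Matching alloc_id inside ServiceID is a heuristic based
    on the ServiceID in this report, not a documented contract.
    """
    same_alloc = sorted(
        check_id
        for check_id, check in agent_checks.items()
        if alloc_id in check.get("ServiceID", "")
    )
    return {
        "registered": wanted_id in agent_checks,
        "checks_for_alloc": same_alloc,
    }


# Sample data trimmed from the agent output in this report.
agent_checks = json.loads("""
{
    "_nomad-check-36dc0ce0a6444e63e75419c3f2f5598b006df776": {
        "Name": "service arya-api-php-0 check",
        "Status": "critical",
        "Output": "TTL expired",
        "ServiceID": "_nomad-task-b5b35c01-a02c-9280-cc5c-7e9d82cad360-php-arya-api-php-0-",
        "Type": "ttl"
    }
}
""")

result = diagnose_check(
    agent_checks,
    wanted_id="_nomad-check-0a572b0bc970ff486b41cf281b4e0b688b7ef8e9",
    alloc_id="b5b35c01-a02c-9280-cc5c-7e9d82cad360",
)
print(result)
```

For the data in this report, the ID Nomad is trying to update is not registered, while a check with a different ID exists for the same allocation.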

I do have "enable_local_script_checks": true set in the Consul config.

Operating system and Environment details

Ubuntu 18.04

@far-blue
Author

I've seen that #6916 fixes some script-related health check issues. My understanding is that the fix was released in 0.10.3, but I've tested 0.10.3 and I'm still seeing the same issue.

@tgross
Member

tgross commented Feb 19, 2020

Hi @far-blue! Thanks for opening this. That fix was intended for 0.10.3 but unfortunately we ended up having to do a security release. Sorry for the confusion!

If you check the CHANGELOG you'll see that patch is slated for 0.10.4, which is currently available as a release candidate 0.10.4-rc1. Give that a try and if it's not fixed, let me know and I'll definitely take a second look at it.

@far-blue
Author

Hi @tgross :)

Thanks for the quick reply. Yes, now that you point it out in the changelog, I can see it missed the 0.10.3 release. I haven't been able to test 0.10.4-rc1 thoroughly, as I've rolled my cluster back to 0.9.7, but I did a quick check on a single isolated node. I can confirm that the immediate symptoms (ID mismatch errors in the logs and failing health checks) are fixed in the release candidate.

As I'm not keen to run an RC in production, do you know if the 0.10.4 release is likely in February or further into the future?

@tgross
Member

tgross commented Feb 20, 2020

Looks like we just released Nomad 0.10.4, so you should be all set!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2022