mishandling of service deregistration with script healthcheck #10482

Closed
fredwangwang opened this issue Apr 30, 2021 · 2 comments
Labels: hcc/cst, stage/accepted, theme/consul, theme/service-discovery, type/bug

Comments

@fredwangwang
Contributor

Nomad version

v1.0.4

Operating system and Environment details

linux

Issue

When stopping a job that has a service with a script check, the following warning is logged by Nomad:

[WARN]  client.alloc_runner.task_runner.task_hook.script_checks: updating check failed: alloc_id=4f0f5dac-3497-4bcc-e248-90bb8364ac73 task=huan-fake error="Unexpected response code: 500 (Unknown check "_nomad-check-5c9d58746b0a2a10bfaf3fc6ae306eb22a61e4a1")"

and in Consul:

[ERROR] agent.http: Request error: method=PUT url=/v1/agent/check/update/_nomad-check-5c9d58746b0a2a10bfaf3fc6ae306eb22a61e4a1 from=127.0.0.1:36812 error="Unknown check "_nomad-check-5c9d58746b0a2a10bfaf3fc6ae306eb22a61e4a1""
[ERROR] agent.http: Request error: method=PUT url=/v1/agent/check/update/_nomad-check-5c9d58746b0a2a10bfaf3fc6ae306eb22a61e4a1 from=127.0.0.1:36812 error="Unknown check "_nomad-check-5c9d58746b0a2a10bfaf3fc6ae306eb22a61e4a1""

Reproduction steps

Deploy a job that has a service with a script check, then stop the job.
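
For example (a minimal sketch, assuming the job file below is saved as huan.nomad):

nomad job run huan.nomad    # deploy and wait for the allocation to become healthy
nomad job stop huan         # stop the job; the warning shows up in the Nomad client logs shortly after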

Expected Result

No warning or error messages for a normal job shutdown.

Actual Result

Warning/error messages as shown above.

Job file (if appropriate)

job "huan" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "service"

  group "huan-group" {
    count = 1

    network {
      mode = "bridge"
      port "api" {}
    }
    service {
      name = "huan-api"
      port = "api"

      check {
        type     = "script"
        name     = "check"
        command  = "echo"
        args     = ["1"]
        task     = "huan-fake"
        interval = "60s"
        timeout  = "5s"
      }
    }

    task "huan-fake" {
      driver = "docker"

      config {
        image = "nicholasjackson/fake-service:v0.12.0"
      }

      env {
        LISTEN_ADDR             = "0.0.0.0:${NOMAD_PORT_api}"
        NAME                    = "${NOMAD_TASK_NAME}"
        MESSAGE                 = "${NOMAD_TASK_NAME}:${NOMAD_ALLOC_ID} HostIp: ${attr.unique.network.ip-address}"
      }
    }
  }
}

Nomad Client logs (if appropriate)

see above

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 30, 2021
@tgross
Member

tgross commented Apr 30, 2021

Hi @fredwangwang! I was able to reproduce this with that job (thanks for providing a minimal repro!). Here are the relevant logs that I'm getting out of Consul:

2021-04-30T18:24:55.774Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/agent/service/deregister/_nomad-task-d97e5017-74f2-ea15-281b-3ec9c2718f80-group-huan-group-huan-api-api?ns=default from=127.0.0.1:33876 latency=22.034826ms
...
2021-04-30T18:24:55.962Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/check/update/_nomad-check-8c6fedfd6e7d799e2b207e5cc9b6fa518c89fd9e from=127.0.0.1:34004 error="CheckID "default/_nomad-check-8c6fedfd6e7d799e2b207e5cc9b6fa518c89fd9e" does not have associated TTL"

So you can see we deregister the service and the checks but not atomically. This is a long-standing issue (see #3935). Fortunately in this case you're just getting some log noise, but it's definitely on our roadmap to fix.
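
For illustration, the two Consul agent HTTP calls involved look roughly like this (a curl sketch against the default agent address 127.0.0.1:8500, using the IDs from the logs above; not Nomad's exact code path):

# Nomad deregisters the group service, which also removes its checks
curl -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/_nomad-task-d97e5017-74f2-ea15-281b-3ec9c2718f80-group-huan-group-huan-api-api

# The script-check runner then still pushes a result for the now-removed check,
# which Consul rejects with the "Unknown check" / "does not have associated TTL" error
curl -X PUT -d '{"Status": "passing", "Output": "1"}' http://127.0.0.1:8500/v1/agent/check/update/_nomad-check-8c6fedfd6e7d799e2b207e5cc9b6fa518c89fd9e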

@tgross
Member

tgross commented May 14, 2024

Doing a review of open service discovery issues because there are a lot of duplicates of #16616, which I'm working on. This issue is not a duplicate, but it was fixed by #14944, which closed #3935. That fix shipped in Nomad 1.4.2, with backports.
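
For anyone hitting this on an older client, a quick way to confirm whether the running binary already includes the fix (sketch):

nomad version    # should report 1.4.2 or later, or a release line that received the backport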

@tgross tgross closed this as completed May 14, 2024