services: zombie service in nomad service provider #14618

Closed
shoenig opened this issue Sep 19, 2022 · 1 comment

Comments


shoenig commented Sep 19, 2022

$ nomad version 
Nomad v1.4.0-beta.1 (d17021a366178a11b79353f13735389629102d6a)

I've been playing whack-a-bug getting this job to run; there were client restarts, task failures, etc. I eventually got it working, but noticed that in the end there are 2 services registered when there should be only 1. Not great, but the real problem is that there doesn't seem to be a way to delete the zombie service. Stopping the job does not clean it up, and the service delete command requires a service_id, which I don't(?) have.

Edit: The service_id can be retrieved via nomad service info -verbose <name>, so the zombie can be deleted with

$ nomad service delete hclfmt _nomad-task-3eba4b4e-5ae1-f9f6-7b5b-e727c338663e-hclfmt-hclfmt-http 
Successfully deleted service registration

2 services in provider (unexpected!)

$ nomad service info hclfmt
Job ID  Address          Tags                                                                                                                                                                                                                                                                                                                                                                                                                    Node ID   Alloc ID
hclfmt  10.17.0.5:26782  [traefik.enable=true,traefik.http.routers.hclfmt.rule=Path(`/hclfmt`),traefik.http.routers.hclfmt.middlewares=stripper-hclfmt@nomad,traefik.http.middlewares.stripper-hclfmt.stripprefix.prefixes=/hclfmt,traefik.http.middlewares.stripper-hclfmt.stripprefix.forceSlash=false,traefik.http.routers.hclfmt.tls.certresolver=le,traefik.http.routers.hclfmt.entrypoints=https,traefik.http.routers.hclfmt.priority=70]  0ba430df  3eba4b4e
hclfmt  10.17.0.5:21314  [traefik.enable=true,traefik.http.routers.hclfmt.rule=Path(`/hclfmt`),traefik.http.routers.hclfmt.middlewares=stripper-hclfmt@nomad,traefik.http.middlewares.stripper-hclfmt.stripprefix.prefixes=/hclfmt,traefik.http.middlewares.stripper-hclfmt.stripprefix.forceSlash=false,traefik.http.routers.hclfmt.tls.certresolver=le,traefik.http.routers.hclfmt.entrypoints=https,traefik.http.routers.hclfmt.priority=70]  0ba430df  f8299bc8

but only 1 alloc (with 1 task)

$ nomad job status hclfmt | grep -A3 Allocations
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
f8299bc8  0ba430df  format      0        run      running  1h40m ago  1h40m ago

and for sure just the one instance running (expected)

$ ps -ef | grep hclfmt
anonymo+    1335     778  0 13:25 ?        00:00:00 /opt/bin/hclfmt-web
shoenig     3422    2477  0 15:06 pts/0    00:00:00 grep hclfmt

Only the :21314 should be registered at this point

Allocation Addresses (mode = "host"):
Label  Dynamic  Address
*http  yes      10.17.0.5:21314
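
The mismatch shown above (two registrations, one running allocation) can be checked mechanically. Below is a minimal Python sketch, not part of the original report. It assumes registration records carry an `AllocID` field and allocation records carry `ID` and `ClientStatus` fields, which is how I understand Nomad's HTTP API to shape them. The function name `find_zombie_registrations` is hypothetical, the sample records use the short IDs printed above, and the ID of the live registration is a made-up placeholder because the report does not show it.

```python
def find_zombie_registrations(registrations, allocations):
    """Return registrations whose allocation is not currently running."""
    # Collect IDs of allocations that are actually running.
    running = {a["ID"] for a in allocations if a.get("ClientStatus") == "running"}
    # Any registration pointing at a non-running allocation is a zombie.
    return [r for r in registrations if r.get("AllocID") not in running]


# Sample data mirroring the output above (short IDs as printed by the CLI).
registrations = [
    {
        "ID": "_nomad-task-3eba4b4e-5ae1-f9f6-7b5b-e727c338663e-hclfmt-hclfmt-http",
        "Address": "10.17.0.5",
        "Port": 26782,
        "AllocID": "3eba4b4e",
    },
    {
        "ID": "live-registration-id",  # hypothetical placeholder, not from the report
        "Address": "10.17.0.5",
        "Port": 21314,
        "AllocID": "f8299bc8",
    },
]
allocations = [{"ID": "f8299bc8", "ClientStatus": "running"}]

zombies = find_zombie_registrations(registrations, allocations)
for z in zombies:
    print(f"zombie: {z['Address']}:{z['Port']} service_id={z['ID']}")
```

Each service_id reported this way could then be passed to nomad service delete, as in the edit near the top of this comment.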

job file for reference

job "hclfmt" {
  datacenters = ["nyc3"]
  type        = "service"

  group "format" {
    network {
      mode = "host"
      port "http" {
        host_network = "internal"
      }
    }

    update {
      min_healthy_time = "1s"
    }

    task "hclfmt" {
      driver = "pledge"
      user   = "anonymous"
      config {
        command  = "/opt/bin/hclfmt-web"
        promises = "stdio rpath inet tty"
      }
      env {
        PORT = "${NOMAD_PORT_http}"
        BIND = "${NOMAD_IP_http}"
      }
      resources {
        cpu    = 10
        memory = 16
      }
      service {
        provider = "nomad"
        name     = "hclfmt"
        port     = "http"
        check {
          name     = "hclfmt-tcp"
          type     = "tcp"
          interval = "5s"
          timeout  = "1s"
        }
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.hclfmt.rule=Path(`/hclfmt`)",
          "traefik.http.routers.hclfmt.middlewares=stripper-hclfmt@nomad",
          "traefik.http.middlewares.stripper-hclfmt.stripprefix.prefixes=/hclfmt",
          "traefik.http.middlewares.stripper-hclfmt.stripprefix.forceSlash=false",
          "traefik.http.routers.hclfmt.tls.certresolver=le",
          "traefik.http.routers.hclfmt.entrypoints=https",
          "traefik.http.routers.hclfmt.priority=70",
        ]
      }
    }
  }
}

shoenig changed the title from "servicedico: zombie service in nomad service provider" to "services: zombie service in nomad service provider" Sep 19, 2022
shoenig added a commit that referenced this issue May 12, 2023
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook, which would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

Fixes #17079
Fixes #16238
Fixes #14618
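
The failure mode described in the commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not Nomad's actual code: the `AllocRunner` class, its methods, the `guard_terminal` flag, and the assumed "failed" status of the old allocation are all hypothetical, and the set of terminal statuses is my reading of Nomad's client statuses. It only shows why skipping hooks for a terminal allocation prevents the leaked registration.

```python
class AllocRunner:
    """Toy stand-in for an allocation runner; not Nomad's real type."""

    # Assumed terminal client statuses.
    TERMINAL = {"complete", "failed", "lost"}

    def __init__(self, alloc_id, client_status, registry):
        self.alloc_id = alloc_id
        self.client_status = client_status
        self.registry = registry  # shared set of alloc IDs with registered services

    def is_terminal(self):
        return self.client_status in self.TERMINAL

    def run_prerun_hooks(self):
        # Stand-in for group_service_hook registering the group's services.
        self.registry.add(self.alloc_id)

    def restart(self, guard_terminal=True):
        # A terminal allocation will never run tasks again, so running its
        # hooks would register services that nothing ever deregisters.
        if guard_terminal and self.is_terminal():
            return False
        self.run_prerun_hooks()
        return True


registry = set()
dead = AllocRunner("3eba4b4e", "failed", registry)  # status is an assumption

dead.restart(guard_terminal=False)  # old behavior: hooks run anyway
leaked = set(registry)              # snapshot of the leaked registration

registry.clear()
dead.restart()                      # guarded behavior: restart is a no-op
print("without guard:", leaked, "with guard:", registry)
```

With the guard in place, only allocations that can still run tasks reach the hooks, so every registration has a matching deregistration path.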
shoenig added a commit that referenced this issue May 15, 2023
This PR fixes a bug where issuing a restart to a terminal allocation
would cause the allocation to run its hooks anyway. This was particularly
apparent with group_service_hook, which would then register services but
then never deregister them - as the allocation would be effectively in
a "zombie" state where it is prepped to run tasks but never will.

Fixes #17079
Fixes #16238
Fixes #14618
tgross self-assigned this May 14, 2024

tgross commented May 14, 2024

Going to assign myself this as a likely dupe of #16616. If anyone has additional info on this, please see my comment here: #16616 (comment) first and report in that thread.

Closing as duplicate.

tgross closed this as not planned May 14, 2024