
consul service registration failed - invalid port / port label not found #9709

Closed
MorphBonehunter opened this issue Dec 30, 2020 · 6 comments


MorphBonehunter commented Dec 30, 2020

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

ArchLinux 5.4.84-1-lts

Issue

Service registration fails every now and then when resubmitting a job (for example, updating the Docker image version).

Reproduction steps

I cannot reliably reproduce this in my environment, as it does not occur on every job update.
It seems to happen since I moved the deprecated network stanza from the task level to the group level, and it is not bound to one job; when it appears, it happens on different jobs.
It happened to the job below after I made the following change (which doesn't touch the service definition etc.):

Job validation successful
+/- Job: "victoriametrics-job"
+/- Task Group: "victoriametrics-group" (1 in-place update)
  +/- Task: "victoriametrics-task" (forces in-place update)
    +/- Service {
        AddressMode:       "auto"
        EnableTagOverride: "false"
        Name:              "victoriametrics"
        PortLabel:         "http"
        TaskName:          ""
      + Tags {
        + Tags: "prometheus_monitoring"
          Tags: "traefik.enable=true"
          Tags: "traefik.http.routers.victoriametrics.entryPoints=https"
          Tags: "traefik.http.routers.victoriametrics.middlewares=compression@file,secureheaders@file,local-access-only@file"
          Tags: "traefik.http.routers.victoriametrics.rule=Host(`victoriametrics.<redacted>`)"
          Tags: "traefik.http.routers.victoriametrics.tls=true"
          Tags: "traefik.tags=host_${attr.unique.hostname}"
        }
        }

I can fix the problem with a stop/start cycle of the job.

Job file (if appropriate)

job "victoriametrics-job" {
  region = "home"
  datacenters = ["underverse"]
  type = "service"
  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "evelyn"
  }
  update {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "45s"
    healthy_deadline = "300s"
    auto_revert = true
  }
  group "victoriametrics-group" {
    count = 1
    restart {
      attempts = 3
      interval = "10m"
      delay    = "60s"
      mode     = "fail"
    }
    network {
      port "http" {
        to = 8428
      }
    }
    task "victoriametrics-task" {
      driver = "docker"
      config {
        image = "<redacted>/victoriametrics:2020-12-28-11-21"
        args = [
          "-storageDataPath=/victoriametrics",
          "-retentionPeriod=1y"
        ]
        mount {
          type = "volume"
          source = "victoriametrics"
          target = "/victoriametrics"
        }
        ports = [
          "http"
        ]
      }
      resources {
        cpu    = 500  # MHz
        memory = 2048 # MB
      }
      service {
        name = "victoriametrics"
        tags = [
          "prometheus_monitoring",
          "traefik.enable=true",
          "traefik.http.routers.victoriametrics.entryPoints=https",
          "traefik.http.routers.victoriametrics.tls=true",
          "traefik.http.routers.victoriametrics.rule=Host(`victoriametrics.<redacted>`)",
          "traefik.http.routers.victoriametrics.middlewares=compression@file,secureheaders@file,local-access-only@file",
          "traefik.tags=host_${attr.unique.hostname}"
        ]
        port = "http"
        check {
          name     = "victoriametrics-check"
          type     = "http"
          method   = "GET"
          path     = "/health"
          interval = "20s"
          timeout  = "5s"
          check_restart {
            limit = 3
            grace = "10s"
          }
        }
      }
    }
  }
}

Nomad Client logs (if appropriate)

I can see this error in the GUI and in the logs:

Dez 30 11:29:29 evelyn nomad[3146]:     2020-12-30T11:29:29.308+0100 [ERROR] client.alloc_runner.task_runner: update hook failed: alloc_id=68a48255-bbea-7801-7ab0-22f49f05e97a task=victoriametrics-task name=consul_services error="error getting address for check "victoriametrics-check": invalid port "http": port label not found"

apollo13 commented Jan 4, 2021

Can you try whether the errors are gone if you also move the service stanza up to the group level?
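For reference, a minimal sketch of that suggestion applied to the job above (the service block moved from the task into the group, otherwise unchanged; abbreviated):

```hcl
group "victoriametrics-group" {
  network {
    port "http" {
      to = 8428
    }
  }

  # service moved from task level to group level; at group level
  # it is associated with the group network's "http" port
  service {
    name = "victoriametrics"
    port = "http"
    check {
      name     = "victoriametrics-check"
      type     = "http"
      path     = "/health"
      interval = "20s"
      timeout  = "5s"
    }
  }

  task "victoriametrics-task" {
    driver = "docker"
    # ... unchanged ...
  }
}
```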


shoenig commented Jan 4, 2021

Hi @MorphBonehunter, I think this job is actually not quite valid - the group.network stanza is trying to map a port using "to", but that can only be used in bridge mode.
https://www.nomadproject.io/docs/job-specification/network#to
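Per those docs, a network stanza that uses "to" would declare bridge mode explicitly; a sketch of what that would look like for this job:

```hcl
network {
  mode = "bridge"  # "to" port mapping is only valid in bridge networking
  port "http" {
    to = 8428      # forward the allocated port to 8428 inside the network namespace
  }
}
```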


MorphBonehunter commented Jan 4, 2021

@shoenig this is the case.
As you can see, I use the Docker driver without a specific network mode setting, so this defaults to bridge.
As the behavior only happens sporadically, I guess this is more a race condition than a configuration problem.
Also, the Nomad + Docker part seems to work when this occurs; it's the Consul part that fails.


MorphBonehunter commented Jan 6, 2021

Today I encountered the problem again.
I deployed a new job and upgraded it after a while with a change to the HTTP check path, and again I got:

Jan  6 16:27:11 argos nomad[766]:     2021-01-06T16:27:11.226+0100 [ERROR] client.alloc_runner.task_runner: update hook failed: alloc_id=86bc96db-adf7-6923-ca5f-8cebb4f3b977 task=assets-task name=consul_services error="error getting address for check "assets-check": invalid port "http": port label not found"

This time it was also followed by some other messages which seem strange:

Jan  6 16:28:32 argos consul[596]:     2021-01-06T16:28:32.204+0100 [ERROR] agent.http: Request error: method=PUT url=/v1/agent/check/register from=127.0.0.1:41498 error="ServiceID "_nomad-task-86bc96db-adf7-6923-ca5f-8cebb4f3b977-assets-task-assets-http" does not exist"
Jan  6 16:28:32 argos nomad[766]:     2021-01-06T16:28:32.205+0100 [WARN]  consul.sync: failed to update services in Consul: error="Unexpected response code: 500 (ServiceID "_nomad-task-86bc96db-adf7-6923-ca5f-8cebb4f3b977-assets-task-assets-http" does not exist)"
Jan  6 16:28:32 argos consul[596]:     2021-01-06T16:28:32.213+0100 [ERROR] agent.http: Request error: method=PUT url=/v1/agent/check/register from=127.0.0.1:41498 error="ServiceID "_nomad-task-86bc96db-adf7-6923-ca5f-8cebb4f3b977-assets-task-assets-http" does not exist"

Which makes sense, as the registration doesn't work.
After a while the health timeout kicked in and the allocation was restarted.
This time it was broken completely:

Jan  6 16:28:32 argos nomad[766]:     2021-01-06T16:28:32.730+0100 [ERROR] client.driver_mgr.docker: failed to create container configuration: driver=docker image_name=registry.<redacted>/webserver:2021-01-06-15-54 image_id=sha256:8f1521bf9c9b23a0857dd5aea6fd9749717e0860e3b9c16ca3c8d90f41f58b7f error="Port "http" not found, check network stanza"
Jan  6 16:28:32 argos nomad[766]:     2021-01-06T16:28:32.732+0100 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=86bc96db-adf7-6923-ca5f-8cebb4f3b977 task=assets-task error="Failed to create container configuration for image "registry.<redacted>/webserver:2021-01-06-15-54" ("sha256:8f1521bf9c9b23a0857dd5aea6fd9749717e0860e3b9c16ca3c8d90f41f58b7f"): Port "http" not found, check network stanza"

Restarting the task didn't work, so I had to resubmit the job with a minor change (CPU MHz) to get it working again.
But after all this, my log is full of messages like:

Jan 06 21:56:26 argos nomad[766]:     2021-01-06T21:56:26.879+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082
Jan 06 21:56:27 argos nomad[766]:     2021-01-06T21:56:27.775+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082
Jan 06 21:56:28 argos nomad[766]:     2021-01-06T21:56:28.676+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082
Jan 06 21:56:29 argos nomad[766]:     2021-01-06T21:56:29.576+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082
Jan 06 21:56:30 argos nomad[766]:     2021-01-06T21:56:30.476+0100 [WARN]  consul.health: watched check not found in Consul: check=assets-check check_id=_nomad-check-2faf5e73bef7ac597aa9ca8a240a13d901442082

Even after a node drain this was spammed, so I had to restart the drained node's Nomad agent.
After the restart I found the following in the log (note: the node was still drained and had 0 allocs):

Jan 06 21:56:57 argos nomad[263003]: ==> Nomad agent started! Log data will stream in below:
Jan 06 21:56:58 argos nomad[263003]:     2021-01-06T21:56:58.842+0100 [ERROR] client.alloc_runner.task_runner: poststart failed: alloc_id=86bc96db-adf7-6923-ca5f-8cebb4f3b977 task=assets-task error="1 error occurred:
Jan 06 21:56:58 argos nomad[263003]:         * poststart hook "consul_services" failed: unable to get address for service "assets": invalid port "http": port label not found
Jan 06 21:56:58 argos nomad[263003]: "

@drewbailey

@MorphBonehunter I believe this issue is fixed and covered by #9736.

You should be able to reliably reproduce this by causing an in-place update to occur (updating a service stanza value). A destructive update, such as updating an environment variable, will create a new allocation and avoid the issue.

#9736 prevents in-place updates from nullifying the task's ports, which eventually causes the error logs you have shared.
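Until a fix is deployed, a workaround consistent with that explanation is to force a destructive update instead of an in-place one, e.g. by changing any task environment variable (the DEPLOY_BUMP name below is made up for illustration):

```hcl
task "victoriametrics-task" {
  driver = "docker"
  env {
    # hypothetical marker variable: changing its value forces a new
    # allocation (destructive update) rather than an in-place update
    DEPLOY_BUMP = "2021-01-06"
  }
  # ...
}
```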

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2022