
Service not registered in consul when service block is added as job update #9707

Closed · ku1ik opened this issue Dec 25, 2020 · 4 comments · Fixed by #9720
Assignees: shoenig
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/consul, type/bug


ku1ik commented Dec 25, 2020

Nomad version

Nomad v1.0.1 (c9c68aa)

Consul version

Consul v1.9.0
Revision a417fe510

Operating system and Environment details

I originally stumbled upon this problem on a cluster running Ubuntu 20.04. I can also reproduce it on macOS Catalina 10.15.7 (see the steps below).

Issue

When a job has been running without a service definition, adding a service block and re-submitting the job has no effect: the service is not registered in consul.

The logs contain no warnings or errors; it looks like there is no attempt at service registration at all.

Reproduction steps

1. Start consul: consul agent -dev
2. Start nomad: nomad agent -dev
3. Run the job file below: nomad run echo.nomad
4. Verify the service has not been registered in consul, as expected at this stage.
5. Uncomment the service block and submit the updated job: nomad run echo.nomad
6. Verify the service has still not been registered in consul, even though it now should be (one way to check is shown below).
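
For the two verification steps, one quick way to inspect what Consul has registered (this assumes the default dev-mode agent listening on 127.0.0.1:8500, and the echo service name from the job file below):

```sh
# List all services Consul knows about; "echo" should appear after step 5 but does not.
consul catalog services

# Or query the catalog HTTP API for the service directly (returns [] when unregistered).
curl -s http://127.0.0.1:8500/v1/catalog/service/echo
```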

I also tested the reverse scenario:

I first submitted a fresh job (not previously existing in the nomad cluster) with the service block. The service was successfully registered in consul. Then I removed the service block from the job spec and re-submitted the job. The service was properly removed from consul. So it seems only registration is affected by this bug.

Job file (if appropriate)

job "echo" {
  datacenters = ["dc1"]

  group "web" {
    network {
      port "http" {
        to = 5678
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-text", "hello world"]
        ports = ["http"]
      }

    //   service {
    //     name = "echo"
    //     port = "http"
    //   }
    }
  }
}
ku1ik commented Jan 1, 2021

Found more issues with this.

I've submitted a new job with the service block in place from the start:

job "echo" {
  datacenters = ["dc1"]

  group "web" {
    network {
      port "http" {
        to = 5678
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-text", "hello world"]
        ports = ["http"]
      }

      service {
        name = "echo"
        port = "http"
      }
    }
  }
}

That worked fine: service echo got registered in consul.

Then I added a health check to the job spec and submitted this updated job (only the check block is new here):

job "echo" {
  datacenters = ["dc1"]

  group "web" {
    network {
      port "http" {
        to = 5678
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-text", "hello world"]
        ports = ["http"]
      }

      service {
        name = "echo"
        port = "http"

        check {
          name     = "http port alive"
          type     = "http"
          path     = "/"
          interval = "30s"
          timeout  = "2s"
        }
      }
    }
  }
}

The health check was not added; instead, this error was logged:

client.alloc_runner.task_runner: update hook failed: alloc_id=dea08d77-0513-c8d2-dd6a-1b7924791171 task=server name=consul_services error="error getting address for check "http port alive": invalid port "http": port label not found"

Full related nomad log output:

    2021-01-02T00:16:33.914+0100 [DEBUG] worker: dequeued evaluation: eval_id=d8e9a96f-9cc4-3ffe-3eaf-a4dc7697452a
    2021-01-02T00:16:33.914+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?index=17 duration=28.841673045s
    2021-01-02T00:16:33.914+0100 [DEBUG] http: request complete: method=PUT path=/v1/jobs duration=1.260682ms
    2021-01-02T00:16:33.914+0100 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=d8e9a96f-9cc4-3ffe-3eaf-a4dc7697452a job_id=echo namespace=default results="Total changes: (place 0) (destructive 0) (inplace 1) (stop 0)
Created Deployment: "be9b850b-9c91-b6a4-f416-b995564c5c97"
Desired Changes for "web": (place 0) (inplace 1) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
    2021-01-02T00:16:33.915+0100 [DEBUG] worker: submitted plan for evaluation: eval_id=d8e9a96f-9cc4-3ffe-3eaf-a4dc7697452a
    2021-01-02T00:16:33.915+0100 [DEBUG] worker.service_sched: setting eval status: eval_id=d8e9a96f-9cc4-3ffe-3eaf-a4dc7697452a job_id=echo namespace=default status=complete
    2021-01-02T00:16:33.916+0100 [DEBUG] worker: updated evaluation: eval="<Eval "d8e9a96f-9cc4-3ffe-3eaf-a4dc7697452a" JobID: "echo" Namespace: "default">"
    2021-01-02T00:16:33.916+0100 [DEBUG] worker: ack evaluation: eval_id=d8e9a96f-9cc4-3ffe-3eaf-a4dc7697452a
    2021-01-02T00:16:33.916+0100 [DEBUG] http: request complete: method=GET path=/v1/evaluation/d8e9a96f-9cc4-3ffe-3eaf-a4dc7697452a duration=627.516µs
    2021-01-02T00:16:33.916+0100 [DEBUG] client: updated allocations: index=21 total=1 pulled=1 filtered=0
    2021-01-02T00:16:33.916+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=0
    2021-01-02T00:16:33.916+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=0 errors=0
    2021-01-02T00:16:33.917+0100 [ERROR] client.alloc_runner.task_runner: update hook failed: alloc_id=dea08d77-0513-c8d2-dd6a-1b7924791171 task=server name=consul_services error="error getting address for check "http port alive": invalid port "http": port label not found"
    2021-01-02T00:16:33.918+0100 [DEBUG] http: request complete: method=GET path=/v1/evaluation/d8e9a96f-9cc4-3ffe-3eaf-a4dc7697452a/allocations duration=604.599µs
    2021-01-02T00:16:34.002+0100 [DEBUG] client: updated allocations: index=23 total=1 pulled=0 filtered=1
    2021-01-02T00:16:34.002+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
    2021-01-02T00:16:34.002+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0

shoenig self-assigned this Jan 4, 2021

shoenig (Member) commented Jan 4, 2021

Thanks for reporting, @sickill. Looks like the problem is that we try to optimize out the service registration task runner hook if the initial task doesn't have any services.

https://github.com/hashicorp/nomad/blob/v1.0.1/client/allocrunner/taskrunner/task_runner_hooks.go#L105

This of course prevents newly defined services from being registered, as this is the hook that handles service registration updates as well.

Group level services don't have this bug, as the equivalent group service alloc runner hook is always enabled.
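
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above. The types and names are hypothetical stand-ins, not the actual Nomad source: the point is that a hook installed conditionally at task startup never observes a service added by a later in-place update.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real Nomad types.
type Task struct {
	Services []string
}

type hook interface {
	Update(t *Task)
}

// serviceHook registers the task's services on each update.
type serviceHook struct{}

func (serviceHook) Update(t *Task) {
	for _, s := range t.Services {
		fmt.Println("registering service:", s)
	}
}

// initHooks mirrors the buggy "optimization": the services hook is only
// installed when the task has services at submission time.
func initHooks(t *Task) []hook {
	var hooks []hook
	if len(t.Services) > 0 { // task submitted without services -> no hook
		hooks = append(hooks, serviceHook{})
	}
	return hooks
}

func main() {
	task := &Task{} // initial job: no service block
	hooks := initHooks(task)

	// A later job update adds a service, but no hook exists to act on it.
	task.Services = append(task.Services, "echo")
	for _, h := range hooks {
		h.Update(task)
	}
	fmt.Printf("hooks installed: %d (service silently never registered)\n", len(hooks))
}
```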

shoenig added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) label Jan 4, 2021
shoenig added a commit that referenced this issue Jan 4, 2021
Previously, Nomad would optimize out the services task runner
hook for tasks which were initially submitted with no services
defined. This causes a problem when the job is later updated to
include service(s) on that task: nothing happens, because the
hook is not present to handle the service registration in its
Update method.

Instead, always enable the services hook. The group services
alloc runner hook is already always enabled.

Fixes #9707
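
Reusing the hypothetical types from the sketch in the comment above, the fix this commit describes amounts to dropping the condition, so the hook is always installed and its Update can register services added later:

```go
// Fixed (sketch): install the services hook unconditionally, matching
// the group services alloc runner hook, which is already always enabled.
func initHooks(t *Task) []hook {
	return []hook{serviceHook{}}
}
```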
shoenig added a commit that referenced this issue Jan 5, 2021 (same commit message as above; Fixes #9707).
ku1ik commented Jan 5, 2021

🎉

backspace pushed a commit that referenced this issue Jan 22, 2021 (same commit message as above; Fixes #9707).
github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2022