
Nomad alloc_health_watcher marks alloc healthy before it is even registered in Consul #9175

pere3 opened this issue Oct 26, 2020 · 2 comments

pere3 commented Oct 26, 2020

Nomad version

Nomad v0.12.3 (2db8abd)
Consul v1.0.1

Operating system and Environment details

CentOS Linux release 7.3.1611 (Core)

Issue

The issue is a relatively large delay between the Nomad client registering a new service during a rolling update and that service being synced to the local Consul agent, which can result in the rollout finishing with no live instances of the service in Consul at all.
Judging by nomad monitor -log-level=TRACE logs (core excerpts attached below), alloc_health_watcher fires for the two new allocs on this client and even marks them healthy, while the actual consul.sync registrations appear long after this process is done: in the trace, health is set at 15:02:53 and 15:02:58, but the first sync that registers a check only completes at 15:03:04.
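For reference, these are the update stanza parameters the health watcher works from. Our job only sets max_parallel and min_healthy_time, so the rest fall back to Nomad defaults (a sketch of the effective stanza, assuming 0.12 defaults; note the default 5m healthy_deadline matches the 5-minute deadlines in the trace):

update {
  max_parallel     = 1
  min_healthy_time = "10s"
  health_check     = "checks"  # default: health is gated on Consul check status
  healthy_deadline = "5m"      # default: matches deadline=15:07:42 for the watch started at 15:02:42
}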

We discussed this issue with @nickethier on the hangops Slack and agreed to continue the discussion in an issue thread here:
https://hangops.slack.com/archives/C1CAYV38T/p1603271878496000

We also sent an email to nomad-oss-debug@hashicorp.com with the subject "nomad/consul, service registration delay, slack thread", containing a full trace log of the rollout process.

Reproduction steps

A simple rolling update: resubmit the job below with a new IMAGE_TAG and let the deployment replace the allocs one at a time (max_parallel = 1).

Job file (before env substitution)

job "microservice-job" {
  datacenters = ["$DATACENTER"]

  type = "service"
  update {
    max_parallel = 1
    min_healthy_time = "10s"
  }

  group "microservice-job-web" {
    count = 3

    restart {
      mode = "delay"
      attempts = 3
      interval = "30s"
      delay = "10s"
    }

    network {
      port "http" {
        to = 80
      }
    }
    service {
      port = "http"
      task = "microservice-job-web"
      name = "microservice-job"
      tags = ["production", "containerized", "csharp"]

      check {
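        # initial_status registers this check as critical in Consul until its first successful poll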
        type = "http"
        path = "/ping/?noresponse"
        interval = "5s"
        timeout = "2s"
        initial_status = "critical"
      }
    }

    task "microservice-job-web" {
      driver = "docker"

      shutdown_delay = "15s"

      env {
        CONSUL_HTTP_ADDR = "${meta.consul_http_addr}"
        HOST_IP = "${attr.unique.network.ip-address}"
        MCS_RUN_ENV = "PRODUCTION"
      }

      config {
        image = "docker.registry.com/microservice-job:$IMAGE_TAG"

        auth {
          username = "$DOCKER_REGISTRY_USERNAME"
          password = "$DOCKER_REGISTRY_PASSWORD"
        }

        cpu_cfs_period = 100000
        cpu_hard_limit = true

        labels = {
          application = "microservice-job"
          runtime = "csharp"
          type = "web"
        }

        args = ["WebApi.dll"]
        work_dir = "/app/web"
        ports = ["http"]
      }

      resources {
        cpu = 3072
        memory = 512
      }
    }
  }
}
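As a stopgap while this is debugged, one thing we may try (speculative, and only helpful if the watcher starts gating on the check once it actually exists in Consul) is stretching the health window past the consul.sync cadence visible in the logs below, roughly one sync every 15 seconds:

update {
  max_parallel     = 1
  # hypothetical value: longer than the ~15s consul.sync cadence in the trace,
  # so the check has time to be registered and polled before health is decided
  min_healthy_time = "30s"
}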

Nomad Client logs

$ grep -E 'consul.sync|alloc_health_watcher' new_traces.out
2020-10-26T15:01:56.355+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-10-26T15:02:32.922+0300 [DEBUG] consul.sync: sync complete: registered_services=77 deregistered_services=1 registered_checks=0 deregistered_checks=1
2020-10-26T15:02:42.566+0300 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: watching: alloc_id=da7072aa-6074-c068-ee05-0f65a3dd8f3f deadline="2020-10-26 15:07:42.566344723 +0300 MSK m=+444288.247218999" checks=true min_healthy_time=10s
2020-10-26T15:02:48.180+0300 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: watching: alloc_id=63c0b3e3-e48f-52a9-df7b-f4ed161ee441 deadline="2020-10-26 15:07:48.180950848 +0300 MSK m=+444293.861825126" checks=true min_healthy_time=10s
2020-10-26T15:02:49.239+0300 [DEBUG] consul.sync: sync complete: registered_services=77 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-10-26T15:02:53.516+0300 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: health set: alloc_id=da7072aa-6074-c068-ee05-0f65a3dd8f3f healthy=true
2020-10-26T15:02:58.952+0300 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: health set: alloc_id=63c0b3e3-e48f-52a9-df7b-f4ed161ee441 healthy=true
2020-10-26T15:03:04.796+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=1 deregistered_checks=0
2020-10-26T15:03:18.985+0300 [DEBUG] consul.sync: sync complete: registered_services=77 deregistered_services=1 registered_checks=0 deregistered_checks=1
2020-10-26T15:03:33.466+0300 [DEBUG] consul.sync: sync complete: registered_services=77 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-10-26T15:03:48.621+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=1 deregistered_checks=0
2020-10-26T15:04:03.614+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-10-26T15:04:18.487+0300 [DEBUG] consul.sync: sync complete: registered_services=78 deregistered_services=0 registered_checks=0 deregistered_checks=0

pere3 commented Nov 1, 2020

UP, any updates?


tgross commented Jul 8, 2021

Doing some issue cleanup. Likely cause of this is #3935, but let's keep this open till that work is done.
