Using a task name that is a prefix of another task's name can cause consul service flapping #2474

Closed
wuub opened this issue Mar 24, 2017 · 8 comments

@wuub
Contributor

wuub commented Mar 24, 2017

Nomad version

Output from nomad version
Nomad v0.5.5
Consul v0.7.5
(issue replicated on previous versions of nomad as well)

Operating system and Environment details

Ubuntu 16.04 + test Nomad+Consul cluster

consul agent -dev
nomad agent -dev

Issue

AFAICT the Nomad executor of the shorter-named task (task1) deregisters the service registered by the Nomad executor of the longer-named task (task1-sidecar) every few seconds.

Reproduction steps

  1. Launch a local Nomad + Consul dev cluster.
  2. Run nomad run example.nomad with the attached job file.
  3. Launch consul watch -service=task1 -type=service cat. You'll notice the service is re-registered every few seconds.
  4. Changing task "task1" to task "task1-main" works around the problem.
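
To spot affected jobs before renaming anything, a quick check along these lines may help. It is only a sketch: find_prefix_collisions is a hypothetical helper, not part of Nomad, and it assumes the problem arises whenever one task name is a plain string prefix of another task name in the same group.

```python
from itertools import permutations


def find_prefix_collisions(task_names):
    """Return (shorter, longer) pairs where one task name prefixes another.

    Hypothetical helper for illustration only; it assumes a plain
    string-prefix relationship between task names triggers the flapping.
    """
    return sorted(
        (a, b)
        for a, b in permutations(task_names, 2)
        if a != b and b.startswith(a)
    )


# Task names from the example job file, then with the workaround from step 4.
print(find_prefix_collisions(["task1", "task1-sidecar"]))
print(find_prefix_collisions(["task1-main", "task1-sidecar"]))
```

The first call reports the task1 / task1-sidecar pair; the second returns an empty list, which matches the observation that renaming the task to task1-main avoids the problem.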

Nomad Consul Server logs (if appropriate)

The Nomad server log is clean and shows no signs of misbehavior, but every few seconds the Consul log shows a line like this:

2017/03/24 11:36:48 [DEBUG] http: Request PUT /v1/agent/service/deregister/_nomad-executor-85c8e4ca-b4cb-e69d-90b6-432e2d1b75f8-task1-sidecar-task1 (2.835639ms) from=127.0.0.1:45526
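
The service ID in that log line hints at why the two executors might collide. The sketch below assumes executor service IDs follow the pattern _nomad-executor-<alloc id>-<task name>-<service name> (inferred from the log line, not taken from Nomad's source) and that an executor treats every ID starting with its own _nomad-executor-<alloc id>-<task name> prefix as its own. The names executor_prefix and looks_owned_by are hypothetical.

```python
ALLOC_ID = "85c8e4ca-b4cb-e69d-90b6-432e2d1b75f8"
# Service ID copied from the Consul debug log above.
SERVICE_ID = "_nomad-executor-" + ALLOC_ID + "-task1-sidecar-task1"


def executor_prefix(alloc_id, task_name):
    """Assumed ID prefix for services registered by one task's executor."""
    return "_nomad-executor-{}-{}".format(alloc_id, task_name)


def looks_owned_by(service_id, alloc_id, task_name):
    """Assumed ownership test: a plain string-prefix match."""
    return service_id.startswith(executor_prefix(alloc_id, task_name))


# Under this assumption both executors claim the same service ID.
print(looks_owned_by(SERVICE_ID, ALLOC_ID, "task1-sidecar"))  # True
print(looks_owned_by(SERVICE_ID, ALLOC_ID, "task1"))          # True
print(looks_owned_by(SERVICE_ID, ALLOC_ID, "task1-main"))     # False
```

If the task1 executor has no service of its own matching that ID, it would plausibly deregister it as stale, which would fit the repeated deregister requests in the Consul log.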

Job file (if appropriate)

job "example" {
	datacenters = ["dc1"]
	type = "service"
	group "cache" {
		
		task "task1" {
			driver = "docker"
			config {
				image = "busybox"
				command = "nc"
				args = ["-l", "127.0.0.1:8080"]
			}
		}

		task "task1-sidecar" {
			driver = "docker"
			service {
				name = "task1"
			}
			config {
				image = "busybox"
				command = "nc"
				args = ["-l", "127.0.0.1:8080"]
			}
		}
	}
}
@schmichael
Member

My WIP consul refactor branch will fix this. #2478
It will be in our next release and should land in master next week.

schmichael added a commit that referenced this issue Mar 30, 2017
Fixes #2478 #2474 #1995 #2294

The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.

The Nomad client agent now handles all Consul operations instead of the
executor handling task related operations. When upgrading from an
earlier version of Nomad existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.

Drivers - other than qemu - now support an Exec method for executing
arbitrary commands in a task's environment. This is used to implement
script checks.

Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul related behavior.
schmichael added a commit that referenced this issue Apr 14, 2017
@maguec

maguec commented Apr 18, 2017

I can confirm that this also occurs with Consul 0.8.1 and Nomad 0.5.6 with a single task, as follows:

job "tnin-job" {
  datacenters = ["infra1"]
  type = "service"
  update {
    stagger = "10s"
    max_parallel = 1
  }
  group "tnin-group" {
    count = 2
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    task "tnin-main" {
      driver = "docker"
      config {
        image = "maguec/tabinin:0.6"
        force_pull = true
        port_map {
          tabinin = 4000
        }
      }
      resources {
        cpu    = 400 # 400 MHz
        memory = 256 # 256MB
        network {
          mbits = 10
          port "tabinin" {}
        }
      }
      env {
         NOMAD_API      = "http://nomadmaster.service.infra1:4646"
         NOMAD_CLUSTER  = "infra1"
      }
      service {
        name = "tabinin"
        tags = ["private-web-service"]
        port = "tabinin"
        check {
          name     = "alive"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
    2017/04/18 18:16:13 [INFO] agent: Deregistered check '9069490d6fbb5f99368bee85ba0f36ea16815e5e'

schmichael added a commit that referenced this issue Apr 19, 2017
@schmichael
Member

Fixed in master by #2467. Attached is a binary with the fix built from 53eb407

If you have the time/desire/ability to test it, I'd love confirmation it fixed your particular issue.

linux_amd64.zip

@maguec

maguec commented Apr 22, 2017

Tried running the new version:

[
  {
    "ID": "30c736ab-bb02-99e0-0380-d435de8269b6",
    "Node": "consul-0d22e702b2e5a56ad.infra1",
    "Address": "172.20.1.5",
    "TaggedAddresses": {
      "lan": "172.20.1.5",
      "wan": "172.20.1.5"
    },
    "NodeMeta": {},
    "ServiceID": "_nomad-executor-dd6590fa-503f-def7-a429-863498f8a43b-tabinin-tabinin-private-web-service",
    "ServiceName": "tabinin",
    "ServiceTags": [
      "private-web-service"
    ],
    "ServiceAddress": "172.20.216.164",
    "ServicePort": 33027,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 4853493,
    "ModifyIndex": 4853493
  }
]

The service registers properly. However, when I run a second instance, only one is registered in Consul.

The Nomad API reports a count of two, but only one instance is registered:

    {
      "Name": "tabinin",
      "Count": 2,
      "Constraints": null,
      "RestartPolicy": {
        "Attempts": 10,
        "Interval": 300000000000,
        "Delay": 25000000000,
        "Mode": "delay"
      },

The above consul service entry does not change and another instance is not added
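
The mismatch can be checked mechanically. The sketch below assumes the catalog output has the shape shown above (a JSON list with one object per registered instance) and that the expected number comes from the Count field reported by the Nomad API. The helpers registered_service_ids and missing_instances are hypothetical, and the sample data is a trimmed copy of the output above.

```python
import json

# Trimmed copy of the catalog response shown above (one registered instance).
CATALOG_JSON = """
[
  {
    "Node": "consul-0d22e702b2e5a56ad.infra1",
    "ServiceID": "_nomad-executor-dd6590fa-503f-def7-a429-863498f8a43b-tabinin-tabinin-private-web-service",
    "ServiceName": "tabinin",
    "ServicePort": 33027
  }
]
"""


def registered_service_ids(catalog_json):
    """Return the distinct ServiceID values found in a catalog response."""
    entries = json.loads(catalog_json)
    return sorted({entry["ServiceID"] for entry in entries})


def missing_instances(catalog_json, expected_count):
    """Return how many expected instances have no catalog entry."""
    return max(0, expected_count - len(registered_service_ids(catalog_json)))


# Count is 2 in the Nomad API output, but the catalog lists one instance.
print(missing_instances(CATALOG_JSON, expected_count=2))  # 1
```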

@schmichael schmichael reopened this May 4, 2017
@schmichael
Member

@maguec I've been unable to reproduce that issue using either the build above or master. Steps to reproduce:

consul agent -dev > consul.out &
nomad agent -dev > nomad.out &
nomad init
nomad run example.nomad
sed -i -e 's/count = 1/count = 2/g' example.nomad
nomad run example.nomad
curl localhost:8500/v1/catalog/service/global-redis-check # shows 2 entries

Attached a build from 499ada5

linux_amd64.zip

Please open a new bug if it is count related since this one is specifically for name prefixes (which hopefully is fixed!).

Thanks!

@maguec

maguec commented May 12, 2017

I figured out that changing the Nomad config to remove the remote Consul server address, running a Consul server on the Nomad node instead, and leaving the address at its default fixed the issue:

 consul {
-  address = "{{ consul }}:8500"
   auto_advertise = false
 }

@schmichael
Member

@maguec Ah, fantastic! That's a requirement our docs should probably make explicit: sharing a remote Consul doesn't work.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 13, 2022