Allocation not being stopped after promotion (using connect) #7540

Closed
jorgemarey opened this issue Mar 29, 2020 · 4 comments · Fixed by #7542
Labels: theme/consul/connect (Consul Connect integration), type/bug

Comments

@jorgemarey
Contributor

Nomad version

Nomad v0.10.5 (4eb2ca3)
Consul v1.7.2

Operating system and Environment details

Issue

When deploying a job with two groups that use connect sidecars and canaries, if a change is made to one of the groups and the deployment is then promoted, stopping the job afterwards leaves the allocation of the unchanged group running.

Reproduction steps

  1. run job file
  2. change the meta version in the job (for example)
  3. run job again
  4. promote job
  5. stop job
  • The allocation of the unchanged group will keep running

Job file (if appropriate)

I managed to reproduce this consistently using this job file:

job "test" {
  datacenters = ["dc1"]

  group "backend" {
    update {
      max_parallel     = 1
      canary           = 1
      min_healthy_time = "10s"
      healthy_deadline = "2m"
      auto_revert      = true
      auto_promote     = false
    }

    meta {
      version = "v1"
    }

    service {
      name = "demo-backend"
      port = "10080"
      connect {
        sidecar_service {}
      }
    }

    network {
      mode = "bridge"
      port "http" {
        to = 10080
      }
    }

    task "backend" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args    = ["999"]
      }
    }
  }

  group "proxy" {
    network {
      mode = "bridge"
      port "http" {
        to = 10080
      }
    }

    service {
      name = "demo-proxy"
      port = "10080"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "demo-backend"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "proxy" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args    = ["999"]
      }
    }
  }
}

Nomad logs

Using the previous job file, these are the logs after the stop request.

    2020-03-29T14:29:41.692+0200 [DEBUG] http: request complete: method=DELETE path=/v1/job/test?purge=true duration=1.121622ms
    2020-03-29T14:29:41.692+0200 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=57575432-e5c4-9bdb-39dd-4489c11046c9 job_id=test namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 2)
Desired Changes for "backend": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "proxy": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)"
    2020-03-29T14:29:41.693+0200 [DEBUG] http: request complete: method=GET path=/v1/evaluation/57575432-e5c4-9bdb-39dd-4489c11046c9 duration=231.186µs
    2020-03-29T14:29:41.694+0200 [DEBUG] worker: submitted plan for evaluation: eval_id=57575432-e5c4-9bdb-39dd-4489c11046c9
    2020-03-29T14:29:41.695+0200 [DEBUG] worker.service_sched: setting eval status: eval_id=57575432-e5c4-9bdb-39dd-4489c11046c9 job_id=test namespace=default status=complete
    2020-03-29T14:29:41.696+0200 [DEBUG] worker: updated evaluation: eval="<Eval "57575432-e5c4-9bdb-39dd-4489c11046c9" JobID: "test" Namespace: "default">"
    2020-03-29T14:29:41.696+0200 [DEBUG] worker: ack evaluation: eval_id=57575432-e5c4-9bdb-39dd-4489c11046c9
    2020-03-29T14:29:41.696+0200 [DEBUG] client: updated allocations: index=46 total=3 pulled=2 filtered=1
    2020-03-29T14:29:41.695+0200 [DEBUG] http: request complete: method=GET path=/v1/evaluation/57575432-e5c4-9bdb-39dd-4489c11046c9/allocations duration=264.049µs
    2020-03-29T14:29:41.696+0200 [DEBUG] client: allocation updates: added=0 removed=0 updated=2 ignored=1
    2020-03-29T14:29:41.696+0200 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=2 ignored=1 errors=0
    2020-03-29T14:29:41.705+0200 [DEBUG] consul.sync: sync complete: registered_services=0 deregistered_services=2 registered_checks=0 deregistered_checks=0
    2020-03-29T14:29:41.725+0200 [DEBUG] client.driver_mgr.exec.executor.nomad: time="2020-03-29T14:29:41+02:00" level=warning msg="no such directory for freezer.state": alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 driver=exec task_name=backend
    2020-03-29T14:29:41.725+0200 [DEBUG] client.driver_mgr.exec.executor.nomad: time="2020-03-29T14:29:41+02:00" level=warning msg="no such directory for freezer.state": alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 driver=exec task_name=backend
    2020-03-29T14:29:41.727+0200 [DEBUG] client.driver_mgr.exec.executor: plugin process exited: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 driver=exec task_name=backend path=/home/jmarey/Downloads/nomad pid=3699
    2020-03-29T14:29:41.727+0200 [DEBUG] client.driver_mgr.exec.executor: plugin exited: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 driver=exec task_name=backend
    2020-03-29T14:29:41.731+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 task=backend path=/home/jmarey/Downloads/nomad pid=3688
    2020-03-29T14:29:41.731+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 task=backend
    2020-03-29T14:29:41.731+0200 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 task=backend
    2020-03-29T14:29:41.843+0200 [DEBUG] client: updated allocations: index=48 total=3 pulled=1 filtered=2
    2020-03-29T14:29:41.843+0200 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
    2020-03-29T14:29:41.843+0200 [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=205d1e8f-a1a6-9f18-b91d-d053238e0b32 skipped_modify_index=46 new_modify_index=46
    2020-03-29T14:29:41.844+0200 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
    2020-03-29T14:29:41.890+0200 [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/home/jmarey/Downloads/nomad pid=3631
    2020-03-29T14:29:41.890+0200 [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
    2020-03-29T14:29:41.906+0200 [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:72e91d8680d853b874d9aedda3a4b61048630d2043dd490ff36f5b0929f69874 references=3
    2020-03-29T14:29:41.907+0200 [INFO]  client.gc: marking allocation for GC: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90
    2020-03-29T14:29:41.909+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 task=connect-proxy-demo-backend path=/home/jmarey/Downloads/nomad pid=3513
    2020-03-29T14:29:41.909+0200 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 task=connect-proxy-demo-backend
    2020-03-29T14:29:41.909+0200 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=6a31d992-06ae-e671-5a58-40f04a059d90 task=connect-proxy-demo-backend

    2020-03-29T14:29:42.049+0200 [DEBUG] client: updated allocations: index=49 total=3 pulled=1 filtered=2
    2020-03-29T14:29:42.049+0200 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
    2020-03-29T14:29:42.049+0200 [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=205d1e8f-a1a6-9f18-b91d-d053238e0b32 skipped_modify_index=46 new_modify_index=46
    2020-03-29T14:29:42.049+0200 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
    2020-03-29T14:29:42.447+0200 [DEBUG] client: updated allocations: index=50 total=3 pulled=1 filtered=2
    2020-03-29T14:29:42.448+0200 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
    2020-03-29T14:29:42.448+0200 [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=205d1e8f-a1a6-9f18-b91d-d053238e0b32 skipped_modify_index=46 new_modify_index=46
    2020-03-29T14:29:42.448+0200 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
    2020-03-29T14:29:42.700+0200 [DEBUG] http: request complete: method=GET path=/v1/evaluation/57575432-e5c4-9bdb-39dd-4489c11046c9 duration=280.304µs
    2020-03-29T14:29:42.701+0200 [DEBUG] http: request complete: method=GET path=/v1/evaluation/57575432-e5c4-9bdb-39dd-4489c11046c9/allocations duration=198.167µs

The allocation of the group proxy (the one that didn't require promotion) doesn't get stopped.

@jorgemarey
Contributor Author

Hi again, just to give more information. Here is what I found.

When I launch the job without the proxy stanza in the group, these are the logs (from the run that deploys the canary allocation):

2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner: running update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 start="2020-03-29 17:24:44.298130131 +0200 CEST m=+54.580549686"
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner: running update hook: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 name=alloc_health_watcher start="2020-03-29 17:24:44.298180206 +0200 CEST m=+54.580599730"
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: watching: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 deadline="2020-03-29 17:29:44.298233854 +0200 CEST m=+354.580653429" checks=true min_healthy_time=10s
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner: finished update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 name=alloc_health_watcher end="2020-03-29 17:24:44.298284768 +0200 CEST m=+54.580704326" duration=104.596µs
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner: running update hook: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 name=group_services start="2020-03-29 17:24:44.29830535 +0200 CEST m=+54.580724886"
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner: finished update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 name=group_services end="2020-03-29 17:24:44.298409922 +0200 CEST m=+54.580829450" duration=104.564µs
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner: running update hook: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 name=consul_socket start="2020-03-29 17:24:44.298430588 +0200 CEST m=+54.580850126"
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner: finished update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 name=consul_socket end="2020-03-29 17:24:44.298445813 +0200 CEST m=+54.580865351" duration=15.225µs
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner: finished update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 end="2020-03-29 17:24:44.298461321 +0200 CEST m=+54.580880854" duration=331.168µs
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.task_runner: running update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 task=connect-proxy-demo-proxy start="2020-03-29 17:24:44.298484675 +0200 CEST m=+54.580904206"
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 task=connect-proxy-demo-proxy name=script_checks start="2020-03-29 17:24:44.298571198 +0200 CEST m=+54.580990757"
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.task_runner: running update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 task=proxy start="2020-03-29 17:24:44.298484946 +0200 CEST m=+54.580904486"
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 task=proxy name=script_checks start="2020-03-29 17:24:44.298686142 +0200 CEST m=+54.581105723"
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 task=proxy name=script_checks end="2020-03-29 17:24:44.298712913 +0200 CEST m=+54.581132454" duration=26.731µs
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 task=proxy end="2020-03-29 17:24:44.298732819 +0200 CEST m=+54.581152360" duration=247.874µs
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 task=connect-proxy-demo-proxy name=script_checks end="2020-03-29 17:24:44.298593583 +0200 CEST m=+54.581013133" duration=22.376µs
2020-03-29T17:24:44.298+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=cb95c1d8-8106-6a59-a424-3f18c7ab88a0 task=connect-proxy-demo-proxy end="2020-03-29 17:24:44.29877074 +0200 CEST m=+54.581190292" duration=286.086µs

When I launch the job with the proxy stanza (as provided in the example), the logs show:

2020-03-29T17:29:39.070+0200 [TRACE] client.alloc_runner: running update hooks: alloc_id=7ff0339f-2fe0-1ef2-1cc1-1118ff3c7cc2 start="2020-03-29 17:29:39.070170671 +0200 CEST m=+67.289380347"
2020-03-29T17:29:39.070+0200 [TRACE] client.alloc_runner: running update hook: alloc_id=7ff0339f-2fe0-1ef2-1cc1-1118ff3c7cc2 name=alloc_health_watcher start="2020-03-29 17:29:39.07020945 +0200 CEST m=+67.289419142"
2020-03-29T17:29:39.070+0200 [TRACE] client.alloc_runner.runner_hook.alloc_health_watcher: watching: alloc_id=7ff0339f-2fe0-1ef2-1cc1-1118ff3c7cc2 deadline="2020-03-29 17:34:39.070225518 +0200 CEST m=+367.289435197" checks=true min_healthy_time=10s
2020-03-29T17:29:39.070+0200 [TRACE] client.alloc_runner: finished update hooks: alloc_id=7ff0339f-2fe0-1ef2-1cc1-1118ff3c7cc2 name=alloc_health_watcher end="2020-03-29 17:29:39.070270268 +0200 CEST m=+67.289479952" duration=60.81µs
2020-03-29T17:29:39.070+0200 [TRACE] client.alloc_runner: running update hook: alloc_id=7ff0339f-2fe0-1ef2-1cc1-1118ff3c7cc2 name=group_services start="2020-03-29 17:29:39.070290398 +0200 CEST m=+67.289500087"

It looks like the group_services update hook never finishes.

@jorgemarey
Contributor Author

Found the problem.

The groupServiceHook.Update method calls UpdateTask here. That method acquires the builder's mutex.

nomad/client/taskenv/env.go

Lines 508 to 513 in fce99f1

// Update task updates the environment based on a new alloc and task.
func (b *Builder) UpdateTask(alloc *structs.Allocation, task *structs.Task) *Builder {
    b.mu.Lock()
    defer b.mu.Unlock()
    return b.setTask(task).setAlloc(alloc)
}

The setAlloc method calls SetUpstreams here, which also tries to acquire the same mutex.

nomad/client/taskenv/env.go

Lines 752 to 758 in fce99f1

// SetUpstreams defined by connect enabled group services
func (b *Builder) SetUpstreams(upstreams []structs.ConsulUpstream) *Builder {
    b.mu.Lock()
    b.upstreams = upstreams
    b.mu.Unlock()
    return b
}

This makes the call to groupServiceHook.Update deadlock, and that blocks the update for the allocation.
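
To illustrate the problem in isolation, here is a minimal, self-contained sketch of the pattern described above. The Builder here is a simplified stand-in for taskenv.Builder (types and fields are not the real ones); the point is only that Go's sync.Mutex is not reentrant, so SetUpstreams blocks forever when it is reached while UpdateTask still holds the same lock.

package main

import (
    "fmt"
    "sync"
)

// Builder is a stripped-down stand-in for taskenv.Builder.
type Builder struct {
    mu        sync.Mutex
    upstreams []string
}

// SetUpstreams locks the mutex itself, mirroring the snippet above.
func (b *Builder) SetUpstreams(upstreams []string) *Builder {
    b.mu.Lock() // never succeeds if the caller already holds b.mu
    b.upstreams = upstreams
    b.mu.Unlock()
    return b
}

// setAlloc stands in for the real setAlloc, which ends up calling
// SetUpstreams when the alloc has connect-enabled group services.
func (b *Builder) setAlloc(upstreams []string) *Builder {
    return b.SetUpstreams(upstreams)
}

// UpdateTask holds the lock for the whole call, so the nested
// SetUpstreams can never acquire it: a self-deadlock.
func (b *Builder) UpdateTask(upstreams []string) *Builder {
    b.mu.Lock()
    defer b.mu.Unlock()
    return b.setAlloc(upstreams)
}

func main() {
    b := &Builder{}
    fmt.Println("calling UpdateTask...")
    b.UpdateTask([]string{"demo-backend"}) // hangs here
    fmt.Println("done")                    // never printed
}

Running this sketch aborts with Go's "fatal error: all goroutines are asleep - deadlock!". In a real client process there are plenty of other goroutines, so the runtime can't detect anything and the goroutine simply hangs inside the group_services update hook, which matches the traces above. A common way to avoid this kind of re-entrancy is to have the exported, locking setter delegate to an unexported helper that assumes the lock is already held.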

@shoenig
Member

shoenig commented Mar 30, 2020

Thanks for reporting this, and for the fix @jorgemarey!

shoenig added a commit that referenced this issue Mar 30, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2022