
Disconnected allocations are not replaced during a canary deployment once resources are made available #16644

Open
lgfa29 opened this issue Mar 24, 2023 · 0 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/deployments theme/edge type/bug

Comments

lgfa29 commented Mar 24, 2023

Nomad version

Nomad v1.5.2
BuildDate 2023-03-21T22:54:38Z
Revision 9a2fdb5f53dce81edf2802f0b64962e07596fd03

Operating system and Environment details

MacOS

Issue

During a canary deployment that is pending promotion, if a client running allocations with max_client_disconnect set becomes disconnected, and there is no available capacity to replace those allocations, the allocations are never replaced, even when capacity becomes available later on.

Reproduction steps

Start 1 client and 1 server.

name      = "server1"
data_dir  = "/tmp/nomad/server1"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = false
}

name      = "client1"
data_dir  = "/tmp/nomad/client1"

ports {
  http = 4656
  rpc  = 4657
  serf = 4658
}

server {
  enabled = false
}

client {
  enabled = true

  server_join {
    retry_join = ["127.0.0.1"]
  }
}
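With the two configuration files above saved to disk (the file names below are assumptions), the agents can be started in separate terminals:

```shell
# Start the server agent (assumes the server config above is saved as server1.hcl)
nomad agent -config=server1.hcl

# In a second terminal, start the client agent (assumes client1.hcl)
nomad agent -config=client1.hcl

# Verify the server and client are up
nomad server members
nomad node status
```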

Run a sample job with max_client_disconnect and an update block.

job "example" {
  group "cache" {
    max_client_disconnect = "12h"

    network {
      port "db" {
        to = 6379
      }
    }

    update {
      canary       = 1
      auto_promote = false
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }
  }
}
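Assuming the job spec above is saved as example.nomad.hcl (file name is an assumption):

```shell
# Submit the job and confirm the allocation is running
nomad job run example.nomad.hcl
nomad job status example
```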

Update the job (for example, change the image to redis:6). Once the deployment is healthy and awaiting promotion, stop the client. The allocations will eventually be marked as unknown.
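This step can be driven from the CLI as follows (the deployment ID placeholder must be filled in from the job status output):

```shell
# Edit the job file so the image is redis:6, then submit the new version
nomad job run example.nomad.hcl

# Wait until the canary is healthy and the deployment is awaiting promotion
nomad deployment status <deployment-id>

# Stop the client agent (Ctrl-C in its terminal or kill its process);
# after the heartbeat grace period, check that the allocations show as "unknown"
nomad job status example
```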

Start a second client.

name      = "client2"
data_dir  = "/tmp/nomad/client2"

ports {
  http = 4666
  rpc  = 4667
  serf = 4668
}

server {
  enabled = false
}

client {
  enabled = true

  server_join {
    retry_join = ["127.0.0.1"]
  }
}
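Assuming the second client config above is saved as client2.hcl (file name is an assumption):

```shell
# Start the second client agent
nomad agent -config=client2.hcl

# In another terminal: the new node registers, but no allocations
# are placed on it while the deployment awaits promotion
nomad node status
nomad job status example
```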

Expected Result

Allocations are placed on the new client.

Actual Result

No allocations are placed until the deployment is either promoted or failed.

Nomad Server logs (if appropriate)

The lack of relevant log lines is the interesting point: when the new node registers, no evaluation is created for the job being deployed. createNodeEvals needs to handle this case so that a node-update eval is created for the job.

    2023-03-24T18:30:23.406-0400 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=49ed55ce-87b7-8b2e-6953-6b6e49f9f8e1 job_id=example namespace=default worker_id=6f88259f-4d38-b901-ebfb-a50522bf87d1
  results=
  | Total changes: (place 2) (destructive 0) (inplace 0) (stop 0) (disconnect 2) (reconnect 0)
  | Desired Changes for "cache": (place 2) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

    2023-03-24T18:30:23.414-0400 [DEBUG] http: request complete: method=GET path=/v1/node/61d9836c-990a-abca-6686-f0c1470826a4 duration="171.417µs"
    2023-03-24T18:30:23.417-0400 [DEBUG] worker: created evaluation: worker_id=6f88259f-4d38-b901-ebfb-a50522bf87d1 eval="<Eval \"bed88082-27f5-1bda-1fed-18a013abf3cf\" JobID: \"example\" Namespace: \"default\">"
    2023-03-24T18:30:23.417-0400 [DEBUG] worker.service_sched: found reschedulable allocs, followup eval created: eval_id=49ed55ce-87b7-8b2e-6953-6b6e49f9f8e1 job_id=example namespace=default worker_id=6f88259f-4d38-b901-ebfb-a50522bf87d1 followup_eval_id=bed88082-27f5-1bda-1fed-18a013abf3cf
    2023-03-24T18:30:23.417-0400 [TRACE] nomad: evaluating plan: plan="(eval 49ed55ce, job example, NodeAllocations: (node[61d9836c] (43732d8b example.cache[0] run) (cc86ecc3 example.cache[0] run)))"
    2023-03-24T18:30:23.433-0400 [DEBUG] worker: submitted plan for evaluation: worker_id=6f88259f-4d38-b901-ebfb-a50522bf87d1 eval_id=49ed55ce-87b7-8b2e-6953-6b6e49f9f8e1
    2023-03-24T18:30:23.433-0400 [DEBUG] worker.service_sched: setting eval status: eval_id=49ed55ce-87b7-8b2e-6953-6b6e49f9f8e1 job_id=example namespace=default worker_id=6f88259f-4d38-b901-ebfb-a50522bf87d1 status=complete
    2023-03-24T18:30:23.433-0400 [DEBUG] http: request complete: method=GET path=/v1/job/example/summary?index=25 duration=12.403996292s
    2023-03-24T18:30:23.433-0400 [DEBUG] http: request complete: method=GET path=/v1/job/example/allocations?index=27 duration=14.399747042s
    2023-03-24T18:30:23.433-0400 [DEBUG] http: request complete: method=GET path="/v1/jobs?meta=true&index=25" duration=16.118075708s
    2023-03-24T18:30:23.433-0400 [DEBUG] http: request complete: method=GET path=/v1/node/61d9836c-990a-abca-6686-f0c1470826a4/allocations?index=26 duration=5.680428792s
    2023-03-24T18:30:23.442-0400 [DEBUG] worker: updated evaluation: worker_id=6f88259f-4d38-b901-ebfb-a50522bf87d1 eval="<Eval \"49ed55ce-87b7-8b2e-6953-6b6e49f9f8e1\" JobID: \"example\" Namespace: \"default\">"
    2023-03-24T18:30:23.442-0400 [DEBUG] worker: ack evaluation: worker_id=6f88259f-4d38-b901-ebfb-a50522bf87d1 eval_id=49ed55ce-87b7-8b2e-6953-6b6e49f9f8e1 type=service namespace=default job_id=example node_id=61d9836c-990a-abca-6686-f0c1470826a4 triggered_by=node-update
    2023-03-24T18:30:23.442-0400 [TRACE] worker: changed workload status: worker_id=6f88259f-4d38-b901-ebfb-a50522bf87d1 from=Scheduling to=WaitingToDequeue

---- New node registers, unblocking the clients UI view ----

    2023-03-24T18:30:37.524-0400 [DEBUG] http: request complete: method=GET path=/v1/nodes?index=29 duration=14.115290916s
    2023-03-24T18:30:37.560-0400 [DEBUG] http: request complete: method=GET path=/v1/node/e82f0ff4-50cd-d8ad-be2f-5924041950d4/allocations duration="159.958µs"
    2023-03-24T18:30:37.564-0400 [DEBUG] http: request complete: method=GET path=/v1/node/e82f0ff4-50cd-d8ad-be2f-5924041950d4 duration="259.958µs"
    2023-03-24T18:30:37.565-0400 [DEBUG] http: request complete: method=GET path=/v1/node/61d9836c-990a-abca-6686-f0c1470826a4 duration="148.292µs"
    2023-03-24T18:30:37.637-0400 [DEBUG] http: request complete: method=GET path=/v1/nodes?index=35 duration=88.838959ms
    2023-03-24T18:30:37.647-0400 [DEBUG] http: request complete: method=GET path=/v1/node/e82f0ff4-50cd-d8ad-be2f-5924041950d4 duration="173.25µs"
    2023-03-24T18:30:37.647-0400 [DEBUG] http: request complete: method=GET path=/v1/node/61d9836c-990a-abca-6686-f0c1470826a4 duration="151.291µs"
    2023-03-24T18:30:43.092-0400 [DEBUG] http: request complete: method=GET path=/v1/nodes?index=36 duration=3.532947958s
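For comparison (not verified against this exact scenario, and it may hit the same reconciler behavior), an evaluation can be forced manually to see whether the scheduler places the disconnected allocations when explicitly triggered:

```shell
# Force a new evaluation for the job without waiting for a node event
nomad job eval example
```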
@lgfa29 lgfa29 added type/bug theme/deployments stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/edge labels Mar 24, 2023
@lgfa29 lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 24, 2023
@lgfa29 lgfa29 moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Mar 24, 2023