
Nomad 0.12.4 stop allocation issue #8866

Closed
roman-vynar opened this issue Sep 10, 2020 · 4 comments · Fixed by #8867
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/scheduling type/bug

Comments

@roman-vynar
Contributor

Nomad version

0.12.4 (both servers/clients)

Issue

I have a "service" type job running with count=1.
When I stop an alloc, Nomad starts 2 new allocs instead of 1.

It was fine on 0.12.1.
After reverting the servers and clients back to 0.12.1, the behaviour returned to normal: stopping an alloc makes Nomad start exactly 1 new alloc.
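
For reference, a minimal sketch of a job of the shape described above; the job, group, and task names, the driver, and the image are placeholders, not the actual job in use:

job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "cache" {
    count = 1               # single allocation, as described above

    task "app" {
      driver = "docker"     # placeholder driver and image
      config {
        image = "redis:6"
      }
    }
  }
}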

@notnoop
Contributor

notnoop commented Sep 10, 2020

Hi @roman-vynar! Thanks for reaching out! I'm investigating this but am sadly unable to reproduce it - can you please provide more detailed instructions along with sample output and logs?

Here is my attempt at reproduction - note that it only has a single running alloc at the end.

Script
mars-2:aa notnoop$ nomad job init --short
Example job file written to example.nomad
mars-2:aa notnoop$ nomad job run ./example.nomad
==> Monitoring evaluation "dd48655b"
    Evaluation triggered by job "example"
    Allocation "1c3f8de9" created: node "ce7fbaff", group "cache"
    Evaluation within deployment: "5237c8eb"
    Allocation "1c3f8de9" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "dd48655b" finished with status "complete"
mars-2:aa notnoop$ nomad job status
ID       Type     Priority  Status   Submit Date
example  service  50        running  2020-09-10T09:18:13-04:00
mars-2:aa notnoop$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2020-09-10T09:18:13-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         0

Latest Deployment
ID          = 5237c8eb
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       1        0          2020-09-10T09:28:23-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
1c3f8de9  ce7fbaff  cache       0        run      running  12s ago  1s ago
mars-2:aa notnoop$ nomad alloc stop 1c3f8de9
==> Monitoring evaluation "141f5057"
    Evaluation triggered by job "example"
    Allocation "c08b66d5" created: node "ce7fbaff", group "cache"
    Evaluation within deployment: "5237c8eb"
    Allocation "c08b66d5" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "141f5057" finished with status "complete"
mars-2:aa notnoop$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2020-09-10T09:18:13-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       1         0

Latest Deployment
ID          = 5237c8eb
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       1        0          2020-09-10T09:28:23-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
c08b66d5  ce7fbaff  cache       0        run      running   9s ago   8s ago
1c3f8de9  ce7fbaff  cache       0        stop     complete  34s ago  8s ago

@roman-vynar
Contributor Author

Back on 0.12.4. It is consistently reproducible:

$ nomad node status -verbose
ID                                    DC     Name              Class   Address    Version  Drain  Eligibility  Status
d3e83667-d812-69f8-b279-4352e115aafd  roman  nomad-10-0-7-219  <none>  127.0.0.1  0.12.4   false  eligible     ready
866b33a1-6826-3ac4-19a9-24ac4413b9c5  roman  nomad-10-0-7-165  <none>  127.0.0.1  0.12.4   false  eligible     ready
c0b8877e-6925-ae7b-bb19-103594acce97  roman  nomad-10-0-7-233  <none>  127.0.0.1  0.12.4   false  eligible     ready
991c40ba-40d9-73bc-f642-37562934002d  roman  nomad-10-0-7-171  <none>  127.0.0.1  0.12.4   false  eligible     ready
c4db3ccb-e9c3-f451-2812-f844935cbf98  roman  nomad-10-0-7-87   <none>  127.0.0.1  0.12.4   false  eligible     ready
$ nomad job status ax-man
ID            = ax-man
Name          = ax-man
Submit Date   = 2020-09-10T16:07:59+03:00
Type          = service
Priority      = 50
Datacenters   = roman
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         1        0       16        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
9bf9521e  c0b8877e  ax-man      59       run      running  33m28s ago  33m13s ago
$ nomad alloc stop 9bf9521e
==> Monitoring evaluation "bc8db7f9"
    Evaluation triggered by job "ax-man"
    Allocation "1769c470" created: node "c0b8877e", group "ax-man"
    Allocation "d6c7aecc" created: node "c4db3ccb", group "ax-man"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "bc8db7f9" finished with status "complete"
$ nomad job status ax-man
ID            = ax-man
Name          = ax-man
Submit Date   = 2020-09-10T16:07:59+03:00
Type          = service
Priority      = 50
Datacenters   = roman
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         2        0       17        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
1769c470  c0b8877e  ax-man      59       run      running   13s ago     13s ago
d6c7aecc  c4db3ccb  ax-man      59       run      running   13s ago     13s ago
9bf9521e  c0b8877e  ax-man      59       stop     complete  34m24s ago  13s ago

Notice the following messages from the stop command:

    Allocation "1769c470" created: node "c0b8877e", group "ax-man"
    Allocation "d6c7aecc" created: node "c4db3ccb", group "ax-man"

Relevant parts of the job definition:

  update {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
    progress_deadline = "10m"
    auto_revert = false
    auto_promote = true
    canary = 1
  }

    count = 1

    restart {
      attempts = 0
      interval = "30m"
      delay = "30s"
      mode = "fail"
    }

    reschedule {
      unlimited      = true
      delay          = "15s"
      delay_function = "exponential"
      max_delay      = "5m"
    }

      driver = "docker"
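
For orientation, the fragments above sit at different nesting levels of the job file. A minimal assembled sketch, with the task name and docker config filled in as placeholders rather than taken from the actual spec, would be:

job "ax-man" {
  datacenters = ["roman"]
  type        = "service"

  update {                        # canary-based updates with auto-promotion
    max_parallel      = 1
    health_check      = "checks"
    min_healthy_time  = "10s"
    healthy_deadline  = "5m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = true
    canary            = 1
  }

  group "ax-man" {
    count = 1                     # single allocation

    restart {
      attempts = 0
      interval = "30m"
      delay    = "30s"
      mode     = "fail"
    }

    reschedule {
      unlimited      = true
      delay          = "15s"
      delay_function = "exponential"
      max_delay      = "5m"
    }

    task "ax-man" {
      driver = "docker"           # placeholder task name and config
      config {
        image = "example/ax-man:latest"
      }
    }
  }
}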

5 minutes later, both new allocations are still running:

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         2        0       17        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
1769c470  c0b8877e  ax-man      59       run      running   5m14s ago   5m1s ago
d6c7aecc  c4db3ccb  ax-man      59       run      running   5m14s ago   5m2s ago
9bf9521e  c0b8877e  ax-man      59       stop     complete  39m25s ago  5m14s ago

Please let me know if you need anything else.
Thanks for the quick response!

@notnoop
Contributor

notnoop commented Sep 10, 2020

Thank you very much for the report. I have confirmed the problem and pushed a fix that rolls back to the old canary behavior to avoid this regression. It seems to disproportionately affect single-alloc service jobs with canary deployments.
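
In other words, the affected shape is a single-allocation group whose update stanza uses canaries, along these lines (a minimal illustration with a placeholder group name, not the exact trigger conditions in the fix):

  update {
    canary       = 1      # canary deployments enabled
    auto_promote = true
  }

  group "app" {
    count = 1             # single-alloc service job
  }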

@notnoop notnoop added stage/accepted Confirmed, and intend to work on. No timeline commitment though. and removed stage/needs-investigation labels Sep 10, 2020
@github-actions

github-actions bot commented Nov 2, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 2, 2022