
Nomad 0.12.4 stop allocation issue #8866

Closed
roman-vynar opened this issue Sep 10, 2020 · 4 comments · Fixed by #8867
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/scheduling type/bug

Comments

@roman-vynar
Contributor

Nomad version

0.12.4 (both servers/clients)

Issue

I have a "service" type job running with count=1.
When I stop an alloc, Nomad starts 2 new allocs instead of 1.

It was fine on 0.12.1.
After reverting the servers and clients back to 0.12.1, the behaviour returned to normal: stopping an alloc makes Nomad start exactly 1 new alloc.
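
For reference, a minimal sketch of a job of the shape described above; the job, group, and task names, the driver, and the image are placeholders, not the actual job in use:

job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "cache" {
    count = 1               # single allocation, as described above

    task "app" {
      driver = "docker"     # placeholder driver and image
      config {
        image = "redis:6"
      }
    }
  }
}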

@notnoop
Contributor

notnoop commented Sep 10, 2020

Hi @roman-vynar! Thanks for reaching out! I'm investigating this but am sadly unable to reproduce it - can you please provide more detailed instructions along with sample output and logs?

Here is my attempt at reproduction - note that it only has a single running alloc at the end.

Script
mars-2:aa notnoop$ nomad job init --short
Example job file written to example.nomad
mars-2:aa notnoop$ nomad job run ./example.nomad
==> Monitoring evaluation "dd48655b"
    Evaluation triggered by job "example"
    Allocation "1c3f8de9" created: node "ce7fbaff", group "cache"
    Evaluation within deployment: "5237c8eb"
    Allocation "1c3f8de9" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "dd48655b" finished with status "complete"
mars-2:aa notnoop$ nomad job status
ID       Type     Priority  Status   Submit Date
example  service  50        running  2020-09-10T09:18:13-04:00
mars-2:aa notnoop$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2020-09-10T09:18:13-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         0

Latest Deployment
ID          = 5237c8eb
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       1        0          2020-09-10T09:28:23-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
1c3f8de9  ce7fbaff  cache       0        run      running  12s ago  1s ago
mars-2:aa notnoop$ nomad alloc stop 1c3f8de9
==> Monitoring evaluation "141f5057"
    Evaluation triggered by job "example"
    Allocation "c08b66d5" created: node "ce7fbaff", group "cache"
    Evaluation within deployment: "5237c8eb"
    Allocation "c08b66d5" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "141f5057" finished with status "complete"
mars-2:aa notnoop$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2020-09-10T09:18:13-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       1         0

Latest Deployment
ID          = 5237c8eb
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        1       1        0          2020-09-10T09:28:23-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
c08b66d5  ce7fbaff  cache       0        run      running   9s ago   8s ago
1c3f8de9  ce7fbaff  cache       0        stop     complete  34s ago  8s ago

@roman-vynar
Contributor Author

Back on 0.12.4. It is consistently reproducible:

$ nomad node status -verbose
ID                                    DC     Name              Class   Address    Version  Drain  Eligibility  Status
d3e83667-d812-69f8-b279-4352e115aafd  roman  nomad-10-0-7-219  <none>  127.0.0.1  0.12.4   false  eligible     ready
866b33a1-6826-3ac4-19a9-24ac4413b9c5  roman  nomad-10-0-7-165  <none>  127.0.0.1  0.12.4   false  eligible     ready
c0b8877e-6925-ae7b-bb19-103594acce97  roman  nomad-10-0-7-233  <none>  127.0.0.1  0.12.4   false  eligible     ready
991c40ba-40d9-73bc-f642-37562934002d  roman  nomad-10-0-7-171  <none>  127.0.0.1  0.12.4   false  eligible     ready
c4db3ccb-e9c3-f451-2812-f844935cbf98  roman  nomad-10-0-7-87   <none>  127.0.0.1  0.12.4   false  eligible     ready
$ nomad job status ax-man
ID            = ax-man
Name          = ax-man
Submit Date   = 2020-09-10T16:07:59+03:00
Type          = service
Priority      = 50
Datacenters   = roman
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         1        0       16        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
9bf9521e  c0b8877e  ax-man      59       run      running  33m28s ago  33m13s ago
$ nomad alloc stop 9bf9521e
==> Monitoring evaluation "bc8db7f9"
    Evaluation triggered by job "ax-man"
    Allocation "1769c470" created: node "c0b8877e", group "ax-man"
    Allocation "d6c7aecc" created: node "c4db3ccb", group "ax-man"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "bc8db7f9" finished with status "complete"
$ nomad job status ax-man
ID            = ax-man
Name          = ax-man
Submit Date   = 2020-09-10T16:07:59+03:00
Type          = service
Priority      = 50
Datacenters   = roman
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         2        0       17        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
1769c470  c0b8877e  ax-man      59       run      running   13s ago     13s ago
d6c7aecc  c4db3ccb  ax-man      59       run      running   13s ago     13s ago
9bf9521e  c0b8877e  ax-man      59       stop     complete  34m24s ago  13s ago

Notice the following messages from the stop command:

    Allocation "1769c470" created: node "c0b8877e", group "ax-man"
    Allocation "d6c7aecc" created: node "c4db3ccb", group "ax-man"

Relevant parts of the job definition:

  update {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
    progress_deadline = "10m"
    auto_revert = false
    auto_promote = true
    canary = 1
  }

    count = 1

    restart {
      attempts = 0
      interval = "30m"
      delay = "30s"
      mode = "fail"
    }

    reschedule {
      unlimited      = true
      delay          = "15s"
      delay_function = "exponential"
      max_delay      = "5m"
    }

      driver = "docker"
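
For orientation, the fragments above sit at different nesting levels of the job file. A minimal assembled sketch, with the task name and docker config filled in as placeholders rather than taken from the actual spec, would be:

job "ax-man" {
  datacenters = ["roman"]
  type        = "service"

  update {                        # canary-based updates with auto-promotion
    max_parallel      = 1
    health_check      = "checks"
    min_healthy_time  = "10s"
    healthy_deadline  = "5m"
    progress_deadline = "10m"
    auto_revert       = false
    auto_promote      = true
    canary            = 1
  }

  group "ax-man" {
    count = 1                     # single allocation

    restart {
      attempts = 0
      interval = "30m"
      delay    = "30s"
      mode     = "fail"
    }

    reschedule {
      unlimited      = true
      delay          = "15s"
      delay_function = "exponential"
      max_delay      = "5m"
    }

    task "ax-man" {
      driver = "docker"           # placeholder task name and config
      config {
        image = "example/ax-man:latest"
      }
    }
  }
}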

5 minutes later, both new allocations are still running:

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ax-man      0       0         2        0       17        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
1769c470  c0b8877e  ax-man      59       run      running   5m14s ago   5m1s ago
d6c7aecc  c4db3ccb  ax-man      59       run      running   5m14s ago   5m2s ago
9bf9521e  c0b8877e  ax-man      59       stop     complete  39m25s ago  5m14s ago

Please let me know if you need anything else.
Thanks for the quick response!

@notnoop
Contributor

notnoop commented Sep 10, 2020

Thank you very much for the report. I have confirmed the problem and pushed a fix that rolls back to the old canary behavior to avoid this regression. It seems to disproportionately affect single-alloc service jobs with canary deployments.
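
In other words, the affected shape is a single-allocation group whose update stanza uses canaries, along these lines (a minimal illustration with a placeholder group name, not the exact trigger conditions in the fix):

  update {
    canary       = 1      # canary deployments enabled
    auto_promote = true
  }

  group "app" {
    count = 1             # single-alloc service job
  }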

@notnoop notnoop added stage/accepted Confirmed, and intend to work on. No timeline commitment though. and removed stage/needs-investigation labels Sep 10, 2020
@github-actions

github-actions bot commented Nov 2, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 2, 2022