Clarify node drain deadline interaction with task kill_timeout #9902

Closed
notnoop opened this issue Jan 27, 2021 · 1 comment · Fixed by #16868
Labels: theme/docs (Documentation issues and enhancements), theme/drain

notnoop commented Jan 27, 2021

The node drain doc indicates that allocations still remaining after the deadline will be forcibly removed from the node. This hints that the allocations will be killed quickly, yet in practice the allocations are signaled to shut down and wait the entire kill_timeout interval before stopping, and only then does the replacement alloc start.

We ought to clarify what the expected behavior is, document it, and/or fix it if we decide that the current behavior is lacking. Also, update the nodedrain e2e tests disabled in #9903.

Steps

  1. Start a multi-client cluster
  2. Submit drain_deadline.nomad
  3. Run nomad node drain -enable -deadline 10s -yes <node_id_where_alloc_is_running>
  4. Watch the alloc shut down; it takes 8 minutes before it stops. Note that the replacement alloc isn't marked as received by the other client until the first alloc dies.

Sample script output:
$ nomad node status cbb5c9f2
ID              = cbb5c9f2-18b8-b1db-9453-3bca73aa5d49
Name            = ip-172-31-10-191
Class           = <none>
DC              = dc2
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 1h6m45s
Host Volumes    = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,java,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2021-01-26T19:21:46Z  Drain      Node drain complete
2021-01-26T19:21:40Z  Drain      Node drain strategy set
2021-01-26T19:02:32Z  Drain      Node drain complete
2021-01-26T19:02:31Z  Drain      Node drain strategy set
2021-01-26T18:54:35Z  Cluster    Node registered

Allocated Resources
CPU           Memory           Disk
256/5000 MHz  128 MiB/3.8 GiB  300 MiB/4.7 GiB

Allocation Resource Utilization
CPU         Memory
0/5000 MHz  0 B/3.8 GiB

Host Resource Utilization
CPU         Memory           Disk
0/5000 MHz  272 MiB/3.8 GiB  3.0 GiB/7.7 GiB

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
381bd26b  cbb5c9f2  group       0        run      running   4m8s ago    1m56s ago
f03d98f1  cbb5c9f2  group       0        stop     complete  38m31s ago  36m22s ago
39d2d084  cbb5c9f2  group       0        stop     complete  57m42s ago  55m36s ago
$ nomad alloc status 381bd26b
ID                  = 381bd26b-b369-288d-38f8-a161111f870a
Eval ID             = 547c6bf0
Name                = drain_deadline.group[0]
Node ID             = cbb5c9f2
Node Name           = ip-172-31-10-191
Job ID              = drain_deadline
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 4m26s ago
Modified            = 2m14s ago

Task "task" is "running"
Task Resources
CPU        Memory       Disk     Addresses
0/256 MHz  0 B/128 MiB  300 MiB

Task Events:
Started At     = 2021-01-26T19:58:02Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-01-26T14:58:02-05:00  Started     Task started by client
2021-01-26T14:58:02-05:00  Driver      Downloading image
2021-01-26T14:58:02-05:00  Task Setup  Building Task Directory
2021-01-26T14:56:01-05:00  Received    Task received by client
$ nomad node drain -deadline 10s -detach -enable -yes cbb5c9f2
Node "cbb5c9f2-18b8-b1db-9453-3bca73aa5d49" drain strategy set
$ nomad alloc status 381bd26b
ID                   = 381bd26b-b369-288d-38f8-a161111f870a
Eval ID              = 547c6bf0
Name                 = drain_deadline.group[0]
Node ID              = cbb5c9f2
Node Name            = ip-172-31-10-191
Job ID               = drain_deadline
Job Version          = 0
Client Status        = running
Client Description   = Tasks are running
Desired Status       = stop
Desired Description  = alloc is being migrated
Created              = 5m ago
Modified             = 4s ago
Replacement Alloc ID = 94348cf3

Task "task" is "running"
Task Resources
CPU        Memory       Disk     Addresses
0/256 MHz  0 B/128 MiB  300 MiB

Task Events:
Started At     = 2021-01-26T19:58:02Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-01-26T15:00:56-05:00  Killing     Sent interrupt. Waiting 2m0s before force killing
2021-01-26T14:58:02-05:00  Started     Task started by client
2021-01-26T14:58:02-05:00  Driver      Downloading image
2021-01-26T14:58:02-05:00  Task Setup  Building Task Directory
2021-01-26T14:56:01-05:00  Received    Task received by client
$ nomad alloc status 94348cf3
ID                  = 94348cf3-cec4-6ee5-911e-a01a8b3ed017
Eval ID             = 14460995
Name                = drain_deadline.group[0]
Node ID             = 7e3dc5e8
Node Name           = ip-172-31-14-97
Job ID              = drain_deadline
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 11s ago
Modified            = 11s ago
$ nomad alloc status 381bd26b
ID                   = 381bd26b-b369-288d-38f8-a161111f870a
Eval ID              = 547c6bf0
Name                 = drain_deadline.group[0]
Node ID              = cbb5c9f2
Node Name            = ip-172-31-10-191
Job ID               = drain_deadline
Job Version          = 0
Client Status        = complete
Client Description   = All tasks have completed
Desired Status       = stop
Desired Description  = alloc is being migrated
Created              = 7m30s ago
Modified             = 34s ago
Replacement Alloc ID = 94348cf3

Task "task" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/256 MHz  0 B/128 MiB  300 MiB

Task Events:
Started At     = 2021-01-26T19:58:02Z
Finished At    = 2021-01-26T20:02:57Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-01-26T15:02:57-05:00  Killed      Task successfully killed
2021-01-26T15:02:57-05:00  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2021-01-26T15:00:56-05:00  Killing     Sent interrupt. Waiting 2m0s before force killing
2021-01-26T14:58:02-05:00  Started     Task started by client
2021-01-26T14:58:02-05:00  Driver      Downloading image
2021-01-26T14:58:02-05:00  Task Setup  Building Task Directory
2021-01-26T14:56:01-05:00  Received    Task received by client
$ nomad alloc status 94348cf3
ID                  = 94348cf3-cec4-6ee5-911e-a01a8b3ed017
Eval ID             = 14460995
Name                = drain_deadline.group[0]
Node ID             = 7e3dc5e8
Node Name           = ip-172-31-14-97
Job ID              = drain_deadline
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2m37s ago
Modified            = 25s ago

Task "task" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/256 MHz  4.0 KiB/128 MiB  300 MiB

Task Events:
Started At     = 2021-01-26T20:02:58Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-01-26T15:02:58-05:00  Started     Task started by client
2021-01-26T15:02:57-05:00  Driver      Downloading image
2021-01-26T15:02:57-05:00  Task Setup  Building Task Directory
2021-01-26T15:00:56-05:00  Received    Task received by client
tgross moved this from Done to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
tgross added theme/docs (Documentation issues and enhancements) and removed stage/needs-investigation, type/bug labels Jul 8, 2022
tgross added a commit that referenced this issue Apr 6, 2023
While working on several open drain issues, I'm fixing up the E2E tests. This
subset of tests being refactored are existing ones that already work. I'm
shipping these as their own PR to keep review sizes manageable when I push up
PRs in the next few days for #9902, #12314, and #12915.
tgross self-assigned this Apr 12, 2023
tgross moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Apr 12, 2023
Nomad - Community Issues Triage automation moved this from In Progress to Done Apr 12, 2023

tgross commented Apr 12, 2023

Closed by #16823: the -deadline and -force flags of the nomad node drain command only cause the drain to ignore the migrate block's settings (healthy_deadline, max_parallel, etc.). These flags have nothing to do with the kill_timeout or shutdown_delay options of the jobspec.
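
To make the distinction in the comment above concrete, here is a minimal jobspec sketch showing where the two groups of settings live. It is a hypothetical example, not the drain_deadline.nomad file from this issue: the job name, datacenter, image, and all values are assumptions chosen for illustration.

```hcl
# Hypothetical jobspec for illustration only; not the drain_deadline.nomad
# used in this issue. Names and values are assumptions.
job "example" {
  datacenters = ["dc1"]

  group "group" {
    # Drain pacing: these settings control how allocations are migrated
    # off a draining node. Per the comment above, the -deadline and -force
    # flags cause the drain to stop honoring them.
    migrate {
      max_parallel     = 1
      health_check     = "task_states"
      min_healthy_time = "10s"
      healthy_deadline = "5m"
    }

    task "task" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "sleep"
        args    = ["3600"]
      }

      # Task shutdown: these settings control how a task is stopped once
      # its allocation is marked to stop. Per the comment above, the drain
      # flags do not shorten them.
      shutdown_delay = "5s"
      kill_timeout   = "2m"
    }
  }
}
```

With a kill_timeout of "2m", as in the sample output earlier in this issue, a task that does not exit on the interrupt signal would be expected to keep running for about two minutes after the drain marks its allocation to stop, even when the drain deadline is only 10s.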
