Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of drain: use client status to determine drain is complete into release/1.4.x #16879

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #14348 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

WARNING automatic cherry-pick of commits failed. Commits will require human attention.

The below text is copied from the body of the original PR.


Fixes #14293 #12915 #9902 and #16117

If an allocation is slow to stop because of kill_timeout or shutdown_delay,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining


$ go test -v -count=1 ./nodedrain -run TestNodeDrain
=== RUN   TestNodeDrain
=== RUN   TestNodeDrain/IgnoreSystem
    node_drain_test.go:394: alloc has drained from node=168ea537 after 17.504451661s
=== RUN   TestNodeDrain/EphemeralMigrate
    node_drain_test.go:394: alloc has drained from node=e85597df after 17.014548575s
=== RUN   TestNodeDrain/KeepIneligible
=== RUN   TestNodeDrain/KillTimeout
    node_drain_test.go:204: draining node 4a75dad8-c862-784d-7bc6-bb6325b2b41d
    node_drain_test.go:215: waiting for kill_timeout to expire
    node_drain_test.go:228: waiting for migration to complete
    node_drain_test.go:394: alloc has drained from node=4a75dad8 after 17.018613647s
=== RUN   TestNodeDrain/DeadlineFlag
    node_drain_test.go:261: draining nodes e85597df-2f1c-ae83-168b-d4d8719a47a6, e85597df-2f1c-ae83-168b-d4d8719a47a6
    node_drain_test.go:284: waiting for old allocs to stop
    node_drain_test.go:287: waiting for running allocs
=== RUN   TestNodeDrain/ForceFlag
    node_drain_test.go:308: draining nodes 4a75dad8-c862-784d-7bc6-bb6325b2b41d, 168ea537-1b9e-7ec8-b2c2-53064ce427d8
    node_drain_test.go:329: waiting for old allocs to stop
    node_drain_test.go:332: waiting for running allocs
--- PASS: TestNodeDrain (112.08s)
    --- PASS: TestNodeDrain/IgnoreSystem (22.05s)
    --- PASS: TestNodeDrain/EphemeralMigrate (21.46s)
    --- PASS: TestNodeDrain/KeepIneligible (1.26s)
    --- PASS: TestNodeDrain/KillTimeout (44.69s)
    --- PASS: TestNodeDrain/DeadlineFlag (12.01s)
    --- PASS: TestNodeDrain/ForceFlag (10.42s)
PASS
ok      github.com/hashicorp/nomad/e2e/nodedrain        112.084s

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/node-drain-early-complete/illegally-relieved-goshawk branch from 0dd3fe6 to d442087 Compare April 13, 2023 12:55
@hashicorp-cla
Copy link

hashicorp-cla commented Apr 13, 2023

CLA assistant check
All committers have signed the CLA.

If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.
@tgross tgross force-pushed the backport/node-drain-early-complete/illegally-relieved-goshawk branch from 42bef9c to c620d97 Compare April 13, 2023 13:05
@tgross tgross merged commit 33e6070 into release/1.4.x Apr 13, 2023
@tgross tgross deleted the backport/node-drain-early-complete/illegally-relieved-goshawk branch April 13, 2023 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants