Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drain: use client status to determine drain is complete #14348

Merged
merged 1 commit into from
Apr 13, 2023

Conversation

tgross
Copy link
Member

@tgross tgross commented Aug 26, 2022

Fixes #14293 #12915 #9902 and #16117

If an allocation is slow to stop because of kill_timeout or shutdown_delay,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining


$ go test -v -count=1 ./nodedrain -run TestNodeDrain
=== RUN   TestNodeDrain
=== RUN   TestNodeDrain/IgnoreSystem
    node_drain_test.go:394: alloc has drained from node=168ea537 after 17.504451661s
=== RUN   TestNodeDrain/EphemeralMigrate
    node_drain_test.go:394: alloc has drained from node=e85597df after 17.014548575s
=== RUN   TestNodeDrain/KeepIneligible
=== RUN   TestNodeDrain/KillTimeout
    node_drain_test.go:204: draining node 4a75dad8-c862-784d-7bc6-bb6325b2b41d
    node_drain_test.go:215: waiting for kill_timeout to expire
    node_drain_test.go:228: waiting for migration to complete
    node_drain_test.go:394: alloc has drained from node=4a75dad8 after 17.018613647s
=== RUN   TestNodeDrain/DeadlineFlag
    node_drain_test.go:261: draining nodes e85597df-2f1c-ae83-168b-d4d8719a47a6, e85597df-2f1c-ae83-168b-d4d8719a47a6
    node_drain_test.go:284: waiting for old allocs to stop
    node_drain_test.go:287: waiting for running allocs
=== RUN   TestNodeDrain/ForceFlag
    node_drain_test.go:308: draining nodes 4a75dad8-c862-784d-7bc6-bb6325b2b41d, 168ea537-1b9e-7ec8-b2c2-53064ce427d8
    node_drain_test.go:329: waiting for old allocs to stop
    node_drain_test.go:332: waiting for running allocs
--- PASS: TestNodeDrain (112.08s)
    --- PASS: TestNodeDrain/IgnoreSystem (22.05s)
    --- PASS: TestNodeDrain/EphemeralMigrate (21.46s)
    --- PASS: TestNodeDrain/KeepIneligible (1.26s)
    --- PASS: TestNodeDrain/KillTimeout (44.69s)
    --- PASS: TestNodeDrain/DeadlineFlag (12.01s)
    --- PASS: TestNodeDrain/ForceFlag (10.42s)
PASS
ok      github.com/hashicorp/nomad/e2e/nodedrain        112.084s

@tgross tgross linked an issue Aug 26, 2022 that may be closed by this pull request
@tgross tgross self-assigned this Aug 26, 2022
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 26, 2022
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Aug 26, 2022
@tgross tgross added this to the 1.3.x milestone Aug 26, 2022
@tgross tgross modified the milestones: 1.3.x, 1.4.0 Aug 30, 2022
@tgross tgross modified the milestones: 1.4.0, 1.4.x Aug 30, 2022
@tgross tgross removed this from the 1.4.x milestone Jan 20, 2023
@tgross tgross added this to the 1.5.x milestone Mar 2, 2023
If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tgross tgross added backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line labels Apr 13, 2023
@tgross tgross merged commit f91bf84 into main Apr 13, 2023
Nomad - Community Issues Triage automation moved this from In Progress to Done Apr 13, 2023
@tgross tgross deleted the node-drain-early-complete branch April 13, 2023 12:55
tgross added a commit that referenced this pull request Apr 13, 2023
If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.
tgross added a commit that referenced this pull request Apr 13, 2023
If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.
tgross added a commit that referenced this pull request Apr 13, 2023
If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.
tgross added a commit that referenced this pull request Apr 13, 2023
…ete) (#16880)

If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
tgross added a commit that referenced this pull request Apr 13, 2023
…ete) (#16879)

If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
tgross added a commit that referenced this pull request Apr 13, 2023
)

If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.

This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line theme/drain type/bug
Projects
2 participants