
scheduler: system scheduler should reconcile overlapping live allocs #16097

Draft: wants to merge 1 commit into base: main

Conversation

@tgross (Member) commented Feb 8, 2023

Under conditions that are not yet understood, it's possible for there to be overlapping live system job allocations on the same node. The scheduler does not reconcile this condition by stopping the extra allocations, and the extras do not show up in the plan either.

When allocations need exclusive resources such as reserved ports, this results in the plan applier rejecting the plan. As far as the scheduler knows, the plan is good, and because this is a system job, the scheduler will repeatedly submit this bad plan. This results in the dreaded "plan for node rejected" loop. Although we should definitely figure out why we get into this bad state to begin with, it's the scheduler's job to properly reconcile the cluster state even in the face of errors.

This changeset includes two major chunks of work:

  • Add a new ensureMaxSystemAllocCount function that enforces the invariant that there can be only a single desired-running allocation for a given system job on a given node (or a number equal to count for sysbatch jobs); see the sketch after this list.
  • The tests for the system allocs reconciling code path (diffSystemAllocs) include many impossible test environments, such as passing allocs for the wrong node into the function. This makes the test assertions nonsensical for walking yourself through the correct behavior. This changeset breaks up a couple of tests, expands test coverage, and makes the test assertions clearer.
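
To make the invariant concrete, here is a minimal sketch under simplified assumptions: `Alloc` and `keepNewest` are hypothetical stand-ins, not the PR's actual `ensureMaxSystemAllocCount` or Nomad's `structs.Allocation`. The idea is to keep the newest `max` desired-running allocs for a given (job, node) pair and hand back the extras to be stopped.

```go
package main

import (
	"fmt"
	"sort"
)

// Alloc is a simplified stand-in for Nomad's structs.Allocation.
type Alloc struct {
	ID          string
	CreateIndex uint64 // raft index; higher means created more recently
}

// keepNewest enforces the invariant: at most max desired-running allocs
// for a given system job on a given node. max would be 1 for system
// jobs and the task group's count for sysbatch jobs. The extras are
// returned so the scheduler can mark them for stopping.
func keepNewest(allocs []Alloc, max int) (keep, stop []Alloc) {
	// Sort newest first so the allocs we keep are the most recent.
	sort.Slice(allocs, func(i, j int) bool {
		return allocs[i].CreateIndex > allocs[j].CreateIndex
	})
	if len(allocs) <= max {
		return allocs, nil
	}
	return allocs[:max], allocs[max:]
}

func main() {
	keep, stop := keepNewest([]Alloc{{"a", 10}, {"b", 12}, {"c", 11}}, 1)
	fmt.Println(keep, stop) // [{b 12}] [{c 11} {a 10}]
}
```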

Note this doesn't completely close out the Plan For Node Rejected bug class, because we're pretty sure this can also happen with service jobs. I haven't tracked that down yet, but this PR should eliminate a large chunk of known problems we've seen on some large clusters.

Note for reviewers: you might want to read this commit-by-commit.
Closes https://github.com/hashicorp/team-nomad/issues/347

@shoenig (Member) left a comment:

(will continue tmrw)

@shoenig (Member) left a comment:

LGTM!

@lgfa29 (Contributor) left a comment:

Great investigation! I think the change to materializeSystemTaskGroups introduces a bug for sysbatch jobs.

@lgfa29 (Contributor) commented Feb 9, 2023

> Although we should definitely figure out why we get into this bad state to begin with

I don't know if it's related, but #11052 reports a problem with overlapping allocs during leader election.

@schmichael (Member) left a comment:

Fantastic work. I had a big long comment about being defensive in the scheduler vs. being defensive in RPCs and cleaning up bad state in the FSM, but I hit the back button and lost it all...

...the thrilling conclusion was that your approach in this PR seems like the optimal approach to me: bad state is caught by the scheduler, so no matter how much work we do to try to prevent it from existing, being defensive in the scheduler is always attacking the problem where it impacts the cluster.

Comment on lines +147 to +153
// If the job is batch and finished successfully (but not yet marked
// terminal), the fact that the node is tainted does not mean it
@schmichael (Member) commented:

Huh, I'm surprised this is possible because of: https://github.com/hashicorp/nomad/blob/v1.5.0-beta.1/client/allocrunner/alloc_runner.go#L812-L843

When sending alloc updates from client->server, the client computes the alloc's ClientStatus based on the state of every task, precisely so that from the scheduler's perspective "all tasks completed -> alloc completed" happens as one atomic update and is therefore an invariant.

This suggests that state transition isn't atomic/invariant... I don't mind being defensive against misbehaving clients here (I think we should do that more!), but it does make me wonder if we should rethink the approach the linked client code takes entirely... perhaps the FSM should also enforce the invariant that all task states being terminal == the alloc's client status is terminal?
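
For illustration, here is a rough sketch of what that FSM-side enforcement could look like. All the types and the `enforceTerminalInvariant` helper are simplified, hypothetical stand-ins, not Nomad's actual FSM code or struct definitions.

```go
// Package sketch: hypothetical FSM-side invariant enforcement.
package sketch

// TaskState is a simplified stand-in for Nomad's task state.
type TaskState struct {
	State  string // e.g. "pending", "running", "dead"
	Failed bool
}

// Allocation is a simplified stand-in for Nomad's allocation struct.
type Allocation struct {
	ClientStatus string // e.g. "pending", "running", "complete", "failed"
	TaskStates   map[string]*TaskState
}

// enforceTerminalInvariant forces a terminal ClientStatus whenever every
// task state is terminal, rather than trusting the client's update to
// have set both atomically.
func enforceTerminalInvariant(alloc *Allocation) {
	if len(alloc.TaskStates) == 0 {
		return
	}
	failed := false
	for _, ts := range alloc.TaskStates {
		if ts.State != "dead" {
			return // a task is still live; the invariant doesn't apply yet
		}
		failed = failed || ts.Failed
	}
	if alloc.ClientStatus == "running" || alloc.ClientStatus == "pending" {
		if failed {
			alloc.ClientStatus = "failed"
		} else {
			alloc.ClientStatus = "complete"
		}
	}
}
```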

@tgross (Member, Author) commented:

It's entirely possible that this state is unreachable. This block was introduced at the same time as the dead code I removed above, which assumes we can have terminal allocations here at all, which we can't. I added the "(but not yet marked terminal)" parenthetical here because I assumed it was correct, but on reflection maybe it's unreachable outside of unit tests.

@tgross (Member, Author) commented:

Returning to this: I think we should probably address it, but I want to pull that out into a separate PR, just because it's not obviously correct to remove this and I don't want to muddy up this commit with that change if we have to git bisect it later.

@tgross tgross self-assigned this Feb 9, 2023
@lgfa29 (Contributor) left a comment:

After talking about it some more, the count behaviour for sysbatch jobs is not well specified, so we'll need to handle it in follow-up PRs.

Comment on lines +358 to +357
// updated allocs are for updates of the job (in-place or destructive), so
// we can pick the most recent one and stop all the rest of the running
// allocs on the node
if len(result.update) > 0 {
@tgross (Member, Author) commented:

Something has been bugging me about this PR for the last few days and I finally realized what it was: we determine destructive vs. non-destructive updates after this per-node call is made, and then limit the allocation changes based on MaxParallel.

If we treat non-destructive and destructive updates the same here, I'm fairly certain we'll break the system jobs' not-really-deployments feature of MaxParallel/Stagger, which makes decisions based only on the destructive updates. For example, suppose we have 3 nodes with MaxParallel = 1. One of the nodes has both a destructive and a non-destructive update to make, but the alloc getting the non-destructive update is newer. In a single pass, we could end up stopping the alloc with the destructive update, updating the alloc with the non-destructive update, and updating an alloc on another node with a destructive update. That'd be more than MaxParallel = 1 destructive updates (inasmuch as a "stop" is destructive).

I think we need to move the destructive vs. non-destructive check into the diff code and have separate diff fields for destructive and non-destructive updates (see the sketch below). But that'll ripple out to the generic scheduler as well. I'm going to need a little bit of time to get this locked down, so I think it's going to have to miss 1.5.0 GA.
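
As a sketch of that direction (the type and field names below are hypothetical, not the scheduler's actual diff code): split the single update bucket into in-place and destructive fields, so the MaxParallel cap is applied only to the destructive set.

```go
// Package sketch: hypothetical split of the diff's update bucket.
package sketch

// allocTuple is a stand-in for the scheduler's internal alloc pair type.
type allocTuple struct {
	AllocID string
}

// diffResult is a hypothetical diff shape with updates split by kind.
type diffResult struct {
	place             []allocTuple
	stop              []allocTuple
	inplaceUpdate     []allocTuple // non-destructive; not throttled
	destructiveUpdate []allocTuple // stop + place; counts against MaxParallel
}

// limitDestructive caps destructive updates at maxParallel, deferring
// the remainder to a later evaluation. In-place updates pass through
// untouched, preserving the MaxParallel/Stagger semantics.
func limitDestructive(d *diffResult, maxParallel int) (now, deferred []allocTuple) {
	if maxParallel <= 0 || len(d.destructiveUpdate) <= maxParallel {
		return d.destructiveUpdate, nil
	}
	return d.destructiveUpdate[:maxParallel], d.destructiveUpdate[maxParallel:]
}
```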

@tgross (Member, Author) commented Mar 8, 2023

The root cause of the original bug here will be solved in #16401. So I'm going to pull the test improvements out of this PR to land, and then see about tackling the diff changes as a second body of work when I next dig into the reconciler code.

tgross added a commit that referenced this pull request Mar 9, 2023
The tests for the system allocs reconciling code path (`diffSystemAllocs`)
include many impossible test environments, such as passing allocs for the wrong
node into the function. This makes the test assertions nonsensical for use in
walking yourself through the correct behavior.

I've pulled this changeset out of PR #16097 so that we can merge these
improvements and revisit the right approach to fix the problem in #16097 with
less urgency now that the PFNR bug fix has been merged. This changeset breaks up
a couple of tests, expands test coverage, and makes test assertions more
clear. It also corrects one bit of production code that behaves fine in
production because of canonicalization, but forces us to remember to set values
in tests to compensate.
tgross added a commit that referenced this pull request Mar 13, 2023
The system scheduler uses a separate code path for reconciliation. During the
investigation into the "plan for node rejected" bug which was fixed in #16401,
it was discovered this code path doesn't maintain the invariant that no more
than 1 allocation per system job task group (or `count` allocations for sysbatch
jobs) should be left running on a given client. While this condition should be
impossible to encounter, the scheduler should be reconciling these cases.

Add a new `ensureSingleSystemAlloc` function that enforces the invariant that
there can be only a single desired-running allocation for a given system job on
a given node.
@tgross tgross modified the milestones: 1.5.x, 1.6.x Jun 23, 2023
@tgross tgross modified the milestones: 1.6.x, 1.7.x Oct 27, 2023
@tgross tgross removed their assignment Oct 27, 2023
@tgross tgross removed the backport/1.3.x, backport/1.4.x, and backport/1.5.x labels Feb 9, 2024
@tgross tgross removed this from the 1.7.x milestone Feb 12, 2024
@tgross tgross added the stage/needs-rebase label ("This PR needs to be rebased on main before it can be backported to pick up new BPA workflows") May 17, 2024
Labels
stage/needs-rebase, theme/scheduling, theme/system-scheduler, type/bug