
scheduler: system scheduler should reconcile overlapping live allocs #16097

Draft: wants to merge 1 commit into base: main

Conversation

@tgross (Member) commented Feb 8, 2023

Under conditions that are not yet understood, it's possible for there to be overlapping live system job allocations on the same node. The scheduler does not reconcile this condition by stopping the extra allocations, and the extras do not show up in the plan either.

When allocations need exclusive resources such as reserved ports, this results in the plan applier rejecting the plan. As far as the scheduler knows, the plan is good, and because this is a system job, the scheduler will repeatedly submit this bad plan. This results in the dreaded "plan for node rejected" loop. Although we should definitely figure out why we get into this bad state to begin with, it's the scheduler's job to properly reconcile the cluster state even in the face of errors.

This changeset includes two major chunks of work:

  • Add a new ensureMaxSystemAllocCount function that enforces the invariant that there can be only a single desired-running allocation for a given system job on a given node (or a number equal to count for sysbatch jobs); see the sketch after this list.
  • The tests for the system allocs reconciling code path (diffSystemAllocs) include many impossible test environments, such as passing allocs for the wrong node into the function. This makes the test assertions nonsensical for walking yourself through the correct behavior. This changeset breaks up a couple of tests, expands test coverage, and makes the test assertions clearer.
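
To make the invariant concrete, here is a minimal sketch under simplified assumptions: `Alloc` and `keepNewest` are hypothetical stand-ins, not the PR's actual `ensureMaxSystemAllocCount` or Nomad's `structs.Allocation`. The idea is to keep the newest `max` desired-running allocs for a given (job, node) pair and hand back the extras to be stopped.

```go
package main

import (
	"fmt"
	"sort"
)

// Alloc is a simplified stand-in for Nomad's structs.Allocation.
type Alloc struct {
	ID          string
	CreateIndex uint64 // raft index; higher means created more recently
}

// keepNewest enforces the invariant: at most max desired-running allocs
// for a given system job on a given node. max would be 1 for system
// jobs and the task group's count for sysbatch jobs. The extras are
// returned so the scheduler can mark them for stopping.
func keepNewest(allocs []Alloc, max int) (keep, stop []Alloc) {
	// Sort newest first so the allocs we keep are the most recent.
	sort.Slice(allocs, func(i, j int) bool {
		return allocs[i].CreateIndex > allocs[j].CreateIndex
	})
	if len(allocs) <= max {
		return allocs, nil
	}
	return allocs[:max], allocs[max:]
}

func main() {
	keep, stop := keepNewest([]Alloc{{"a", 10}, {"b", 12}, {"c", 11}}, 1)
	fmt.Println(keep, stop) // [{b 12}] [{c 11} {a 10}]
}
```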

Note this doesn't completely close out the Plan For Node Rejected bug class, because we're pretty sure this can also happen with service jobs. I haven't tracked that down yet, but this PR should eliminate a large chunk of known problems we've seen on some large clusters.

Note for reviewers: you might want to read this commit-by-commit.
Closes https://github.com/hashicorp/team-nomad/issues/347

@shoenig (Member) left a comment:

(will continue tmrw)

@shoenig (Member) left a comment:

LGTM!

@lgfa29 (Contributor) left a comment:

Great investigation! I think the change to materializeSystemTaskGroups introduces a bug for sysbatch jobs.

@lgfa29 (Contributor) commented Feb 9, 2023

> Although we should definitely figure out why we get into this bad state to begin with

I don't know if it's related, but #11052 reports a problem with overlapping allocs during leader election.

@schmichael (Member) left a comment:

Fantastic work. I had a big long comment about being defensive in the scheduler vs. being defensive in RPCs and cleaning up bad state in the FSM, but I hit the back button and lost it all...

...the thrilling conclusion was that your approach in this PR seems like the optimal approach to me: bad state is caught by the scheduler, so no matter how much work we do to try to prevent it from existing, being defensive in the scheduler is always attacking the problem where it impacts the cluster.

Comment on lines +147 to +153
// If the job is batch and finished successfully (but not yet marked
// terminal), the fact that the node is tainted does not mean it
@schmichael (Member) commented:

Huh, I'm surprised this is possible because of: https://github.com/hashicorp/nomad/blob/v1.5.0-beta.1/client/allocrunner/alloc_runner.go#L812-L843

When sending alloc updates from client->server, the client computes the alloc's ClientStatus based on the state of every task, precisely so that from the scheduler's perspective "all tasks completed -> alloc completed" happens as one atomic update and is therefore an invariant.

This suggests that state transition isn't atomic/invariant... I don't mind being defensive against misbehaving clients here (I think we should do that more!), but it does make me wonder if we should rethink the approach the linked client code takes entirely... perhaps the FSM should also enforce the invariant that all task states being terminal == the alloc's client status is terminal?
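
For illustration, here is a rough sketch of what that FSM-side enforcement could look like. All the types and the `enforceTerminalInvariant` helper are simplified, hypothetical stand-ins, not Nomad's actual FSM code or struct definitions.

```go
// Package sketch: hypothetical FSM-side invariant enforcement.
package sketch

// TaskState is a simplified stand-in for Nomad's task state.
type TaskState struct {
	State  string // e.g. "pending", "running", "dead"
	Failed bool
}

// Allocation is a simplified stand-in for Nomad's allocation struct.
type Allocation struct {
	ClientStatus string // e.g. "pending", "running", "complete", "failed"
	TaskStates   map[string]*TaskState
}

// enforceTerminalInvariant forces a terminal ClientStatus whenever every
// task state is terminal, rather than trusting the client's update to
// have set both atomically.
func enforceTerminalInvariant(alloc *Allocation) {
	if len(alloc.TaskStates) == 0 {
		return
	}
	failed := false
	for _, ts := range alloc.TaskStates {
		if ts.State != "dead" {
			return // a task is still live; the invariant doesn't apply yet
		}
		failed = failed || ts.Failed
	}
	if alloc.ClientStatus == "running" || alloc.ClientStatus == "pending" {
		if failed {
			alloc.ClientStatus = "failed"
		} else {
			alloc.ClientStatus = "complete"
		}
	}
}
```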

@tgross (Member, Author) commented:

It's entirely possible that this state is unreachable. This block was introduced at the same time as the dead code I removed above, which assumes we can have terminal allocations here at all, which we can't. I added the "(but not yet marked terminal)" parenthetical here because I assumed it was correct, but on reflection maybe it's unreachable outside of unit tests.

@tgross (Member, Author) commented:

Returning to this: I think we should probably address it, but I want to pull that out into a separate PR, just because it's not obviously correct to remove this and I don't want to muddy up this commit with that change if we have to git bisect it later.

@tgross tgross self-assigned this Feb 9, 2023
@lgfa29 (Contributor) left a comment:

After talking about it some more, the count behaviour for sysbatch jobs is not well specified, so we'll need to handle it in follow-up PRs.

Comment on lines +358 to +357
// updated allocs are for updates of the job (in-place or destructive), so
// we can pick the most recent one and stop all the rest of the running
// allocs on the node
if len(result.update) > 0 {
@tgross (Member, Author) commented:

Something has been bugging me about this PR for the last few days and I finally realized what it was: we determine destructive vs. non-destructive updates after this per-node call is made, and then limit the allocation changes based on MaxParallel.

If we treat non-destructive and destructive updates the same here, I'm fairly certain we'll break the system jobs' not-really-deployments feature of MaxParallel/Stagger, which makes decisions based only on the destructive updates. For example, suppose we have 3 nodes with MaxParallel = 1. One of the nodes has both a destructive and a non-destructive update to make, but the alloc getting the non-destructive update is newer. In a single pass, we could end up stopping the alloc with the destructive update, updating the alloc with the non-destructive update, and updating an alloc on another node with a destructive update. That'd be more than MaxParallel = 1 destructive updates (inasmuch as a "stop" is destructive).

I think we need to move the destructive vs. non-destructive check into the diff code and have separate diff fields for destructive and non-destructive updates (see the sketch below). But that'll ripple out to the generic scheduler as well. I'm going to need a little bit of time to get this locked down, so I think it's going to have to miss 1.5.0 GA.
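
As a sketch of that direction (the type and field names below are hypothetical, not the scheduler's actual diff code): split the single update bucket into in-place and destructive fields, so the MaxParallel cap is applied only to the destructive set.

```go
// Package sketch: hypothetical split of the diff's update bucket.
package sketch

// allocTuple is a stand-in for the scheduler's internal alloc pair type.
type allocTuple struct {
	AllocID string
}

// diffResult is a hypothetical diff shape with updates split by kind.
type diffResult struct {
	place             []allocTuple
	stop              []allocTuple
	inplaceUpdate     []allocTuple // non-destructive; not throttled
	destructiveUpdate []allocTuple // stop + place; counts against MaxParallel
}

// limitDestructive caps destructive updates at maxParallel, deferring
// the remainder to a later evaluation. In-place updates pass through
// untouched, preserving the MaxParallel/Stagger semantics.
func limitDestructive(d *diffResult, maxParallel int) (now, deferred []allocTuple) {
	if maxParallel <= 0 || len(d.destructiveUpdate) <= maxParallel {
		return d.destructiveUpdate, nil
	}
	return d.destructiveUpdate[:maxParallel], d.destructiveUpdate[maxParallel:]
}
```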

@tgross (Member, Author) commented Mar 8, 2023

The root cause of the original bug here will be solved in #16401. So I'm going to pull the test improvements out of this PR to land, and then see about tackling the diff changes as a second body of work when I next dig into the reconciler code.

tgross added a commit that referenced this pull request Mar 9, 2023
The tests for the system allocs reconciling code path (`diffSystemAllocs`)
include many impossible test environments, such as passing allocs for the wrong
node into the function. This makes the test assertions nonsensical for use in
walking yourself through the correct behavior.

I've pulled this changeset out of PR #16097 so that we can merge these
improvements and revisit the right approach to fix the problem in #16097 with
less urgency now that the PFNR bug fix has been merged. This changeset breaks up
a couple of tests, expands test coverage, and makes test assertions more
clear. It also corrects one bit of production code that behaves fine in
production because of canonicalization, but forces us to remember to set values
in tests to compensate.
tgross added a commit that referenced this pull request Mar 13, 2023
The system scheduler uses a separate code path for reconciliation. During the
investigation into the "plan for node rejected" bug which was fixed in #16401,
it was discovered this code path doesn't maintain the invariant that no more
than 1 allocation per system job task group (or `count` allocations for sysbatch
jobs) should be left running on a given client. While this condition should be
impossible to encounter, the scheduler should be reconciling these cases.

Add a new `ensureSingleSystemAlloc` function that enforces the invariant that
there can be only a single desired-running allocation for a given system job on
a given node.
@tgross tgross modified the milestones: 1.5.x, 1.6.x Jun 23, 2023
@tgross tgross modified the milestones: 1.6.x, 1.7.x Oct 27, 2023
@tgross tgross removed their assignment Oct 27, 2023
@tgross tgross removed the backport/1.3.x, backport/1.4.x, and backport/1.5.x labels Feb 9, 2024
@tgross tgross removed this from the 1.7.x milestone Feb 12, 2024
@tgross tgross added the stage/needs-rebase label ("This PR needs to be rebased on main before it can be backported to pick up new BPA workflows") May 17, 2024
Labels
stage/needs-rebase, theme/scheduling, theme/system-scheduler, type/bug