
scheduler: stopped-yet-running allocs are still running #10446

Merged: 7 commits merged into main on Sep 13, 2022

Conversation

@notnoop (Contributor) commented Apr 25, 2021

This is a straw-man PR to get the discussion going. Needs more testing before merging.

Updates the scheduler so that it treats an alloc as running and using its resources until it truly completes as reported by the client. Typically an alloc terminates quickly upon being stopped, but an alloc may take a "long" time to properly shut down and free up exclusive resources: e.g. ports, CSI/host volumes.

Previously, the scheduler optimistically treated stopped-yet-running allocs as terminated and could schedule a new alloc in their place. This introduces a window where both the stopped-yet-running alloc and the new alloc are running on the client. During that window, the client's resources can be oversubscribed unexpectedly (triggering OOM kills) or ports/volumes can collide. Avoiding such a window seems like the correct behavior.

I conjecture that in a typical cluster the scheduler will simply place the alloc on a new client without any user-visible downside. For some constrained jobs, the alloc start may be delayed until the previous alloc has stopped.
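
To make the intended accounting change concrete, here is a minimal sketch using toy types rather than Nomad's real structs; the actual change is the TerminalStatus → ClientTerminalStatus filter shown in the review diff further down.

package main

import "fmt"

// Toy allocation with just enough fields to illustrate the accounting change;
// these are not Nomad's real structs.
type alloc struct {
	desiredStatus string // the server's intent: "run" or "stop"
	clientStatus  string // the client's report: "pending", "running", "complete", ...
	cpuMHz        int
}

// clientTerminal reports whether the client says the alloc is truly done.
func (a alloc) clientTerminal() bool {
	switch a.clientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

// cpuInUse sums CPU only for allocs the client still reports as alive.
// Before this PR the scheduler also skipped any alloc the server had marked
// "stop", freeing its resources while it was still shutting down on the client.
func cpuInUse(allocs []alloc) int {
	total := 0
	for _, a := range allocs {
		if a.clientTerminal() {
			continue
		}
		total += a.cpuMHz
	}
	return total
}

func main() {
	allocs := []alloc{
		{desiredStatus: "stop", clientStatus: "running", cpuMHz: 500}, // stopped-yet-running
		{desiredStatus: "run", clientStatus: "running", cpuMHz: 500},
	}
	fmt.Println(cpuInUse(allocs)) // 1000: the stopping alloc still holds its resources
}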

An Alternative

Alternatively, the client could order alloc operations so it starts new allocs only after conflicting stopped allocs complete. Ordering the operations is more complex to implement. It also delays starting the task unnecessarily - the new alloc would have started faster had it been placed on another client without a conflicting alloc.

Fixes #10440

@tgross (Member) commented Apr 26, 2021

@notnoop how would this interact with ephemeral disk migrations? My understanding is that that requires the alloc to have a desired state of stop on the server while the alloc runner has not yet stopped on the client.

@notnoop (Contributor, Author) commented Apr 28, 2021

how would this interact with ephemeral disk migrations?

I believe their logic remains untouched, but I'll test. This PR only changes the resource accounting for a node and whether an alloc may be placed on that node or not.

The other issue that should be tested more extensively is preemption. The server should discount the resources of the alloc that's to be evicted, and the client should sequence things so the evicted alloc completes before the new one starts.

@schmichael (Member) left a comment

Great work coming up with a solution for #10440. I think your assessment of the bug is correct and that this is an optimal solution.

TODO

Probably obvious since this is marked Draft, but just to be clear: it needs at least one unit test to assert the behavior on the scheduler side, and then an e2e test to ensure a well-behaved client interacts as expected with this change.

Client-coordinated 👎

We could fix this on the client, but I don't think there are any benefits there. It would introduce a lot of tricky additional logic to the client and could slow time-to-running for an allocation down to max(kill_timeout) of every job in the cluster! That could be minutes of startup latency!

So I think client-coordinated collision detection is strictly worse than this server-coordinated approach.

Server-coordinated (this) risk

The big risk of this approach that occurs to me is that it will exacerbate forever pending issues. Currently if an allocation is stuck in ClientStatus=pending you can often stop and restart the job to get out of that tricky situation: since the allocation will be DesiredStatus=stop, the scheduler will ignore its resource utilization. This would obviously change that and ClientStatus=pending allocations would forever take up cluster resources until cleaned out.

I think we should at least consider (RFC?) escape hatches for ClientStatus=pending and/or tasks that exceed their kill_timeout before merging this PR. I don't want to make those efforts a hard blocker, but I don't want to make "poorly behaved clients" (forever pending, tasks that outlive their kill_timeout, etc.) an even worse issue for operators by merging this without some consideration of how to remediate these extremely difficult-to-track-down bugs.

Comment on lines -107 to 178
-	if alloc.TerminalStatus() {
+	if alloc.ClientTerminalStatus() {
		continue
	}
A Member commented:

I'm not sure this code is ever hit. The 2 places we call AllocsFit both already filter down to just non-terminal allocs as far as I can tell.

The call in the plan applier calls AllocsByNodeTerminal(false) to generate the list of allocs, so all terminal allocs are already removed AFAICT.

The call in the worker is also passed a list of allocs from AllocsByNodeTerminal(false).

AllocsByNodeTerminal itself makes this all pretty tricky to trace because it's difficult to know the provenance of any given []*Allocation and how it's been populated and filtered.

I think leaving this filter logic in place is ok as a defensive measure, but I think we need to ensure the half-terminal allocs aren't filtered out "upstream."

A Member commented:

I was wrong. This is hit in the plan applier! Tested against main (c2428e1 - v1.3.2-dev) and rebased this commit on top of it.

I missed that after the call to snap.AllocsByNodeTerminal(ws, nodeID, false) (which returns only non-terminal allocs) we append the allocs from the Plan which include the failed alloc: https://github.com/hashicorp/nomad/blob/v1.3.1/nomad/plan_apply.go#L699

To repro:

nomad agent -dev
nomad job run example.nomad
nomad alloc signal $allocid
# wait for restart
nomad alloc signal $allocid
# wait for restart
nomad alloc signal $allocid

# eval for rescheduling sees old alloc as DesiredStatus=run + ClientStatus=failed

@hashicorp-cla commented Nov 10, 2021

CLA assistant check
All committers have signed the CLA.

@benbuzbee (Contributor) commented
We have been running this patch in our cluster since it was released and it's solved this problem for us; is there any way we can help contribute to confidence in merging something?

@schmichael (Member) commented
Thanks for the report @benbuzbee! It's very helpful in cases like this where a change has pretty subtle yet far-reaching implications. We'll consider getting this into main.

@schmichael (Member) commented Jun 30, 2022

This is officially a WIP that I intend to get merged. Still needs quite a bit more testing, but upon investigation I think most of my concerns were unfounded.

The big risk of this approach that occurs to me is that it will exacerbate forever pending issues.

I have observed cases where a forever-pending replacement alloc causes permanent "plan for node rejected" errors for a node, so at least that forever-pending case would likely be fixed by this code, by virtue of not trying to schedule the replacement to begin with.

Many other forever-pending cases require operator intervention to use either alloc stop, alloc signal, or in extreme cases draining and wiping the node. I don't think the scope of this PR needs to be enlarged to try to fix all of those cases. As long as it doesn't make them worse, it's worth it for the obvious correctness benefit laid out in #10440.

Next steps are:

  • Audit all uses of TerminalStatus() in scheduler/
  • More unit tests
  • An e2e test
  • Manually testing sticky=true and/or migrate=true if e2e tests don't cover them
  • Add changelog
  • Add upgrade guide mention so users are aware of the change
  • Consider an internals/scheduler doc update

@schmichael (Member) commented
tl;dr -> Did some manual testing and confirmed this does not interfere with migrate=true and sticky=true. This patch does not interfere with the existing alloc replacement (due to deployments or preemption) code paths.

If you simply redeploy a job using those, the replacement allocs are created atomically (as part of the same plan) in which the old allocs are stopped. This is the same as the old behavior and is optimal: in the case of a deployment, the Nomad client agents handle waiting for the original to exit before allowing the replacement to start. (Preemption also uses this for "original" and "replacement" allocs of 2 different jobs. The scheduler must make the stop-old/place-new placement atomically, otherwise another placement could interleave with the stopping-of-old and placing-of-new.)

But that's not what #10440 is about or what this impacts: this is all about the case where the stop and the placement are discrete operations. First the job is stopped. Then the job (or another job using the same resources) is started.

In this case the Nomad client agent's waiting code is completely bypassed as there's no PreviousAllocation set. From the server's perspective there's no causality between the old alloc using a resource and the new alloc wanting it...

...so Mahmood's patch closes this gap by making the scheduler calculate resource usage based on what's actually running on the client (based on ClientStatus), not what the server wants to be running on the client (DesiredStatus | ClientStatus).

@schmichael (Member) commented Aug 23, 2022

Some attempts at visually demonstrating the change:

Each box represents the DesiredStatus | ClientStatus of an allocation. This assumes there is not enough cluster capacity to start Job 2 before Job 1 stops.

Original behavior

gantt
    title Overlapping allocations
    dateFormat  x
    axisFormat %s
    section Job 1
    Registered       :j1, 0, 2s
    run | pending    :j1pending, after j1, 2s
    run | running    :j1running, after j1pending, 2s
    stop | running   :j1stopping, after j1running, 2s
    stop | stopped   :j1stopped, after j1stopping, 2s

    section Job 2
    Registered       :j2, after j1pending, 1s
    Blocked          :j2blocked, after j2, 1s
    run | pending    :j2pending, after j1running, 1s
    run | running    :j2running, after j2pending, 1s

Implemented behavior

gantt
    title Non-overlapping allocations
    dateFormat  x
    axisFormat %s
    section Job 1
    Registered       :j1, 0, 2s
    run | pending    :j1pending, after j1, 2s
    run | running    :j1running, after j1pending, 2s
    stop | running   :j1stopping, after j1running, 2s
    stop | stopped   :j1stopped, after j1stopping, 2s

    section Job 2
    Registered       :j2, after j1pending, 1s
    Blocked -->      :j2blocked, after j2, 1s
    run | pending    :j2pending, after j1stopping, 1s
    run | running    :j2running, after j2pending, 1s

@tgross removed their assignment Sep 12, 2022
Mahmood Ali and others added 3 commits September 12, 2022 11:23
it's not concise... feedback welcome
testutil/wait.go Outdated
Comment on lines 17 to 34
func Wait(t *testing.T, test testFn) {
	t.Helper()
	retries := 500 * TestMultiplier()
	for retries > 0 {
		time.Sleep(10 * time.Millisecond)
		retries--

		success, err := test()
		if success {
			return
		}

		if retries == 0 {
			t.Fatalf("timeout: %v", err)
		}
	}
}

A Member commented:

I just got sick of writing the error callback. LMK if I should stop being lazy and switch to WaitForResult. 😅

A Member commented:

I was tempted to write a similar helper recently, and I've had a few review comments on other folks' PRs of late where not returning an error in the return false, nil case ends up making for hard-to-debug test failures. There are ~600 uses of WaitForResult* helpers, so at some point I'd like to take a pass through and see how many we could move to this simpler helper... I think it's a lot.

A Member commented:

I'd also like to see this pattern cleaned up - but this particular helper doesn't prevent the nil error issue, and also I think the concept of TestMultiplier can be eliminated in favor of context/timeout

A Member commented:

What if I added

if err == nil {
	t.Fatalf("timeout waiting for test function to succeed (you should probably return a helpful error instead of nil!)")
}

The file name and line number will take you to the beginning of the test function that returned false, nil. Since the stack frame responsible for returning false, nil ... returned ... there's really nothing else we can do here.
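
For illustration, a hedged usage sketch of the helper under discussion (assuming testFn is func() (bool, error), as the loop above implies; listAllocs and the package name are hypothetical stand-ins, not code from this PR):

package nomad_test // hypothetical package name for this sketch

import (
	"fmt"
	"testing"

	"github.com/hashicorp/nomad/testutil"
)

// listAllocs is a hypothetical stand-in for whatever a real test would poll.
func listAllocs() []string { return nil }

func TestWaitUsageSketch(t *testing.T) {
	testutil.Wait(t, func() (bool, error) {
		allocs := listAllocs()
		if len(allocs) == 1 {
			return true, nil
		}
		// Returning a descriptive error instead of nil means a timeout fails
		// with a useful message rather than "timeout: <nil>".
		return false, fmt.Errorf("expected 1 allocation but found %d", len(allocs))
	})
}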

@schmichael (Member) commented

I'll write an e2e test in a followup PR because I want to get this reviewed ASAP. The job_endpoint test I wrote is unfortunately long, but it does properly exercise this behavior. When run on main @ a62a654 it properly fails at:

job_endpoint_test.go:190: timeout: expected 1 allocation but found 2:

Which indicates that the 2nd job registration placed an allocation before the old allocation had a terminal client status.

@schmichael marked this pull request as ready for review September 12, 2022 23:39
@schmichael (Member) left a comment

(I have to approve my own changes to clear my old review status.)

@lgfa29 (Contributor) left a comment

LGTM!

I think you also need to make the same change to the DeviceAccounter?
https://github.com/hashicorp/nomad/blob/v1.3.5/nomad/structs/devices.go#L63-L66

A couple of other things I looked into:

  • CPU cores will be handled by AllocsFit like other resources.
  • CSI volumes have a completely different tracking mechanism (claims) and are released via a Postrun allocrunner hook.
  • Host volume usage doesn't seem to be tracked? Or at least I couldn't find where it happens 😅

I haven't checked quotas (yet), but that would have to be in a separate PR anyway.
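
For reference, a hedged sketch of what the analogous DeviceAccounter change suggested above might look like (paraphrased from the linked devices.go lines, not an exact copy of the eventual commit):

	for _, a := range allocs {
-		// Filter any terminal allocation
-		if a.TerminalStatus() {
+		// Only skip allocs the client has actually finished; a stopped-yet-
+		// running alloc still holds its devices.
+		if a.ClientTerminalStatus() {
			continue
		}
		// ... accumulate the alloc's device usage ...
	}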

Comment on lines +122 to +123
s1, cleanupS1 := TestServer(t, func(c *Config) {
})
A Contributor commented:

Suggested change
-	s1, cleanupS1 := TestServer(t, func(c *Config) {
-	})
+	s1, cleanupS1 := TestServer(t, nil)

@tgross (Member) left a comment

LGTM

@tgross (Member) commented Sep 13, 2022

CSI volumes have a completely different tracking mechanism (claims) and are released via a Postrun allocrunner hook.

Interestingly, with this change and by updating the volumewatcher to query allocs and not just volumes (something I've been stewing on), we might be able to simplify the claim unpublish workflow a bit more.

@@ -7,6 +7,7 @@ import (
 	"testing"

 	"github.com/hashicorp/nomad/ci"
+	"github.com/stretchr/testify/assert"
A Member commented:

just FYI, the top-level test package from shoenig/test is analogous to testify/assert (marking failure without stopping)


@schmichael (Member) commented

LGTM!

I think you also need to make the same change to the DeviceAccounter? https://github.com/hashicorp/nomad/blob/v1.3.5/nomad/structs/devices.go#L63-L66

Great catch! Fixing.

I haven't checked quotas (yet), but that would have to be in a separate PR anyway.

Good call. Looking into that now as well.

@schmichael (Member) commented

I think you also need to make the same change to the DeviceAccounter? https://github.com/hashicorp/nomad/blob/v1.3.5/nomad/structs/devices.go#L63-L66

Fixed in 102af72. Thanks!

I haven't checked quotas (yet), but that would have to be in a separate PR anyway.

Quota accounting broadly uses the rule TerminalStatus() => resources are free, so it will be more permissive than the rest of the scheduler. What quotas report as in-use may be lower than what the scheduler considers in-use while allocs are shutting down (e.g. while allocs are in their shutdown_delay and kill_timeout). This matches existing quota behavior and doesn't keep this fix from preventing the overlapping resource usage that can cause apps to crash, so I'm going to leave quotas alone.

If at some point in the future we find a compelling reason to make Quotas usage more precise, this will serve as useful prior art.
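
To make that divergence concrete, a small self-contained sketch with toy lower-case types (not Nomad's structs): a stopped-yet-running alloc stops counting against quota usage immediately but keeps counting against node usage.

package main

import "fmt"

// Toy stand-ins for Nomad's allocation status fields; not the real structs.
type alloc struct{ desired, client string }

// terminal mirrors the check quotas keep using: a server-side stop already counts as done.
func (a alloc) terminal() bool { return a.desired == "stop" || a.clientTerminal() }

// clientTerminal mirrors the scheduler's new check: only the client's report counts.
func (a alloc) clientTerminal() bool {
	return a.client == "complete" || a.client == "failed" || a.client == "lost"
}

func main() {
	a := alloc{desired: "stop", client: "running"} // e.g. still in shutdown_delay or kill_timeout
	fmt.Println("counts against quota usage:", !a.terminal())      // false: quotas free it immediately
	fmt.Println("counts against node usage:", !a.clientTerminal()) // true: held until the client finishes
}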

@schmichael merged commit 757c3c9 into main Sep 13, 2022
@schmichael deleted the b-stopping-allocs-are-running branch September 13, 2022 19:52
schmichael added a commit that referenced this pull request Sep 22, 2022
Followup to #10446

Fails (as expected) against 1.3.x at the wait for blocked eval (because
the allocs are allowed to overlap).

Passes against 1.4.0-beta.1 (as expected).
schmichael added a commit that referenced this pull request Sep 22, 2022
* test: add e2e for non-overlapping placements

Followup to #10446

Fails (as expected) against 1.3.x at the wait for blocked eval (because
the allocs are allowed to overlap).

Passes against 1.4.0-beta.1 (as expected).

* Update e2e/overlap/overlap_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
@github-actions (bot) commented

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions (bot) locked as resolved and limited conversation to collaborators Jan 12, 2023
Successfully merging this pull request may close these issues.

Allocations should not overlap in execution
7 participants