
scheduler: stopped-yet-running allocs are still running #10446

Merged: 7 commits merged into main on Sep 13, 2022

Conversation

@notnoop (Contributor) commented Apr 25, 2021

This is a straw-man PR to get the discussion going. Needs more testing before merging.

Updates the scheduler so that it treats an alloc as running and using its resources until it truly completes as reported by the client. Typically an alloc terminates quickly upon being stopped, but an alloc may take a "long" time to properly shut down and free up exclusive resources: e.g. ports, CSI/host volumes.

Previously, the scheduler optimistically treated stopped-yet-running allocs as terminated and could schedule a new alloc in their place. This introduces a window where both the stopped-yet-running alloc and the new alloc are running on the client. During that window, the client's resources can be oversubscribed unexpectedly (triggering OOM kills) or ports/volumes can collide. Avoiding such a window seems like the correct behavior.

I conjecture that in a typical cluster the scheduler will simply place the alloc on a new client without any user-visible downside. For some constrained jobs, the alloc start may be delayed until the previous alloc has stopped.
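
To make the intended accounting change concrete, here is a minimal sketch using toy types rather than Nomad's real structs; the actual change is the TerminalStatus → ClientTerminalStatus filter shown in the review diff further down.

package main

import "fmt"

// Toy allocation with just enough fields to illustrate the accounting change;
// these are not Nomad's real structs.
type alloc struct {
	desiredStatus string // the server's intent: "run" or "stop"
	clientStatus  string // the client's report: "pending", "running", "complete", ...
	cpuMHz        int
}

// clientTerminal reports whether the client says the alloc is truly done.
func (a alloc) clientTerminal() bool {
	switch a.clientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

// cpuInUse sums CPU only for allocs the client still reports as alive.
// Before this PR the scheduler also skipped any alloc the server had marked
// "stop", freeing its resources while it was still shutting down on the client.
func cpuInUse(allocs []alloc) int {
	total := 0
	for _, a := range allocs {
		if a.clientTerminal() {
			continue
		}
		total += a.cpuMHz
	}
	return total
}

func main() {
	allocs := []alloc{
		{desiredStatus: "stop", clientStatus: "running", cpuMHz: 500}, // stopped-yet-running
		{desiredStatus: "run", clientStatus: "running", cpuMHz: 500},
	}
	fmt.Println(cpuInUse(allocs)) // 1000: the stopping alloc still holds its resources
}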

An Alternative

Alternatively, the client could order alloc operations so it starts new allocs only after conflicting stopped allocs complete. Ordering the operations is more complex to implement. It also delays starting the task unnecessarily - the new alloc would have started faster had it been placed on another client without a conflicting alloc.

Fixes #10440

@tgross (Member) commented Apr 26, 2021

@notnoop how would this interact with ephemeral disk migrations? My understanding is that that requires the alloc to have a desired state of stop on the server while the alloc runner has not yet stopped on the client.

@notnoop (Contributor, Author) commented Apr 28, 2021

how would this interact with ephemeral disk migrations?

I believe their logic remains untouched, but I'll test. This PR only changes the resource accounting for a node and whether an alloc may be placed on that node or not.

The other issue that should be tested more extensively is preemption. The server should discount the resources of the alloc that's to be evicted, and the client should sequence things so the evicted alloc completes before the new one starts.

@schmichael (Member) left a comment

Great work coming up with a solution for #10440. I think your assessment of the bug is correct and that this is an optimal solution.

TODO

Probably obvious since this is marked Draft, but just to be clear: it needs at least one unit test to assert the behavior on the scheduler side, and then an e2e test to ensure a well-behaved client interacts as expected with this change.

Client-coordinated 👎

We could fix this on the client, but I don't think there are any benefits there. It would introduce a lot of tricky additional logic to the client and could slow time-to-running for an allocation down to max(kill_timeout) of every job in the cluster! That could be minutes of startup latency!

So I think client-coordinated collision detection is strictly worse than this server-coordinated approach.

Server-coordinated (this) risk

The big risk of this approach that occurs to me is that it will exacerbate forever pending issues. Currently if an allocation is stuck in ClientStatus=pending you can often stop and restart the job to get out of that tricky situation: since the allocation will be DesiredStatus=stop, the scheduler will ignore its resource utilization. This would obviously change that and ClientStatus=pending allocations would forever take up cluster resources until cleaned out.

I think we should at least consider (RFC?) escape hatches for ClientStatus=pending and/or tasks that exceed their kill_timeout before merging this PR. I don't want to make those efforts a hard blocker, but I don't want to make "poorly behaved clients" (forever pending, tasks that outlive their kill_timeout, etc.) an even worse issue for operators by merging this without some consideration of how to remediate these extremely difficult-to-track-down bugs.

Comment on lines -107 to 178
-	if alloc.TerminalStatus() {
+	if alloc.ClientTerminalStatus() {
		continue
	}
A Member commented:

I'm not sure this code is ever hit. The 2 places we call AllocsFit both already filter down to just non-terminal allocs as far as I can tell.

The call in the plan applier calls AllocsByNodeTerminal(false) to generate the list of allocs, so all terminal allocs are already removed AFAICT.

The call in the worker is also passed a list of allocs from AllocsByNodeTerminal(false).

AllocsByNodeTerminal itself makes this all pretty tricky to trace because it's difficult to know the provenance of any given []*Allocation and how it's been populated and filtered.

I think leaving this filter logic in place is ok as a defensive measure, but I think we need to ensure the half-terminal allocs aren't filtered out "upstream."

A Member commented:

I was wrong. This is hit in the plan applier! Tested against main (c2428e1 - v1.3.2-dev) and rebased this commit on top of it.

I missed that after the call to snap.AllocsByNodeTerminal(ws, nodeID, false) (which returns only non-terminal allocs) we append the allocs from the Plan which include the failed alloc: https://github.com/hashicorp/nomad/blob/v1.3.1/nomad/plan_apply.go#L699

To repro:

nomad agent -dev
nomad job run example.nomad
nomad alloc signal $allocid
# wait for restart
nomad alloc signal $allocid
# wait for restart
nomad alloc signal $allocid

# eval for rescheduling sees old alloc as DesiredStatus=run + ClientStatus=failed

@hashicorp-cla commented Nov 10, 2021

CLA assistant check
All committers have signed the CLA.

@benbuzbee (Contributor) commented
We have been running this patch in our cluster since it was released and it's solved this problem for us; is there any way we can help contribute to confidence in merging something?

@schmichael (Member) commented
Thanks for the report @benbuzbee! It's very helpful in cases like this where a change has pretty subtle yet far-reaching implications. We'll consider getting this into main.

@schmichael (Member) commented Jun 30, 2022

This is officially a WIP that I intend to get merged. Still needs quite a bit more testing, but upon investigation I think most of my concerns were unfounded.

The big risk of this approach that occurs to me is that it will exacerbate forever pending issues.

I have observed cases where a forever-pending replacement alloc causes permanent "plan for node rejected" errors for a node, so at least that forever-pending case would likely be fixed by this code, by virtue of not trying to schedule the replacement to begin with.

Many other forever-pending cases require operator intervention to use either alloc stop, alloc signal, or in extreme cases draining and wiping the node. I don't think the scope of this PR needs to be enlarged to try to fix all of those cases. As long as it doesn't make them worse, it's worth it for the obvious correctness benefit laid out in #10440.

Next steps are:

  • Audit all uses of TerminalStatus() in scheduler/
  • More unit tests
  • An e2e test
  • Manually testing sticky=true and/or migrate=true if e2e tests don't cover them
  • Add changelog
  • Add upgrade guide mention so users are aware of the change
  • Consider an internals/scheduler doc update

@schmichael (Member) commented
tl;dr -> Did some manual testing and confirmed this does not interfere with migrate=true and sticky=true. This patch does not interfere with the existing alloc replacement (due to deployments or preemption) code paths.

If you simply redeploy a job using those, the replacement allocs are created atomically (as part of the same plan) in which the old allocs are stopped. This is the same as the old behavior and is optimal: in the case of a deployment, the Nomad client agents handle waiting for the original to exit before allowing the replacement to start. (Preemption also uses this for "original" and "replacement" allocs of 2 different jobs. The scheduler must make the stop-old/place-new placement atomically, otherwise another placement could interleave with the stopping-of-old and placing-of-new.)

But that's not what #10440 is about or what this impacts: this is all about the case where the stop and the placement are discrete operations. First the job is stopped. Then the job (or another job using the same resources) is started.

In this case the Nomad client agent's waiting code is completely bypassed as there's no PreviousAllocation set. From the server's perspective there's no causality between the old alloc using a resource and the new alloc wanting it...

...so Mahmood's patch closes this gap by making the scheduler calculate resource usage based on what's actually running on the client (based on ClientStatus), not what the server wants to be running on the client (DesiredStatus | ClientStatus).

@schmichael (Member) commented Aug 23, 2022

Some attempts at visually demonstrating the change:

Each box represents the DesiredStatus | ClientStatus of an allocation. This assumes there is not enough cluster capacity to start Job 2 before Job 1 stops.

Original behavior

gantt
    title Overlapping allocations
    dateFormat  x
    axisFormat %s
    section Job 1
    Registered       :j1, 0, 2s
    run | pending    :j1pending, after j1, 2s
    run | running    :j1running, after j1pending, 2s
    stop | running   :j1stopping, after j1running, 2s
    stop | stopped   :j1stopped, after j1stopping, 2s

    section Job 2
    Registered       :j2, after j1pending, 1s
    Blocked          :j2blocked, after j2, 1s
    run | pending    :j2pending, after j1running, 1s
    run | running    :j2running, after j2pending, 1s

Implemented behavior

gantt
    title Non-overlapping allocations
    dateFormat  x
    axisFormat %s
    section Job 1
    Registered       :j1, 0, 2s
    run | pending    :j1pending, after j1, 2s
    run | running    :j1running, after j1pending, 2s
    stop | running   :j1stopping, after j1running, 2s
    stop | stopped   :j1stopped, after j1stopping, 2s

    section Job 2
    Registered       :j2, after j1pending, 1s
    Blocked -->      :j2blocked, after j2, 1s
    run | pending    :j2pending, after j1stopping, 1s
    run | running    :j2running, after j2pending, 1s

@tgross removed their assignment Sep 12, 2022
Mahmood Ali and others added 3 commits September 12, 2022 11:23
it's not concise... feedback welcome
testutil/wait.go Outdated
Comment on lines 17 to 34
func Wait(t *testing.T, test testFn) {
	t.Helper()
	retries := 500 * TestMultiplier()
	for retries > 0 {
		time.Sleep(10 * time.Millisecond)
		retries--

		success, err := test()
		if success {
			return
		}

		if retries == 0 {
			t.Fatalf("timeout: %v", err)
		}
	}
}

A Member commented:

I just got sick of writing the error callback. LMK if I should stop being lazy and switch to WaitForResult. 😅

A Member commented:

I was tempted to write a similar helper recently, and I've had a few review comments on other folks' PRs of late where not returning an error in the return false, nil case ends up making for hard-to-debug test failures. There are ~600 uses of WaitForResult* helpers, so at some point I'd like to take a pass through and see how many we could move to this simpler helper... I think it's a lot.

A Member commented:

I'd also like to see this pattern cleaned up - but this particular helper doesn't prevent the nil error issue, and also I think the concept of TestMultiplier can be eliminated in favor of context/timeout

A Member commented:

What if I added

if err == nil {
	t.Fatalf("timeout waiting for test function to succeed (you should probably return a helpful error instead of nil!)")
}

The file name and line number will take you to the beginning of the test function that returned false, nil. Since the stack frame responsible for returning false, nil ... returned ... there's really nothing else we can do here.
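
For illustration, a hedged usage sketch of the helper under discussion (assuming testFn is func() (bool, error), as the loop above implies; listAllocs and the package name are hypothetical stand-ins, not code from this PR):

package nomad_test // hypothetical package name for this sketch

import (
	"fmt"
	"testing"

	"github.com/hashicorp/nomad/testutil"
)

// listAllocs is a hypothetical stand-in for whatever a real test would poll.
func listAllocs() []string { return nil }

func TestWaitUsageSketch(t *testing.T) {
	testutil.Wait(t, func() (bool, error) {
		allocs := listAllocs()
		if len(allocs) == 1 {
			return true, nil
		}
		// Returning a descriptive error instead of nil means a timeout fails
		// with a useful message rather than "timeout: <nil>".
		return false, fmt.Errorf("expected 1 allocation but found %d", len(allocs))
	})
}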

@schmichael (Member) commented

I'll write an e2e test in a followup PR because I want to get this reviewed ASAP. The job_endpoint test I wrote is unfortunately long, but it does properly exercise this behavior. When run on main @ a62a654 it properly fails at:

job_endpoint_test.go:190: timeout: expected 1 allocation but found 2:

Which indicates that the 2nd job registration placed an allocation before the old allocation had a terminal client status.

@schmichael marked this pull request as ready for review September 12, 2022 23:39
@schmichael (Member) left a comment

(I have to approve my own changes to clear my old review status.)

@lgfa29 (Contributor) left a comment

LGTM!

I think you also need to make the same change to the DeviceAccounter?
https://github.com/hashicorp/nomad/blob/v1.3.5/nomad/structs/devices.go#L63-L66

A couple of other things I looked into:

  • CPU cores will be handled by AllocsFit like other resources.
  • CSI volumes have a completely different tracking mechanism (claims) and are released via a Postrun allocrunner hook.
  • Host volume usage doesn't seem to be tracked? Or at least I couldn't find where it happens 😅

I haven't checked quotas (yet), but that would have to be in a separate PR anyway.
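
For reference, a hedged sketch of what the analogous DeviceAccounter change suggested above might look like (paraphrased from the linked devices.go lines, not an exact copy of the eventual commit):

	for _, a := range allocs {
-		// Filter any terminal allocation
-		if a.TerminalStatus() {
+		// Only skip allocs the client has actually finished; a stopped-yet-
+		// running alloc still holds its devices.
+		if a.ClientTerminalStatus() {
			continue
		}
		// ... accumulate the alloc's device usage ...
	}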

Comment on lines +122 to +123
s1, cleanupS1 := TestServer(t, func(c *Config) {
})
A Contributor commented:

Suggested change
-	s1, cleanupS1 := TestServer(t, func(c *Config) {
-	})
+	s1, cleanupS1 := TestServer(t, nil)

@tgross (Member) left a comment

LGTM

@tgross (Member) commented Sep 13, 2022

CSI volumes have a completely different tracking mechanism (claims) and are released via a Postrun allocrunner hook.

Interestingly, with this change and by updating the volumewatcher to query allocs and not just volumes (something I've been stewing on), we might be able to simplify the claim unpublish workflow a bit more.

@@ -7,6 +7,7 @@ import (
 	"testing"

 	"github.com/hashicorp/nomad/ci"
+	"github.com/stretchr/testify/assert"
A Member commented:

just FYI, the top-level test package from shoenig/test is analogous to testify/assert (marking failure without stopping)


@schmichael (Member) commented

LGTM!

I think you also need to make the same change to the DeviceAccounter? https://github.com/hashicorp/nomad/blob/v1.3.5/nomad/structs/devices.go#L63-L66

Great catch! Fixing.

I haven't checked quotas (yet), but that would have to be in a separate PR anyway.

Good call. Looking into that now as well.

@schmichael (Member) commented

I think you also need to make the same change to the DeviceAccounter? https://github.com/hashicorp/nomad/blob/v1.3.5/nomad/structs/devices.go#L63-L66

Fixed in 102af72. Thanks!

I haven't checked quotas (yet), but that would have to be in a separate PR anyway.

Quota accounting broadly uses the rule TerminalStatus() => resources are free, so it will be more permissive than the rest of the scheduler. What quotas report as in-use may be lower than what the scheduler considers in-use while allocs are shutting down (e.g. while allocs are in their shutdown_delay and kill_timeout). This matches existing quota behavior and doesn't keep this fix from preventing the overlapping resource usage that can cause apps to crash, so I'm going to leave quotas alone.

If at some point in the future we find a compelling reason to make Quotas usage more precise, this will serve as useful prior art.
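
To make that divergence concrete, a small self-contained sketch with toy lower-case types (not Nomad's structs): a stopped-yet-running alloc stops counting against quota usage immediately but keeps counting against node usage.

package main

import "fmt"

// Toy stand-ins for Nomad's allocation status fields; not the real structs.
type alloc struct{ desired, client string }

// terminal mirrors the check quotas keep using: a server-side stop already counts as done.
func (a alloc) terminal() bool { return a.desired == "stop" || a.clientTerminal() }

// clientTerminal mirrors the scheduler's new check: only the client's report counts.
func (a alloc) clientTerminal() bool {
	return a.client == "complete" || a.client == "failed" || a.client == "lost"
}

func main() {
	a := alloc{desired: "stop", client: "running"} // e.g. still in shutdown_delay or kill_timeout
	fmt.Println("counts against quota usage:", !a.terminal())      // false: quotas free it immediately
	fmt.Println("counts against node usage:", !a.clientTerminal()) // true: held until the client finishes
}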

@schmichael merged commit 757c3c9 into main Sep 13, 2022
@schmichael deleted the b-stopping-allocs-are-running branch September 13, 2022 19:52
schmichael added a commit that referenced this pull request Sep 22, 2022
Followup to #10446

Fails (as expected) against 1.3.x at the wait for blocked eval (because
the allocs are allowed to overlap).

Passes against 1.4.0-beta.1 (as expected).
schmichael added a commit that referenced this pull request Sep 22, 2022
* test: add e2e for non-overlapping placements

Followup to #10446

Fails (as expected) against 1.3.x at the wait for blocked eval (because
the allocs are allowed to overlap).

Passes against 1.4.0-beta.1 (as expected).

* Update e2e/overlap/overlap_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
@github-actions (bot) commented

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions (bot) locked as resolved and limited conversation to collaborators Jan 12, 2023
Successfully merging this pull request may close these issues.

Allocations should not overlap in execution
7 participants