scheduler: fix reconciliation of reconnecting allocs #16609

Merged · 7 commits · Mar 24, 2023

Conversation

@lgfa29 (Contributor) commented Mar 22, 2023

When a disconnected client reconnects, the `allocReconciler` must find the
allocations that were created to replace the original disconnected
allocations.

This search was done over only a subset of the non-terminal untainted
allocations, so if the replacement allocations were not in that state the
reconciler did not stop them, leaving the job in an inconsistent state.

This inconsistency was only resolved by a future job evaluation, but by then
the allocation was already considered reconnected, so the specific
reconnection logic was not applied, leading to unexpected outcomes.

This commit fixes the problem by running the reconnecting-allocation
reconciliation logic earlier in the process, leaving the rest of the
reconciler oblivious to reconnecting allocations.

It also uses the full set of allocations to search for replacements,
stopping them even if they are not in the untainted set.

The `SystemScheduler` is not affected by this bug because disconnected
clients don't trigger replacements: every eligible client is already running
an allocation.

Closes #15483
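
As a rough sketch of the flow described above (illustrative only; the helper functions are hypothetical stand-ins, not the reconciler's real API):

```go
// Sketch: reconcile reconnecting allocations first, searching the FULL
// allocation set for their replacements, so the rest of the reconciliation
// never has to reason about reconnecting allocs. Helpers are stubs.
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

func splitReconnecting(all []*structs.Allocation) (reconnecting, rest []*structs.Allocation) { return }
func pickWinner(original, replacement *structs.Allocation) *structs.Allocation               { return original }
func stopLoser(winner, a, b *structs.Allocation)                                             {}
func reconcileRest(rest []*structs.Allocation)                                               {}

func reconcile(all []*structs.Allocation) {
	reconnecting, rest := splitReconnecting(all)

	for _, original := range reconnecting {
		for _, candidate := range all {
			// Replacements point back at the allocation they replaced,
			// even when they are not in the untainted set.
			if candidate.PreviousAllocation != original.ID {
				continue
			}
			winner := pickWinner(original, candidate)
			stopLoser(winner, original, candidate)
		}
	}

	// The remaining reconciliation is oblivious to reconnecting allocations.
	reconcileRest(rest)
}
```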

@lgfa29 changed the title from "scheduler: reconnect alloc with an failed replacement" to "scheduler: fix reconciliation of reconnecting allocs" on Mar 22, 2023
@lgfa29 added the backport/1.3.x, backport/1.4.x, and backport/1.5.x labels on Mar 22, 2023
@lgfa29 added this to the 1.5.x milestone on Mar 22, 2023
@lgfa29 added the hcc/cst (Admin - internal) label on Mar 22, 2023
@lgfa29 marked this pull request as ready for review on March 22, 2023 19:32
desiredChanges.Stop += uint64(len(stop))
untainted = untainted.difference(stop)

// Validate and add reconnecting allocs to the plan so that they will be logged.
a.computeReconnecting(reconnecting)
desiredChanges.Ignore += uint64(len(a.result.reconnectUpdates))
lgfa29 (Contributor, Author):

I haven't tracked down exactly why, but pulling this line up into the reconnecting `if` block above causes a miscount of `Ignore`.

tgross (Member):

I've spent an hour looking at it and can't figure it out either. You mentioned there was a bug around that `remove` variable decrementing; maybe that was resulting in a change to the ignore set?

lgfa29 (Contributor, Author):

Oh, I see it now. `computeUpdates` finds allocs to ignore from the `untainted` set. Before, the reconnecting allocs were not included there, but now they are (if not stopped), so if we increment `Ignore` during the reconnect reconciliation we will double-count the reconnecting allocs.
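
As a toy illustration of that double count (purely hypothetical data, not the reconciler's code):

```go
package main

import "fmt"

func main() {
	// Reconnecting allocs that keep running now stay in the untainted set,
	// so computeUpdates already counts them in Ignore.
	untaintedIgnored := []string{"alloc-a", "alloc-b", "reconnect-1"}
	reconnectUpdates := []string{"reconnect-1"}

	ignore := uint64(len(untaintedIgnored))
	ignore += uint64(len(reconnectUpdates)) // counts reconnect-1 a second time

	fmt.Println("Ignore =", ignore) // prints 4, but only 3 allocs are ignored
}
```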

Comment on lines -1079 to -1083
remove--
// if we've removed all we need to, stop iterating and return.
if remove == 0 {
return remove
}
lgfa29 (Contributor, Author):

I think this may have been a bug as well, where `remove` was miscalculated so we were only stopping a fixed number of replacements, not all of them. But I didn't investigate further since we don't do this anymore.

tgross (Member):

Good call: removing this logic lets computeStop do its job to reduce the total count if we need to, without any spooky action at a distance.

Comment on lines +1188 to +1190
replacementHasBetterScore := originalMaxScoreMeta == nil && replacementMaxScoreMeta != nil ||
(originalMaxScoreMeta != nil && replacementMaxScoreMeta != nil &&
replacementMaxScoreMeta.NormScore > originalMaxScoreMeta.NormScore)
lgfa29 (Contributor, Author):

The existing implementation skips allocations without metrics, which I think may also be a source of inconsistent state. I don't know when MaxNormScore could be nil, but I turned those errors into a "best effort" check by preferring the original alloc unless enough metrics are available.

tgross (Member):

Because these allocations are always existing allocations that have been thru the whole scheduler once already, I suspect NormScore is only ever nil in test scenarios and that was put in to catch panics because of bad test setup. I don't think it makes much of a difference and this code seems a little nicer. 👍

Comment on lines +1192 to +1196
// Check if the replacement has better client status.
// Even with a better placement score make sure we don't replace a running
// allocation with one that is not.
replacementIsRunning := replacement.ClientStatus == structs.AllocClientStatusRunning
originalNotRunning := original.ClientStatus != structs.AllocClientStatusRunning
lgfa29 (Contributor, Author):

This is a new rule I added to prevent a replacement if the original alloc is still running but the replacement is not, even if the replacement has a better placement metric.

I first added it as an attempt to fix the original bug, but it turns out the problem was something else. So I don't think this is strictly necessary, but it kind of makes sense to not replace a running alloc with one that is not, no?
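
A minimal sketch of how such a rule could gate the score comparison (how these flags get combined is my assumption, not necessarily the PR's exact wiring):

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// chooseByScore is a hypothetical helper: a better placement score only wins
// if the replacement is actually running, or the original is not running
// anyway, so a running alloc is never swapped for one that isn't.
func chooseByScore(original, replacement *structs.Allocation, replacementHasBetterScore bool) *structs.Allocation {
	replacementIsRunning := replacement.ClientStatus == structs.AllocClientStatusRunning
	originalNotRunning := original.ClientStatus != structs.AllocClientStatusRunning

	if replacementHasBetterScore && (replacementIsRunning || originalNotRunning) {
		return replacement
	}
	return original
}
```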

tgross (Member):

There's 3 sets of client states to worry about here: pending (for the replacement only), running, and terminal. We also have three checks in the caller that influence whether we ever even call this function: the reconnecting.filterByFailedReconnect() check (line 1082R), the stopReconnecting check (line 1090R) and the replacementAlloc.ServerTerminalStatus() check (line 1131R). That gives us the following table:

| original client status | replacement client status | replacement server status | what to do? | done in this PR? |
|---|---|---|---|---|
| running | running | run | keep highest score, stop the other | ✔️ |
| running | running | stop | keep original, stop replacement | ✔️ |
| running | pending | run | keep highest score, stop the other | ✔️ |
| running | pending | stop | keep original, stop replacement | ✔️ |
| running | terminal | run | keep original, stop replacement | ✔️ |
| running | terminal | stop | keep original, stop replacement | ✔️ |
| terminal | running | run | stop original, reconcile replacement | ❌ (replacement not checked here?) |
| terminal | running | stop | stop original, reconcile replacement | ✔️ |
| terminal | pending | run | stop original, reconcile replacement | ❌ (replacement not checked here?) |
| terminal | pending | stop | stop original, reconcile replacement | ✔️ |
| terminal | terminal | run | stop original, reconcile replacement | ✔️ |
| terminal | terminal | stop | stop original, reconcile replacement | ✔️ |

So it looks like the logic is correct except for maybe the case where the original is filtered by filterByFailedReconnect and the replacement is non-terminal. I'm not sure what happens to the replacement in that case -- does it just fall thru to the rest of the reconciler? In any case, I had to build a big ol' truth table to verify that for myself because the logic is split between this method and its caller. It's kinda hard to follow. Maybe we should move all these client status checks into the caller, so that we only ever call this method if we have two legitimate options?

lgfa29 (Contributor, Author):

> I had to build a big ol' truth table to verify that for myself

Yup, I had to do the same thing, which should've been a signal that this logic is confusing 😅

But I looked into it again, and I'm not sure if we can move this ClientStatus check to the caller because it's an awkward conditional tie-breaker.

The way I approached this was by process of elimination. There are three states in which we can safely stop the reconnecting allocation without needing to compare it with its replacements:

  • desired status not run (so stop or evict): handled by stopReconnecting, but now I'm wondering if we even need to stop in this case as opposed to just skip these allocs 🤷
  • client status failed: handled by filterByFailedReconnect, but we probably don't need this anymore and just handle it in the main loop.

That leaves 5 possible client statuses for the original alloc:

  • lost and unknown: I don't think these are possible for a reconnecting alloc? I think lost is only used for groups without `max_client_disconnect` (which is by definition not the case for allocations that are reconnecting) and the allocation would not be in the reconnecting set if it was unknown, so we can probably ignore these.
  • complete: probably a no-op? Or maybe we still need to check and stop replacements? But only if not batch? Not sure, this seems a bit undefined 😅
  • pending and running: we need to compare with its replacements. If the replacement is newer we keep it, done. If the replacement has a better score we need to check the client status.

And this is where the awkwardness lives. Outside this function we can't make a decision before comparing them and we don't know if the replacement was picked because it had a higher version or better placement (well, we can check again, but that would be kind of redundant).

> I'm not sure what happens to the replacement in that case -- does it just fall thru to the rest of the reconciler?

That's right. My assumption here is the idea of merging timelines. When the client reconnects we assume the replacement allocations were being handled by the reconciler like any other, so if we don't need to stop them we just let the reconciler keep managing them as it has been.

tgross (Member) commented Mar 23, 2023:

> client status failed: handled by filterByFailedReconnect, but we probably don't need this anymore and just handle it in the main loop.

The filterByFailedReconnect should pick up the complete case as well because that's terminal.

> pending and running: we need to compare with its replacements. If the replacement is newer we keep it, done. If the replacement has better score we need to check the client status.

I probably could have explained it better, but I think we only need to compare if the replacement is running. So in other words, the check we're doing here could be before we compare scores and return original early. (And at that point, it could probably be hoisted up to the caller instead.)

lgfa29 (Contributor, Author):

> The filterByFailedReconnect should pick up the complete case as well because that's terminal.

It only looks for allocs that are not server-terminal and whose client status is explicitly failed, so I think complete is not handled.

if !alloc.ServerTerminalStatus() && alloc.ClientStatus == structs.AllocClientStatusFailed {

> I think we only need to compare if the replacement is running

...and has the same job version.

A new job version always wins because we don't want to keep outdated allocs, even if they are the original ones. So the client status check can be done before the score, but it can't be done before the job version check.
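
A minimal sketch of that ordering (hypothetical helper names; not the actual `pickReconnectingAlloc` implementation):

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// hasBetterScore is a hypothetical stand-in for the placement score comparison.
func hasBetterScore(original, replacement *structs.Allocation) bool { return false }

// pick applies the checks in the order discussed: job version first, then
// client status, then placement score.
func pick(original, replacement *structs.Allocation) *structs.Allocation {
	// 1. A newer job version always wins; outdated allocs are never kept,
	//    even if they are the original ones.
	if replacement.Job.Version != original.Job.Version {
		if replacement.Job.Version > original.Job.Version {
			return replacement
		}
		return original
	}

	// 2. Never swap a running original for a replacement that is not running.
	if original.ClientStatus == structs.AllocClientStatusRunning &&
		replacement.ClientStatus != structs.AllocClientStatusRunning {
		return original
	}

	// 3. Otherwise fall back to comparing placement scores.
	if hasBetterScore(original, replacement) {
		return replacement
	}
	return original
}
```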

tgross (Member):

Ah, good catch. Yeah it seems like there's no way to hoist this up.

@@ -5490,14 +5535,15 @@ func TestReconciler_Disconnected_Client(t *testing.T) {
},
},
{
name: "stop-original-alloc-with-old-job-version-and-failed-replacements",
name: "stop-original-alloc-with-old-job-version-and-failed-replacements-replaced",
lgfa29 (Contributor, Author):

I thought this test was redundant because, in the end, the replacement of the replacement was running, but it exercises the path where a reconnecting alloc has multiple replacements.

nextReplacement.PreviousAllocation = replacement.ID

replacement.NextAllocation = nextReplacement.ID
replacement.DesiredStatus = structs.AllocDesiredStatusStop
lgfa29 (Contributor, Author):

This was missing but I'm pretty sure we stop the previous allocation once it has been replaced 🤔


// Find replacement allocations and decide which one to stop. A
// reconnecting allocation may have multiple replacements.
for _, replacementAlloc := range others {
lgfa29 (Contributor, Author):

I think I also need to check that `replacementAlloc.CreateIndex > reconnectingAlloc.CreateIndex`, otherwise in a scenario where both the original and one of its replacements are reconnecting, and they tie in all selection criteria, I think one alloc will stop the other because they will both think they are the original.
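
In the style of the snippet above, the guard might look something like this (illustrative fragment only; the actual fix landed in 0b97e34):

```go
// Only allocations created after the reconnecting one can be its
// replacements, so two reconnecting allocs that tie on every other
// criterion don't end up stopping each other.
for _, replacementAlloc := range others {
	if replacementAlloc.CreateIndex <= reconnectingAlloc.CreateIndex {
		continue // older or same-age alloc: not a replacement
	}
	// ... compare reconnectingAlloc against replacementAlloc ...
}
```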

lgfa29 (Contributor, Author):

Yup! Fixed in 0b97e34.

tgross (Member) left a review comment:

This looks great overall. I've left a large and rambling comment about the legibility of the status checks we're doing that you might want to consider.

If the replacement for a reconnecting allocation is also reconnecting, we
need to make sure we only compare the original with the replacement, and
not the other way around, otherwise the replacement may stop the
original if they tie in the selection criteria.
tgross (Member) left a review:

LGTM, nice work @lgfa29!

Since we are now already iterating over the reconnecting allocations in
a specific method, having a separate loop to find failed allocations is
unnecessary.
lgfa29 (Contributor, Author) commented Mar 24, 2023:

Doing some more edge case testing I thought I broke canary deployments, but it turned out to be an existing bug, so I filed #16644.

I also removed `filterByFailedReconnect` in 129dda0 to keep the allocation selection logic closer together, as discussed in #16609 (comment).

As for allocations that are complete on reconnect, the `pickReconnectingAlloc` logic is applied, which I think is the right thing to do. If you have a batch job with `max_client_disconnect` you should expect it to run multiple times due to replacement allocations.

Labels
backport/1.3.x (backport to 1.3.x release line), backport/1.4.x (backport to 1.4.x release line), backport/1.5.x (backport to 1.5.x release line), hcc/cst (Admin - internal)
Development

Successfully merging this pull request may close these issues: New behavior when max_client_disconnect is used and the worker node moves from disconnected to ready
2 participants