
fix deadlock in plan_apply #13407

Merged (6 commits, Jun 23, 2022)
Conversation

@tgross tgross (Member) commented Jun 16, 2022

The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the indexCh forever, and
all schedulers will block in Plan.Submit.

Closing the indexCh when the asyncPlanWait is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.
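
As an aside, the shape of the fix can be illustrated with a small, self-contained Go sketch (the names below are stand-ins, not the actual Nomad code): the apply goroutine sends at most one index and always closes the channel, while the receiver drops its reference to the channel as soon as it has taken a value.

package main

import (
	"fmt"
	"time"
)

// asyncApply stands in for asyncPlanWait: it sends at most one index and then
// closes the channel. Even if the apply fails and nothing is sent, the close
// wakes any receiver, which sees the zero value instead of blocking forever.
func asyncApply(indexCh chan<- uint64, fail bool) {
	defer close(indexCh)
	if fail {
		return
	}
	indexCh <- 42 // the raft index the plan was applied at
}

func main() {
	planIndexCh := make(chan uint64, 1)
	go asyncApply(planIndexCh, true) // simulate the failure path

	select {
	case idx := <-planIndexCh:
		planIndexCh = nil // one-shot channel: drop the reference immediately
		fmt.Println("previous plan index (zero on failure):", idx)
	case <-time.After(time.Second):
		fmt.Println("without the close, this branch would be the deadlock")
	}
	fmt.Println("outstanding plan application?", planIndexCh != nil)
}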


Note for reviewers: the test code here can't actually exercise this bug,
because there's no interface for us to introduce this timeout. But this
code path was not covered by unit testing, so with these tests we at least
exercise the normal state.

6fab937 is how we verified the behavior and that the fix works, but we'll
want to circle back later to build out some fault injection into the state
store code.

The deadlocked goroutine stack looks like this:

goroutine 625541538 [chan receive, 514 minutes]:
github.com/hashicorp/nomad/nomad.(*planner).planApply(0xc000a410a0)
	github.com/hashicorp/nomad/nomad/plan_apply.go:156 +0x35e
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
	github.com/hashicorp/nomad/nomad/leader.go:250 +0x21e

cc @jazzyfresh @lgfa29 @mikenomitch
Fixes #10289

The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the `indexCh` forever, and
all schedulers will block in `Plan.Submit`.

Closing the `indexCh` when the `asyncPlanWait` is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.

This changeset includes a PoC failing test that works by injecting
a large timeout into the state store. We need to turn this into a test
we can run normally without breaking the state store before we can
merge this PR.
@tgross tgross added this to the 1.3.2 milestone Jun 17, 2022
@tgross tgross added the backport/1.1.x (backport to 1.1.x release line), backport/1.2.x (backport to 1.2.x release line), backport/1.3.x (backport to 1.3.x release line), and theme/plan labels Jun 17, 2022
tgross added a commit that referenced this pull request Jun 17, 2022
This changeset is a proof-of-concept for a fault injection interface
into the `FSM.Apply` function. This would allow us to introduce
timeouts or errors in unit testing by adding a LogApplier
implementation to a map of `interceptionAppliers`. This is similar to
how we register LogAppliers for the enterprise FSM functions
currently. Most interception appliers are expected to then call the
normal applier directly.

This was developed initially for #13407 but can't be used to reproduce
that particular bug. But I'm opening this PR for further discussion
about whether this is a worthwhile tool to have for testing otherwise.
@tgross tgross mentioned this pull request Jun 17, 2022
This timeout creates backpressure where any concurrent `Plan.Submit`
RPCs will block waiting for results. This sheds load across all
servers and gives raft some CPU to catch up, because schedulers won't
dequeue more work while waiting. Increase it to 10s based on
observations of large production clusters.
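
A minimal sketch of the pattern described here, with waitForMinIndex standing in for the real state-store call (it is not Nomad's SnapshotMinIndex): the caller blocks until the applied index reaches the plan's minimum index or the 10s deadline fires, and while it blocks it dequeues no more work.

package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// waitForMinIndex stands in for a SnapshotMinIndex-style call: it polls the
// applied index until it reaches minIndex or the context's deadline expires.
func waitForMinIndex(ctx context.Context, applied *atomic.Uint64, minIndex uint64) error {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		if applied.Load() >= minIndex {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for index %d: %w", minIndex, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	var applied atomic.Uint64
	go func() { // simulate a lagging FSM slowly catching up
		for i := 0; i < 100; i++ {
			applied.Add(1)
			time.Sleep(time.Millisecond)
		}
	}()

	// The 10s bound is the backpressure knob: while a plan waits here, the
	// scheduler that submitted it stays blocked in Plan.Submit and dequeues
	// no more work, giving raft time to catch up.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	if err := waitForMinIndex(ctx, &applied, 50); err != nil {
		fmt.Println("plan apply fails and the scheduler retries:", err)
		return
	}
	fmt.Println("state snapshot is at least as fresh as the plan; safe to apply")
}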
@@ -159,6 +159,7 @@ func (p *planner) planApply() {
 		if err != nil {
 			p.logger.Error("failed to update snapshot state", "error", err)
 			pending.respond(nil, err)
+			planIndexCh = nil
Member
Should this be on L157 right after we receive from the chan?

AFAICT the only case where this move would make a difference is if applyPlan on L168 returns a non-nil error. It would then continue with a non-nil planIndexCh that will never receive another result.

I think this change (immediately nil'ing the chan) would match the behavior of the only other place we receive on planIndexCh: L105 in the select. The case where we receive on the chan there immediately and unconditionally sets it to nil as well.

To put it another way: I think planIndexCh is always a one-shot chan. It will always receive exactly one index value and never get reused.

If you think that's accurate I wonder if we should wrap planIndexCh in a little struct with a method like:

func (t *T) Recv() int {
  v := <-t.ch
  t.ch = nil
  return v
}

We also use planIndexCh == nil to signal "no outstanding plan application", so our struct would probably need a helper method for that too. Unsure if the extra struct would simplify or complicate reading this code... just an idea.

Member Author
Should this be on L157 right after we receive from the chan?

Yup, good catch. I was focused on the error branch because that's where the bug showed up, but there's no reason to keep it once we've received a value.

To put it another way: I think planIndexCh is always a one-shot chan. It will always receive exactly one index value and never get reused.

Yes, agreed!

If you think that's accurate I wonder if we should wrap planIndexCh in a little struct with a method like:
...
We also use planIndexCh == nil to signal "no outstanding plan application", so our struct would probably need a helper method for that too. Unsure if the extra struct would simplify or complicate reading this code... just an idea.

Hm... I like this idea in principle. But the logic here leans heavily on the specific semantics of channels (ex. a closed channel isn't nil), and wrapping it in a struct just seems like it'll obscure that by moving the logic away from where we're using it. The fallthrough select on lines 104-113 makes this especially gross because it relies on the fact that we get 0 on error, so we can't even call recv() and check for 0 there; it needs its own index, ok := recvOrDefault() function.

some example code
type asyncPlanIndex struct {
	planIndexCh chan uint64
}

// planIndexCh == nil signals "no outstanding plan application", so an
// outstanding plan exists only while the channel is non-nil.
func (a *asyncPlanIndex) hasOutstandingPlan() bool {
	return a.planIndexCh != nil
}

// recv blocks for the one-shot result, then drops the channel so it can't
// be received from again.
func (a *asyncPlanIndex) recv() uint64 {
	v := <-a.planIndexCh
	a.planIndexCh = nil
	return v
}

// recvOrDefault is the non-blocking variant used by the fallthrough select;
// it reports whether a result was actually available.
func (a *asyncPlanIndex) recvOrDefault() (uint64, bool) {
	select {
	case v := <-a.planIndexCh:
		a.planIndexCh = nil
		return v, true
	default:
		return 0, false
	}
}

// send delivers the single result and closes the channel so any receiver
// is guaranteed to wake up.
func (a *asyncPlanIndex) send(v uint64) {
	a.planIndexCh <- v
	close(a.planIndexCh)
}
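
For reference, the channel semantics this exchange leans on can be shown in a few lines of standalone Go (illustrative only, not Nomad code): a receive on a closed channel returns the zero value immediately, while a nil channel can never be received from, so a nil-channel case in a select never fires.

package main

import "fmt"

func main() {
	// Receiving from a closed channel never blocks: it yields the zero value
	// and ok == false.
	closed := make(chan uint64)
	close(closed)
	v, ok := <-closed
	fmt.Println(v, ok) // 0 false

	// A nil channel can never be received from; in a select, a nil-channel
	// case simply never fires, which is why nil'ing planIndexCh acts as the
	// "no outstanding plan application" signal.
	var nilCh chan uint64
	select {
	case <-nilCh:
		fmt.Println("unreachable: a receive on a nil channel blocks forever")
	default:
		fmt.Println("the nil case is effectively disabled; select falls through")
	}
}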

Member
But the logic here leans heavily on the specific semantics of channels

Oof yeah, this is a pattern to watch out for in the future. It's not always a problem, but it clearly made this code more fragile than if a little effort had been put into encapsulating the logic in a little purpose-built struct.

// This timeout creates backpressure where any concurrent
// Plan.Submit RPCs will block waiting for results. This sheds
// load across all servers and gives raft some CPU to catch up,
// because schedulers won't dequeue more work while waiting.
const timeout = 10 * time.Second
Member
LGTM - The following are observations/thoughts I just wanted to record somewhere in case they're useful (or in case my "LGTM" is based on flawed assumptions).

This does make this code differ from the worker's use of SnapshotMinIndex via worker.snapshotMinIndex. I think this is probably ideal:

  • The plan applier blocking here puts a lot of backpressure on all schedulers and should dramatically help in cases where Raft/FSM is really struggling.

  • The shorter 5s min-index timeout still on the worker seems fine because it fails up to the main eval retry loop, which introduces its own additional retries.

  • ...although looking at the code around wait-for-index failures in the worker, it appears we lose/drop the RefreshIndex sent by the server if we hit that timeout-and-retry case. This would cause the retry to use its existing old snapshot. Probably not intentional, but probably optimal, as it lets workers keep creating plans and the plan applier will toss out the invalid bits with its stricter view of state.

@@ -368,14 +373,12 @@ func updateAllocTimestamps(allocations []*structs.Allocation, timestamp int64) {
 func (p *planner) asyncPlanWait(indexCh chan<- uint64, future raft.ApplyFuture,
 	result *structs.PlanResult, pending *pendingPlan) {
 	defer metrics.MeasureSince([]string{"nomad", "plan", "apply"}, time.Now())
+	defer close(indexCh)
Member
Would this alone have fixed the deadlock? I think so, and since we always use this index inside a max(...) it seems like an always-safe operation.

Oof, good lesson about writing defensive code the first time round.

A reviewer replied:
You never know if you will have time to come back and make it better in a second pass 🤣

@tgross tgross (Member Author) commented Jun 22, 2022
Would this alone have fixed the deadlock?

Yes! That's how we did the initial test and fix. Nil'ing the channel gets us a few additional things:

  • It's belt-and-suspenders so that we know there's no possible way we're still polling on the channel.
  • It makes the usage consistent so that the reader doesn't wonder why the channel is being nil'd in one spot but not the other.
  • It locally communicates to the reader that the channel is being discarded, as opposed to having to go read the asyncPlanWait code further down (and doing so is super cheap).
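
A quick standalone illustration of the earlier point that the received index is only ever folded into a max(...) (the maxUint64 helper below is a stand-in, not Nomad's code): the zero value produced by the closed channel on the failure path can never move the tracked snapshot index backwards.

package main

import "fmt"

// maxUint64 mirrors the max(...) folding described above.
func maxUint64(a, b uint64) uint64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	prevIndex := uint64(1234)

	indexCh := make(chan uint64)
	close(indexCh) // failure path: nothing was sent

	// The receive returns 0 instead of blocking, and max() discards it.
	prevIndex = maxUint64(prevIndex, <-indexCh)
	fmt.Println(prevIndex) // still 1234
}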

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
@tgross tgross marked this pull request as ready for review June 22, 2022 18:43
@tgross tgross (Member Author) commented Jun 23, 2022

Going to merge this to ship in the upcoming Nomad 1.3.2. We've got some additional test work to do to improve our confidence in this overall area of the code, but that's not a blocker to shipping this fix.

@github-actions github-actions (bot)

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022
Labels
backport/1.1.x (backport to 1.1.x release line), backport/1.2.x (backport to 1.2.x release line), backport/1.3.x (backport to 1.3.x release line), theme/plan, type/bug
Development

Successfully merging this pull request may close these issues: Nomad scheduling halting
3 participants