kvserver: fix raftMu locking order in Replica.evalAndPropose #71923

Conversation
This ends up applying the result triggers using the client's context: cockroach/pkg/kv/kvserver/replica_raft.go, line 125 in 6ecd426.
This doesn't seem right to me, but it's what the old code did, so I kept the behavior to be conservative. Should we split off a separate background context to use for …
This ends up applying the result triggers using the client's context:
I think this is working as intended. The context is not cancelled at this point and the results "belong" to that workstream (and in particular we want to retain any log tags, etc, in the Context). What we should not do is fork off async work using that context, but we don't do that, do we? The forking-off point is where a new context should be made.
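A generic sketch of that discipline, using only the standard library and placeholder trigger functions (none of this is actual kvserver code): synchronous trigger work keeps the client's context, while anything forked onto a new goroutine gets a fresh context at the forking-off point.

```go
package kvtriggers

import (
	"context"
	"log"
)

// applyTriggers illustrates the context handling described above. The
// trigger functions are placeholders, not real kvserver APIs.
func applyTriggers(clientCtx context.Context) {
	// Synchronous trigger work stays on the client's context so its log
	// tags and tracing information remain attached to the workstream.
	runSyncTrigger(clientCtx)

	// Work forked onto a new goroutine must not inherit the client's
	// cancellation; a fresh context is made at the forking-off point.
	asyncCtx := context.Background()
	go runAsyncTrigger(asyncCtx)
}

func runSyncTrigger(ctx context.Context)  { log.Println("sync trigger; ctx err:", ctx.Err()) }
func runAsyncTrigger(ctx context.Context) { log.Println("async trigger; ctx err:", ctx.Err()) }
```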
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/replica_raft.go, line 78 at r1 (raw file):
// - a closure used to finish a local result if no Raft command was proposed. If
//   non-nil, the caller must call this to process the local result and yield
//   a channel response. The closure may take out raftMu.

... and thus must not be invoked while holding readOnlyCmdMu.
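A rough, self-contained sketch of how that constraint shapes the caller, with simplified stand-in types and signatures (the real code in replica_write.go and replica_raft.go differs):

```go
package kvserver

import (
	"context"
	"sync"
)

// replica is a simplified stand-in for kvserver.Replica; only the two
// mutexes relevant to this sketch are shown.
type replica struct {
	raftMu        sync.Mutex
	readOnlyCmdMu sync.RWMutex
}

// evalAndPropose stands in for Replica.evalAndPropose. When no Raft command
// is proposed, it returns a closure that finishes the local result; that
// closure may take out raftMu and therefore must not run while the caller
// still holds readOnlyCmdMu.
func (r *replica) evalAndPropose(ctx context.Context) (finishLocal func(context.Context), err error) {
	return func(ctx context.Context) {
		r.raftMu.Lock()
		defer r.raftMu.Unlock()
		// ... process the local evaluation result ...
	}, nil
}

// executeWriteBatch shows the caller-side ordering: readOnlyCmdMu is
// released before finishLocal is invoked, preserving the
// raftMu-before-readOnlyCmdMu lock order.
func (r *replica) executeWriteBatch(ctx context.Context) {
	r.readOnlyCmdMu.RLock()
	finishLocal, err := r.evalAndPropose(ctx)
	r.readOnlyCmdMu.RUnlock()

	if err == nil && finishLocal != nil {
		finishLocal(ctx)
	}
}
```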
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
It's still a client context, right? So if the client disconnects at this point, we may end up not processing the triggers in cockroach/pkg/kv/kvserver/replica_proposal.go, line 637 in 86e2ede.
Seems like we'd want to execute those regardless, no?
Force-pushed from 6ecd426 to 4328d57 (compare)
Hmm, yeah, it's a bit of a grey area. Generally speaking, if there is a trigger on the client goroutine that absolutely must execute for correctness, then there is a problem (that goroutine could always "never happen" because of an ill-timed crash). Anything that is "required" must go through replication, where it is then executed by the state machine apply loop. That is the purist argument, but I agree that in practice the risk of slipping in a bug is higher when using the client context. However, deriving a new context from the root context raises more questions: what timeout to use, whether it is OK to forget that the work "belongs" to a tenant, etc. So maybe the real refactor here is to not call the full …
Thanks for catching this and writing up this fix! It looks like it will address the issue.
That said, I'm starting to second-guess this whole business of acquiring the raft mutex above the proposalBuf, which we'd like to be able to consider the boundary between leaseholder evaluation and raft processing. This was arguably too complex even just with the changes needed in #68045, and it looks like that PR caused a lock ordering bug. I wonder whether we should reconsider option 2 floated in the PR description. Writes that set the MaybeGossipSystemConfig/MaybeGossipSystemConfigIfHaveFailure/MaybeGossipNodeLiveness triggers are not performance sensitive, so they benefit very little from being able to avoid consensus in some edge cases. The first two are also going away soon (#70560). It would be easy enough to add these three conditions as reasons why a write request needs consensus. For instance, we could add a RequiresRaft() bool method to result.LocalResult and add that as a condition for needConsensus.
This would also avoid the question of which context to use.
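A minimal sketch of that suggestion, assuming a simplified LocalResult with boolean trigger fields (the real field types and the exact needConsensus condition may differ):

```go
package kvserver

// localResult is a simplified stand-in for result.LocalResult; the real
// struct has many more fields, and some of these triggers are not plain
// booleans there.
type localResult struct {
	MaybeGossipSystemConfig              bool
	MaybeGossipSystemConfigIfHaveFailure bool
	MaybeGossipNodeLiveness              bool
}

// RequiresRaft reports whether the result sets a trigger that should always
// be replicated through Raft rather than applied directly on the leaseholder.
func (lr *localResult) RequiresRaft() bool {
	return lr.MaybeGossipSystemConfig ||
		lr.MaybeGossipSystemConfigIfHaveFailure ||
		lr.MaybeGossipNodeLiveness
}

// needConsensus would then pick up RequiresRaft as one more reason a write
// request must go through consensus, roughly:
//
//	needConsensus := !batch.Empty() || res.Local.RequiresRaft()
```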
Reviewed 2 of 3 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)
Good point.
This seems very appealing; I'll whip up a PR after some meetings. As for 21.2, we'd want to be extremely conservative since we're already preparing the final release candidate. I'm having a hard time gauging which of these approaches is less risky. Fixing the deadlock as done in this PR is an obvious low-risk improvement, but could bypassing consensus for these triggers cause any other serious issues? If not, I'm inclined towards merging this PR as-is and doing the rest of the work in a follow-up PR that we can leave to bake for a while.
My feeling is that 584fb97 (which, as you pointed out, has not been released yet and therefore has never baked in the wild) plus this PR is riskier than disabling no-op writes for operations that set one of these three triggers. Those two changes introduce a series of new code paths that are rarely exercised and likely undertested. Meanwhile, the vast majority of writes that set these triggers go through Raft, so we know those paths are much better tested. So I'd consider option 3 from 584fb97 to be a premature optimization that we decided was not worth the associated risk or code complexity. Option 2 reverts the optimization.
Thanks @nvanbenschoten, I'm leaning in that direction as well.
Just a heads-up that the backport for this PR was just approved (preemptively) by @isaactwong for inclusion in …
Ok. I'll see if I can revert the optimizations in 9f8c019 tonight and get another PR ready. Otherwise, I suppose we have this PR in our back pocket.
Following 584fb97, `Replica.handleReadWriteLocalEvalResult` may take out `raftMu`. However, this was called from `Replica.evalAndPropose` via `Replica.executeWriteBatch`, which already held `readOnlyCmdMu`. This violates the store locking protocol, as `raftMu` must be taken out before `readOnlyCmdMu`, which may lead to deadlocks.

This patch defers the `Replica.handleReadWriteLocalEvalResult` call until after `readOnlyCmdMu` has been released, by returning a `finishLocal` closure that must be called by the caller when appropriate.

Release note: None
Force-pushed from 4328d57 to e5254ce (compare)
Hold up, that wasn't quite the suggestion. I think we still want the commit you linked to. The suggested idea was to revert 584fb97 and instead go with option 2 detailed in that commit message.
Yep, that's the plan -- sorry for the inaccuracy.
No problem, just wanted to make sure we were on the same page before you spent time working on the alternate fix.
Replaced by #71949. We can reopen this if that PR doesn't pan out. |
Following 584fb97, `Replica.handleReadWriteLocalEvalResult` may take out `raftMu`. However, this was called from `Replica.evalAndPropose` via `Replica.executeWriteBatch`, which already held `readOnlyCmdMu`.

This violates the store locking protocol, as `raftMu` must be taken out before `readOnlyCmdMu`, which may lead to deadlocks.

This patch defers the `Replica.handleReadWriteLocalEvalResult` call until after `readOnlyCmdMu` has been released, by returning a `finishLocal` closure that must be called by the caller when appropriate.

Resolves #71646.

Release note: None
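To make the locking-protocol violation concrete, here is a small, self-contained illustration (toy code, not from the CockroachDB tree) of why acquiring these two mutexes in opposite orders on different goroutines can deadlock:

```go
package main

import "sync"

// Two mutexes standing in for Replica.raftMu and Replica.readOnlyCmdMu. The
// store locking protocol requires raftMu to be acquired first.
var (
	raftMu        sync.Mutex
	readOnlyCmdMu sync.RWMutex
)

// correctOrder follows the protocol: raftMu, then readOnlyCmdMu.
func correctOrder() {
	raftMu.Lock()
	readOnlyCmdMu.Lock()
	readOnlyCmdMu.Unlock()
	raftMu.Unlock()
}

// violatingOrder mirrors the bug: a path that takes raftMu (the local-result
// handling) was reachable while readOnlyCmdMu was still held.
func violatingOrder() {
	readOnlyCmdMu.Lock()
	raftMu.Lock()
	raftMu.Unlock()
	readOnlyCmdMu.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); correctOrder() }()
	go func() { defer wg.Done(); violatingOrder() }()
	wg.Wait() // with an unlucky interleaving, both goroutines block forever
}
```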