
kvserver: fix raftMu locking order in Replica.evalAndPropose #71923

Closed

Conversation

@erikgrinaker (Contributor) commented Oct 25, 2021

Following 584fb97, `Replica.handleReadWriteLocalEvalResult` may take out `raftMu`. However, it was called from `Replica.evalAndPropose` via `Replica.executeWriteBatch`, which already held `readOnlyCmdMu`. This violates the store locking protocol, as `raftMu` must be taken out before `readOnlyCmdMu`, which may lead to deadlocks.

This patch defers the `Replica.handleReadWriteLocalEvalResult` call until after `readOnlyCmdMu` has been released, by returning a `finishLocal` closure that the caller must invoke when appropriate.

Resolves #71646.

Release note: None
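To make the deferral concrete, here is a minimal, self-contained sketch of the pattern the patch describes. All names and signatures below are simplified stand-ins rather than the actual kvserver code; the real `evalAndPropose` and `executeWriteBatch` carry far more state than shown here.

```go
package main

import (
	"fmt"
	"sync"
)

// replica is a toy stand-in for kvserver.Replica with just the two mutexes
// relevant here. The locking protocol requires raftMu to be acquired before
// readOnlyCmdMu.
type replica struct {
	raftMu        sync.Mutex
	readOnlyCmdMu sync.Mutex
}

// handleLocalResult stands in for handleReadWriteLocalEvalResult: it may need
// raftMu, so it must never be called while readOnlyCmdMu is held.
func (r *replica) handleLocalResult() {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	fmt.Println("local result processed under raftMu")
}

// evalAndPropose no longer processes the local result itself. When no Raft
// command needs to be proposed, it hands back a finishLocal closure that the
// caller runs once readOnlyCmdMu has been released.
func (r *replica) evalAndPropose() (finishLocal func()) {
	return r.handleLocalResult
}

// executeWriteBatch holds readOnlyCmdMu across evaluation, releases it, and
// only then invokes the closure, preserving the raftMu-before-readOnlyCmdMu
// ordering.
func (r *replica) executeWriteBatch() {
	r.readOnlyCmdMu.Lock()
	finishLocal := r.evalAndPropose()
	r.readOnlyCmdMu.Unlock()

	if finishLocal != nil {
		finishLocal()
	}
}

func main() {
	var r replica
	r.executeWriteBatch()
}
```

Calling `finishLocal` only after `readOnlyCmdMu` has been dropped is what restores the required lock ordering.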


@erikgrinaker (Contributor, Author)

This ends up applying the result triggers using the client's context:

r.handleReadWriteLocalEvalResult(ctx, *proposal.Local, false /* raftMuHeld */)

This doesn't seem right to me, but it's what the old code did, so I kept the behavior to be conservative. Should we split off a separate background context to use for handleReadWriteLocalEvalResult(), or is there a legitimate reason to use the client's context here?

@tbg (Member) left a comment

> This ends up applying the result triggers using the client's context:

I think this is working as intended. The context is not cancelled at this point and the results "belong" to that workstream (and in particular we want to retain any log tags, etc, in the Context). What we should not do is fork off async work using that context, but we don't do that, do we? The forking-off point is where a new context should be made.
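A tiny self-contained illustration of that distinction (hypothetical names only; this does not mirror the kvserver code): synchronous result handling keeps the client's context, while anything forked off to outlive the request derives a fresh context at the forking-off point.

```go
package main

import (
	"context"
	"fmt"
)

// applySynchronously keeps using the client's context: it runs on the request
// goroutine, the context is still live, and it carries the request's log tags.
func applySynchronously(ctx context.Context) {
	fmt.Println("synchronous handling, ctx error:", ctx.Err())
}

// forkAsyncWork derives a fresh context at the forking-off point so the async
// work is not cancelled when the client goes away. A real server would use a
// server-scoped, annotated context rather than plain context.Background().
func forkAsyncWork(done chan<- struct{}) {
	go func(ctx context.Context) {
		fmt.Println("async work, ctx error:", ctx.Err())
		close(done)
	}(context.Background())
}

func main() {
	clientCtx, cancel := context.WithCancel(context.Background())
	applySynchronously(clientCtx)

	done := make(chan struct{})
	forkAsyncWork(done)
	cancel() // the client disconnects; the detached async work is unaffected
	<-done
}
```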



pkg/kv/kvserver/replica_raft.go, line 78 at r1 (raw file):

// - a closure used to finish a local result if no Raft command was proposed. If
//   non-nil, the caller must call this to process the local result and yield
//   a channel response. The closure may take out raftMu.

... and thus must not be invoked while holding `readOnlyCmdMu`.

@tbg (Member) left a comment

:lgtm:


@erikgrinaker (Contributor, Author)

> The context is not cancelled at this point

It's still a client context, right? So if the client disconnects at this point, we may end up not processing the triggers in handleReadWriteLocalEvalResult:

func (r *Replica) handleReadWriteLocalEvalResult(

Seems like we'd want to execute those regardless, no?

@tbg (Member) commented Oct 25, 2021

Hmm, yeah, it's a bit of a grey area. Generally speaking, if there is a trigger on the client goroutine that absolutely must execute for correctness, then there is already a problem: that goroutine could always "never happen" because of an ill-timed crash. Anything that is "required" must go through replication, where it is then executed by the state machine apply loop.

That is the purist argument, but I agree that in practice the risk of slipping in an unnecessary bug is higher when using the client context. However, deriving a new context from the root context raises more questions: what timeout do we pick, is it OK to forget whether the work "belongs" to a tenant, etc.

So maybe the real refactor here is to not call the full handleReadWriteLocalEvalResult, but rather a method that only handles the side effects that can actually occur on the unreplicated path, where it is clear that those effects may not occur despite the operation having taken place (though perhaps without returning a result to the client), and where we are prudent about ignoring context cancellation where that is useful.

@nvanbenschoten (Member) left a comment

Thanks for catching this and writing up this fix! It looks like it will address the issue.

That said, I'm starting to second-guess this whole business of acquiring the raft mutex above the proposalBuf, which we'd like to be able to consider the boundary between leaseholder evaluation and raft processing. This was arguably too complex even just with the changes needed in #68045, and it looks like that PR caused a lock ordering bug. I wonder whether we should reconsider option 2 floated in that PR's description.

Writes that set the MaybeGossipSystemConfig/MaybeGossipSystemConfigIfHaveFailure/MaybeGossipNodeLiveness triggers are not performance sensitive, so they benefit very little from being able to avoid consensus in some edge cases. The first two are also going away soon (#70560). It would be easy enough to add these three conditions as reasons why a write request needs consensus. For instance, we could add a RequiresRaft() bool method to result.LocalResult and add that as a condition for needConsensus.

This would also avoid the question of which context to use.
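A rough sketch of what that could look like; the `LocalResult` fields and the `needConsensus` shape below are assumptions for illustration, not the actual definitions in `result.LocalResult`:

```go
package main

import "fmt"

// localResult is a stand-in for result.LocalResult, reduced to the three
// gossip triggers discussed above. The real struct carries many more fields,
// and the field types here are assumptions.
type localResult struct {
	maybeGossipSystemConfig              bool
	maybeGossipSystemConfigIfHaveFailure bool
	maybeGossipNodeLiveness              bool
}

// requiresRaft reports whether the local result sets a trigger that should
// always be routed through consensus instead of the no-op-write fast path.
func (lr localResult) requiresRaft() bool {
	return lr.maybeGossipSystemConfig ||
		lr.maybeGossipSystemConfigIfHaveFailure ||
		lr.maybeGossipNodeLiveness
}

// needConsensus mimics the decision point: a write goes through Raft if it
// actually writes data or if its local result requires it.
func needConsensus(batchIsNonEmpty bool, lr localResult) bool {
	return batchIsNonEmpty || lr.requiresRaft()
}

func main() {
	lr := localResult{maybeGossipNodeLiveness: true}
	fmt.Println(needConsensus(false /* batchIsNonEmpty */, lr)) // true
}
```

With something like this in place, writes carrying these triggers always take the Raft path, and the no-op fast path never has to grab `raftMu` while `readOnlyCmdMu` is held.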


@erikgrinaker (Contributor, Author)

Anything that is "required" must go through replication where it is then executed by the state machine apply loop.

Good point.

> Writes that set the MaybeGossipSystemConfig/MaybeGossipSystemConfigIfHaveFailure/MaybeGossipNodeLiveness triggers are not performance sensitive, so they benefit very little from being able to avoid consensus in some edge cases. The first two are also going away soon (#70560). It would be easy enough to add these three conditions as reasons why a write request needs consensus. For instance, we could add a RequiresRaft() bool method to result.LocalResult and add that as a condition for needConsensus.

This seems very appealing; I'll whip up a PR after some meetings.

As for 21.2, we'd want to be extremely conservative since we're already preparing the final release candidate. I'm having a hard time gauging which of these approaches is less risky. Fixing the deadlock as done in this PR is an obvious low-risk improvement, but could bypassing consensus for these triggers cause any other serious issues? If not, I'm inclined towards merging this PR as-is and doing the rest of the work in a follow-up PR that we can leave to bake for a while.

@nvanbenschoten (Member)

> As for 21.2, we'd want to be extremely conservative since we're already preparing the final release candidate. I'm having a hard time gauging which of these approaches is less risky. Fixing the deadlock as done in this PR is an obvious low-risk improvement, but could bypassing consensus for these triggers cause any other serious issues? If not, I'm inclined towards merging this PR as-is and doing the rest of the work in a follow-up PR that we can leave to bake for a while.

My feeling is that 584fb97 (which as you pointed out has never been released yet and therefore never baked in the wild) + this PR is riskier than disabling no-op writes for operations that set one of these three triggers. These first two changes have a series of new code paths that are rarely exercised and likely undertested. Meanwhile, the vast majority of writes that set these triggers go through Raft, so we know those paths are much more well tested.

So I'd consider option 3 from 584fb97 to be a premature optimization that we decided was not worth the associated risk or code complexity. Option 2 reverts the optimization.

@erikgrinaker (Contributor, Author)

Thanks @nvanbenschoten, I'm leaning in that direction as well.

@tbg (Member) commented Oct 25, 2021

Just a heads up that the backport for this PR was just approved (preemptively) by @isaactwong for inclusion in betarc.3; this means this PR should merge+backport by tomorrow EOD (with maybe a bit of leeway into Wednesday but better not to go there :-))

@erikgrinaker (Contributor, Author) commented Oct 25, 2021

Ok. I'll see if I can revert the optimizations in 9f8c019 tonight and get another PR ready. Otherwise, I suppose we have this PR in our back pocket.

@nvanbenschoten (Member)

Hold up, that wasn't quite the suggestion. I think we still want the commit you linked to. The suggested idea was to revert 584fb97 and instead go with option 2 detailed in that commit message.

@erikgrinaker (Contributor, Author)

> Hold up, that wasn't quite the suggestion. I think we still want the commit you linked to. The suggested idea was to revert 584fb97 and instead go with option 2 detailed in that commit message.

Yep, that's the plan -- sorry for the inaccuracy.

@nvanbenschoten (Member)

No problem, just wanted to make sure we were on the same page before you spent time working on the alternate fix.

@erikgrinaker (Contributor, Author) commented Oct 25, 2021

Replaced by #71949. We can reopen this if that PR doesn't pan out.

@erikgrinaker deleted the raft-eval-deadlock branch on Oct 27, 2021.