kvserver: fix raftMu locking order in Replica.evalAndPropose #71923

Conversation
This ends up applying the result triggers using the client's context: cockroach/pkg/kv/kvserver/replica_raft.go, line 125 in 6ecd426.
This doesn't seem right to me, but it's what the old code did, so I kept the behavior to be conservative. Should we split off a separate background context to use for …
This ends up applying the result triggers using the client's context:
I think this is working as intended. The context is not cancelled at this point and the results "belong" to that workstream (and in particular we want to retain any log tags, etc, in the Context). What we should not do is fork off async work using that context, but we don't do that, do we? The forking-off point is where a new context should be made.
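A generic sketch of that discipline, using only the standard library and placeholder trigger functions (none of this is actual kvserver code): synchronous trigger work keeps the client's context, while anything forked onto a new goroutine gets a fresh context at the forking-off point.

```go
package kvtriggers

import (
	"context"
	"log"
)

// applyTriggers illustrates the context handling described above. The
// trigger functions are placeholders, not real kvserver APIs.
func applyTriggers(clientCtx context.Context) {
	// Synchronous trigger work stays on the client's context so its log
	// tags and tracing information remain attached to the workstream.
	runSyncTrigger(clientCtx)

	// Work forked onto a new goroutine must not inherit the client's
	// cancellation; a fresh context is made at the forking-off point.
	asyncCtx := context.Background()
	go runAsyncTrigger(asyncCtx)
}

func runSyncTrigger(ctx context.Context)  { log.Println("sync trigger; ctx err:", ctx.Err()) }
func runAsyncTrigger(ctx context.Context) { log.Println("async trigger; ctx err:", ctx.Err()) }
```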
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/replica_raft.go, line 78 at r1 (raw file):
// - a closure used to finish a local result if no Raft command was proposed. If
//   non-nil, the caller must call this to process the local result and yield
//   a channel response. The closure may take out raftMu.

... and thus must not be invoked while holding readOnlyCmdMu.
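A rough, self-contained sketch of how that constraint shapes the caller, with simplified stand-in types and signatures (the real code in replica_write.go and replica_raft.go differs):

```go
package kvserver

import (
	"context"
	"sync"
)

// replica is a simplified stand-in for kvserver.Replica; only the two
// mutexes relevant to this sketch are shown.
type replica struct {
	raftMu        sync.Mutex
	readOnlyCmdMu sync.RWMutex
}

// evalAndPropose stands in for Replica.evalAndPropose. When no Raft command
// is proposed, it returns a closure that finishes the local result; that
// closure may take out raftMu and therefore must not run while the caller
// still holds readOnlyCmdMu.
func (r *replica) evalAndPropose(ctx context.Context) (finishLocal func(context.Context), err error) {
	return func(ctx context.Context) {
		r.raftMu.Lock()
		defer r.raftMu.Unlock()
		// ... process the local evaluation result ...
	}, nil
}

// executeWriteBatch shows the caller-side ordering: readOnlyCmdMu is
// released before finishLocal is invoked, preserving the
// raftMu-before-readOnlyCmdMu lock order.
func (r *replica) executeWriteBatch(ctx context.Context) {
	r.readOnlyCmdMu.RLock()
	finishLocal, err := r.evalAndPropose(ctx)
	r.readOnlyCmdMu.RUnlock()

	if err == nil && finishLocal != nil {
		finishLocal(ctx)
	}
}
```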
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
It's still a client context, right? So if the client disconnects at this point, we may end up not processing the triggers in cockroach/pkg/kv/kvserver/replica_proposal.go, line 637 in 86e2ede.
Seems like we'd want to execute those regardless, no?
Force-pushed from 6ecd426 to 4328d57 (compare)
Hmm, yeah, it's a bit of a grey area. Generally speaking, if there is a trigger on the client goroutine that absolutely must execute for correctness, then there is a problem (that goroutine could always "never happen" because of an ill-timed crash). Anything that is "required" must go through replication, where it is then executed by the state machine apply loop. That is the purist argument, but I agree that in practice the risk of slipping in a bug is higher when using the client context. However, deriving a new context from the root context raises more questions: what timeout to use, whether it is OK to forget that the work "belongs" to a tenant, etc. So maybe the real refactor here is to not call the full …
Thanks for catching this and writing up this fix! It looks like it will address the issue.
That said, I'm starting to second-guess this whole business of acquiring the raft mutex above the proposalBuf, which we'd like to be able to consider the boundary between leaseholder evaluation and raft processing. This was arguably too complex even just with the changes needed in #68045, and it looks like that PR caused a lock ordering bug. I wonder whether we should reconsider option 2 floated in the PR description. Writes that set the MaybeGossipSystemConfig/MaybeGossipSystemConfigIfHaveFailure/MaybeGossipNodeLiveness triggers are not performance sensitive, so they benefit very little from being able to avoid consensus in some edge cases. The first two are also going away soon (#70560). It would be easy enough to add these three conditions as reasons why a write request needs consensus. For instance, we could add a RequiresRaft() bool method to result.LocalResult and add that as a condition for needConsensus.
This would also avoid the question of which context to use.
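A minimal sketch of that suggestion, assuming a simplified LocalResult with boolean trigger fields (the real field types and the exact needConsensus condition may differ):

```go
package kvserver

// localResult is a simplified stand-in for result.LocalResult; the real
// struct has many more fields, and some of these triggers are not plain
// booleans there.
type localResult struct {
	MaybeGossipSystemConfig              bool
	MaybeGossipSystemConfigIfHaveFailure bool
	MaybeGossipNodeLiveness              bool
}

// RequiresRaft reports whether the result sets a trigger that should always
// be replicated through Raft rather than applied directly on the leaseholder.
func (lr *localResult) RequiresRaft() bool {
	return lr.MaybeGossipSystemConfig ||
		lr.MaybeGossipSystemConfigIfHaveFailure ||
		lr.MaybeGossipNodeLiveness
}

// needConsensus would then pick up RequiresRaft as one more reason a write
// request must go through consensus, roughly:
//
//	needConsensus := !batch.Empty() || res.Local.RequiresRaft()
```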
Reviewed 2 of 3 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)
Good point.
This seems very appealing; I'll whip up a PR after some meetings. As for 21.2, we'd want to be extremely conservative since we're already preparing the final release candidate. I'm having a hard time gauging which of these approaches is less risky. Fixing the deadlock as done in this PR is an obvious low-risk improvement, but could bypassing consensus for these triggers cause any other serious issues? If not, I'm inclined towards merging this PR as-is and doing the rest of the work in a follow-up PR that we can leave to bake for a while.
My feeling is that 584fb97 (which, as you pointed out, has not been released yet and therefore has never baked in the wild) plus this PR is riskier than disabling no-op writes for operations that set one of these three triggers. Those two changes introduce a series of new code paths that are rarely exercised and likely undertested. Meanwhile, the vast majority of writes that set these triggers go through Raft, so we know those paths are much better tested. So I'd consider option 3 from 584fb97 to be a premature optimization that we decided was not worth the associated risk or code complexity. Option 2 reverts the optimization.
Thanks @nvanbenschoten, I'm leaning in that direction as well.
Just a heads-up that the backport for this PR was just approved (preemptively) by @isaactwong for inclusion in …
Ok. I'll see if I can revert the optimizations in 9f8c019 tonight and get another PR ready. Otherwise, I suppose we have this PR in our back pocket.
Following 584fb97, `Replica.handleReadWriteLocalEvalResult` may take out `raftMu`. However, this was called from `Replica.evalAndPropose` via `Replica.executeWriteBatch`, which already held `readOnlyCmdMu`. This violates the store locking protocol, as `raftMu` must be taken out before `readOnlyCmdMu`, which may lead to deadlocks.

This patch defers the `Replica.handleReadWriteLocalEvalResult` call until after `readOnlyCmdMu` has been released, by returning a `finishLocal` closure that must be called by the caller when appropriate.

Release note: None
Force-pushed from 4328d57 to e5254ce (compare)
Hold up, that wasn't quite the suggestion. I think we still want the commit you linked to. The suggested idea was to revert 584fb97 and instead go with option 2 detailed in that commit message.
Yep, that's the plan -- sorry for the inaccuracy.
No problem, just wanted to make sure we were on the same page before you spent time working on the alternate fix.
Replaced by #71949. We can reopen this if that PR doesn't pan out. |
Following 584fb97, `Replica.handleReadWriteLocalEvalResult` may take out `raftMu`. However, this was called from `Replica.evalAndPropose` via `Replica.executeWriteBatch`, which already held `readOnlyCmdMu`.

This violates the store locking protocol, as `raftMu` must be taken out before `readOnlyCmdMu`, which may lead to deadlocks.

This patch defers the `Replica.handleReadWriteLocalEvalResult` call until after `readOnlyCmdMu` has been released, by returning a `finishLocal` closure that must be called by the caller when appropriate.

Resolves #71646.

Release note: None
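To make the locking-protocol violation concrete, here is a small, self-contained illustration (toy code, not from the CockroachDB tree) of why acquiring these two mutexes in opposite orders on different goroutines can deadlock:

```go
package main

import "sync"

// Two mutexes standing in for Replica.raftMu and Replica.readOnlyCmdMu. The
// store locking protocol requires raftMu to be acquired first.
var (
	raftMu        sync.Mutex
	readOnlyCmdMu sync.RWMutex
)

// correctOrder follows the protocol: raftMu, then readOnlyCmdMu.
func correctOrder() {
	raftMu.Lock()
	readOnlyCmdMu.Lock()
	readOnlyCmdMu.Unlock()
	raftMu.Unlock()
}

// violatingOrder mirrors the bug: a path that takes raftMu (the local-result
// handling) was reachable while readOnlyCmdMu was still held.
func violatingOrder() {
	readOnlyCmdMu.Lock()
	raftMu.Lock()
	raftMu.Unlock()
	readOnlyCmdMu.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); correctOrder() }()
	go func() { defer wg.Done(); violatingOrder() }()
	wg.Wait() // with an unlucky interleaving, both goroutines block forever
}
```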