Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-21.1: kvserver: synchronize replica removal with read-only requests #64370

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Apr 29, 2021

Backport 1/1 commits from #64324.

/cc @cockroachdb/release @cockroachdb/kv


Replica removal did not synchronize with in-flight read-only requests,
which could cause them to be evaluated on a removed (empty) replica,
returning an empty result.

This patch fixes the problem by locking Replica.readOnlyCmdMu during
replica removal, thus either waiting for read-only requests to complete
or not evaluating them.

Resolves #64325.

Release note (bug fix): Fixed a race condition where read-only requests
during replica removal (e.g. during range merges or rebalancing) could
be evaluated on the removed replica, returning an empty result.

@erikgrinaker erikgrinaker self-assigned this Apr 29, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -186,8 +186,6 @@ func (r *Replica) destroyRaftMuLocked(ctx context.Context, nextReplicaID roachpb
// is one, and removes the in-memory raft state.
func (r *Replica) disconnectReplicationRaftMuLocked(ctx context.Context) {
r.raftMu.AssertHeld()
r.readOnlyCmdMu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because it's already held? Can we r.readOnlyCmdMu.AssertHeld() in that case (maybe on master only)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not held, but both callers have already called destroyStatus.Set() with it (temporarily) held. As such, the following code paths (including this one) are guaranteed to not conflict with read-only requests since they will either have completed already or will check destroyStatus and abort.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point about using r.readOnlyCmdMu.AssertHeld() to assert proper locking is still a good one. Did you give any thought to pulling accesses of r.mu.destroyStatus into methods on the replica and asserting proper handling of r.mu and r.readOnlyCmdMu?

maybe on master only

That shouldn't be needed. AssertHeld()/AssertRHeld() should be no-ops on prod builds. The assertions are only armed in race builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you give any thought to pulling accesses of r.mu.destroyStatus into methods on the replica and asserting proper handling of r.mu and r.readOnlyCmdMu?

That's a good idea. I'm going to merge this and the other backports, so we have a fix ready to go, and then submit a follow-up PR to clean up the lock handling a bit. I'd also like to write a quick test for the write path in #46329 first, to see if it has a similar problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any part of this was done manually, please point it out and I'll pore over it in detail once more.

Replica removal did not synchronize with in-flight read-only requests,
which could cause them to be evaluated on a removed (empty) replica,
returning an empty result.

This patch fixes the problem by locking `Replica.readOnlyCmdMu` during
replica removal, thus either waiting for read-only requests to complete
or not evaluating them.

Release note (bug fix): Fixed a race condition where read-only requests
during replica removal (e.g. during range merges or rebalancing) could
be evaluated on the removed replica, returning an empty result.
@erikgrinaker
Copy link
Contributor Author

If any part of this was done manually, please point it out and I'll pore over it in detail once more.

Just the inclusion of #64378 as well, I'll wait for that to be reviewed again and merged before merging the backports.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -186,8 +186,6 @@ func (r *Replica) destroyRaftMuLocked(ctx context.Context, nextReplicaID roachpb
// is one, and removes the in-memory raft state.
func (r *Replica) disconnectReplicationRaftMuLocked(ctx context.Context) {
r.raftMu.AssertHeld()
r.readOnlyCmdMu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point about using r.readOnlyCmdMu.AssertHeld() to assert proper locking is still a good one. Did you give any thought to pulling accesses of r.mu.destroyStatus into methods on the replica and asserting proper handling of r.mu and r.readOnlyCmdMu?

maybe on master only

That shouldn't be needed. AssertHeld()/AssertRHeld() should be no-ops on prod builds. The assertions are only armed in race builds.

@erikgrinaker erikgrinaker merged commit 1753571 into cockroachdb:release-21.1 Apr 30, 2021
@tbg tbg mentioned this pull request Apr 30, 2021
32 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants