storage: splitPostApply fails with *roachpb.RaftGroupDeletedError #16641

Closed
tbg opened this issue Jun 20, 2017 · 4 comments

@tbg
Member

tbg commented Jun 20, 2017

https://sentry.io/cockroach-labs/cockroachdb/issues/298639766/

*errors.errorString: store.go:1821 *roachpb.RaftGroupDeletedError
  File "github.com/cockroachdb/cockroach/pkg/storage/store.go", line 1821, in splitPostApply
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go", line 582, in handleReplicatedEvalResult
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go", line 784, in handleEvalResult
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3757, in processRaftCommand
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 2888, in handleRaftReadyRaftMuLocked
...
(5 additional frame(s) were not displayed)

store.go:1821 *roachpb.RaftGroupDeletedError
@dianasaur323 dianasaur323 modified the milestones: 1.2, 1.1 Jun 23, 2017
@petermattis petermattis changed the title splitPostApply fails with *roachpb.RaftGroupDeletedError storage: splitPostApply fails with *roachpb.RaftGroupDeletedError Jun 30, 2017
@bdarnell
Contributor

bdarnell commented Jul 6, 2017

Based on the location of the error, something like this seems to have happened:

  1. Node N1 (which has a replica of range R1) crashes (or just falls behind)
  2. Range R1 splits, creating R2
  3. R2 is rebalanced away from N1 (probably concurrently with the next couple of steps)
  4. N1 wakes up and starts catching up
  5. N1 learns of the existence of R2 from a raft message, creating a placeholder Replica
  6. The rebalance in step 3 completes. N1 learns about it and GC's its replica
  7. N1 reaches the split in R1's raft log. It tries to create or update a Replica object for R2, but fails because we have a tombstone.

But step 6 doesn't quite work that way, because R2 could never have become initialized while the split was pending, and we don't currently GC uninitialized replicas. I'm not sure whether there's some path I'm overlooking or if I've gone wrong in one of the other steps.

@cuongdo
Contributor

cuongdo commented Sep 18, 2017

This hasn't happened in the last 3 months, so clearing milestone.

@cuongdo cuongdo removed this from the 1.1 milestone Sep 18, 2017
@nvanbenschoten
Member

This may be related to #21146.

@bdarnell
Contributor

Yes, this is almost certainly a duplicate of #21146. Closing in favor of that one, which has a more detailed analysis.
