Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: return error from split/merge lock acquisition #19448

Merged
merged 2 commits into from
Oct 23, 2017

Conversation

tbg
Copy link
Member

@tbg tbg commented Oct 23, 2017

See #19172.

@tbg tbg requested a review from a team October 23, 2017 15:40
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten
Copy link
Member

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/store.go, line 3878 at r1 (raw file):

			//
			// Inspired by https://github.com/cockroachdb/cockroach/pull/19404.
			runtime.Gosched()

Gosched can hurt performance, especially in the tail latencies, so I wouldn't say there's no harm. That said, this shouldn't be very common so I doubt it will have an impact. Still, I'm not sure it's worth the risk if there's no reason to add this.


Comments from Reviewable

@tbg
Copy link
Member Author

tbg commented Oct 23, 2017

Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/store.go, line 3878 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Gosched can hurt performance, especially in the tail latencies, so I wouldn't say there's no harm. That said, this shouldn't be very common so I doubt it will have an impact. Still, I'm not sure it's worth the risk if there's no reason to add this.

Note that the Gosched is on the err = errRetry path, so I doubt it matters for performance. And, with a loop like this, I can't prove that there's no reason to add this, and unless we can't, I think we should generally add it to avoid this kind of freak deadlock.


Comments from Reviewable

@petermattis
Copy link
Collaborator

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 4536 at r2 (raw file):

	rightRng, created, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 0, nil)
	if err != nil {
		log.Fatalf(ctx, "while acquiring split lock: %s", err)

Shouldn't you return the error to the caller in order to force error the Raft command? The code in Store.tryGetOrCreateReplica has enough error paths that I can't convince myself none of them will occur.


pkg/storage/store.go, line 3878 at r1 (raw file):

			//
			// Inspired by https://github.com/cockroachdb/cockroach/pull/19404.
			runtime.Gosched()

Was this addition motivated by a real concern. Store.tryGetOrCreateReplica performs multiple lock operations which will definitely provide stop-the-world safe points for the GC.


Comments from Reviewable

Also report the error to sentry because we suspect it to have caused
the bug referenced below in versions of CockroachDB not running with
this commit.

See cockroachdb#19172.
@tbg
Copy link
Member Author

tbg commented Oct 23, 2017

Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 4536 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't you return the error to the caller in order to force error the Raft command? The code in Store.tryGetOrCreateReplica has enough error paths that I can't convince myself none of them will occur.

Good point. I definitely want to learn about the errors out there in the wild though, so I added reporting to sentry. PTAL!


pkg/storage/store.go, line 3878 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Was this addition motivated by a real concern. Store.tryGetOrCreateReplica performs multiple lock operations which will definitely provide stop-the-world safe points for the GC.

No, just paranoia given the recent deadlock in TestRefreshPendingCmds. You're right that the abundance of lock acquisition makes this too paranoid. Removed the commit.


Comments from Reviewable

@bdarnell
Copy link
Contributor

:lgtm:


Reviewed 2 of 4 files at r3.
Review status: 2 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

@nvanbenschoten
Copy link
Member

LGTM, but the PR title is incorrect now.

@petermattis
Copy link
Collaborator

:lgtm:


Review status: 2 of 3 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


Comments from Reviewable

@tbg tbg changed the title storage: fatal on error in acquireSplitLock storage: return error from split/merge lock acquisition Oct 23, 2017
@tbg
Copy link
Member Author

tbg commented Oct 23, 2017

TFTRs, everyone! @nvanbenschoten changed the title.

@nvanbenschoten
Copy link
Member

I stumbled upon this PR while cleaning up #38954 and am now confused about why setting forcedErr to the result of maybeAcquireSplitMergeLock is correct. We don't expect to ever see an error here, but if we do, I don't think we can silently let it prevent one of the replicas from applying the split/merge. That's the kind of thing that causes replica inconsistencies. I think the only option we have is to fatal.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 30, 2019
If this operation fails then we can't simply return an error and
not apply the command. We need to be deterministic at this point.
The only option we have is to fatal.

See cockroachdb#19448 (comment).

The commit also removes the crash reporting added here, which doesn't
appear to have ever fired after two years.

Release note: None
@tbg
Copy link
Member Author

tbg commented Jul 30, 2019

Not sure what we were thinking there, Fatal looks like the only option.

craig bot pushed a commit that referenced this pull request Jul 30, 2019
39158: storage: clean up below-Raft Split and Merge locking r=bdarnell a=nvanbenschoten

This PR contains three related cleanups that I stumbled into while cleaning up raft entry application in preparation for #38954.

The first cleanup is that we avoid improperly handling errors that may be returned when acquiring the split/merge locks. If this operation fails then we can't simply return an error and not apply the command. We need to be deterministic at this point. The only option we have is to fatal. See #19448 (comment).

The second cleanup is that we then remove stale code that was attempting to recover from failed split and merge application. This error handling only made sense in a pre-proposer evaluated KV world.

The third commit addresses a TODO to assert that the RHS range during a split is not initialized. The TODO looks like it went in before proposer evaluated KV, which protects us against both replays and reproposals.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants