
kvserver: de-flake TestStoreRangeMergeRHSLeaseExpiration #47508

Merged: 1 commit merged into cockroachdb:master from the TestStoreRangeMergeRHSLeaseExpiration branch on Apr 21, 2020

Conversation

@tbg (Member) commented Apr 15, 2020

The test sets up a race in which an RHS leaseholder that is about to get
merged away loses the lease while serving reads and writes.

The test arranged these reads and writes so that they would all reach at
least the span (latch) acquisition stage. However, requests could bounce
earlier, on the leaseholder check, which would deadlock the test.
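
(For illustration only, a minimal sketch of the deadlock-prone wait; reqs and
latchAcquired are hypothetical names, not the test's actual identifiers. If a
request is rejected by the leaseholder check before latching, the signal the
test blocks on never arrives.)

	// Sketch: a testing filter signals latchAcquired once a request reaches
	// latch acquisition. A request that bounces on the leaseholder check is
	// rejected before latching and never signals, so this receive blocks
	// forever.
	for i := 0; i < len(reqs); i++ {
		<-latchAcquired
	}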

Looking at the test, it's a little surprising that the replica sometimes
does not realize it is supposed to acquire a lease, but this is one of the
wonkier setups (multiTestContext plus a manual clock plus a maze of
interceptors), so it's not all that surprising after all and, as I decided,
not worth digging into. Instead, the fix is to tolerate that not all
requests make it to the latch, which seems like a sane assumption; the
test's correctness does not depend on it, it would just like it to happen
so that the intended regression scenario takes place.

Before this commit,

make stress PKG=./pkg/kv/kvserver/ TESTS=TestStoreRangeMergeRHSLeaseExpiration TESTTIMEOUT=1m 2>&1 | tee log

failed in ~1-2 minutes. Runs just fine for 10+ minutes now.

Closes #47503.

Release note: None

@cockroach-teamcity (Member)

This change is Reviewable

@tbg requested a review from andreimatei April 15, 2020 11:06
@andreimatei (Contributor) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


pkg/kv/kvserver/client_merge_test.go, line 1489 at r1 (raw file):

			// Latch acquired.
		case <-time.After(time.Second):
			// Requests may never make it to the latch acquisition if s1 has not

(if I understand correctly) Instead of this timeout, could we wait for a channel piping through the NotLeaseholderErrors that these requests are rejected with before getting to the latching?
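
(A rough sketch of the suggested pattern; sendReq, latchAcquired, and reqs are hypothetical names, and the code eventually merged may differ. Each request goroutine reports its outcome on an error channel, and the test then waits for every request to either reach latch acquisition or bounce early with a NotLeaseHolderError, instead of giving up after a fixed timeout.)

	// Sketch: every request goroutine reports its result on reqErrs.
	reqErrs := make(chan *roachpb.Error, len(reqs))
	for _, req := range reqs {
		go func(req roachpb.Request) {
			// sendReq is a hypothetical helper that issues req against the
			// RHS replica and returns the resulting *roachpb.Error (if any).
			reqErrs <- sendReq(req)
		}(req)
	}
	// Requests that make it to latch acquisition block on the ongoing merge
	// and are signalled by the testing filter; requests rejected earlier
	// return promptly, e.g. with a NotLeaseHolderError.
	for i := 0; i < len(reqs); i++ {
		select {
		case <-latchAcquired:
		case pErr := <-reqErrs:
			require.IsType(t, &roachpb.NotLeaseHolderError{}, pErr.GetDetail())
		}
	}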


pkg/kv/kvserver/client_merge_test.go, line 1506 at r1 (raw file):

	// Because the merge completed successfully, r2 has ceased to exist. We
	// therefore *must* see either a RangeNotFound or a NotLeaseHolderError

Can you please add some words here to explain how it is that either of these errors is expected?

@tbg (Member, Author) left a comment


@nvanbenschoten do I remember correctly that we (you) changed something about when we check the lease? I'm starting to think that this made this flaky, but I couldn't remember/find the PR that made that change.
The reason I think so is that the test seems to expect requests to acquire latches before hitting the leaseholder error.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

@nvanbenschoten (Member)

do I remember correctly that we (you) changed something about when we check the lease? I'm starting to think that this made this flaky, but I couldn't remember/find the PR that made that change.

Yes, that's correct. I wonder if that's also the cause of #47231. I'll look into that one.

@tbg force-pushed the TestStoreRangeMergeRHSLeaseExpiration branch 2 times, most recently from 12a0d39 to 9aad048 on April 20, 2020 09:46
@tbg requested a review from andreimatei April 20, 2020 09:47
@tbg (Member, Author) left a comment


PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/kvserver/client_merge_test.go, line 1489 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

(if I understand correctly) Instead of this timeout, could we wait for a channel piping through the NotLeaseholderErrors that these requests are rejected with before getting to the latching?

Done, thanks for the suggestion.


pkg/kv/kvserver/client_merge_test.go, line 1506 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Can you please add some words here to explain how it is that either of these errors is expected?

Done. Only RangeNotFound is expected (since NotLeaseHolderError is now caught before latch acquisition).

@blathers-crl (bot) commented Apr 20, 2020

❌ The GitHub CI (Cockroach) build has failed on 9aad0482.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@andreimatei (Contributor) left a comment


LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


pkg/kv/kvserver/client_merge_test.go, line 1514 at r2 (raw file):

	for pErr := range reqErrs {
		_, ok := pErr.GetDetail().(*roachpb.RangeNotFoundError)
		require.True(t, ok, "unexpected error %v", pErr)

require.IsType(*roachpb.RangeNotFoundErr)
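
(A minimal sketch of what the suggested assertion could look like using testify's require.IsType; illustrative only, not necessarily the exact code that was merged.)

	// Sketch: assert the error detail type directly instead of a manual
	// type assertion plus require.True.
	for pErr := range reqErrs {
		require.IsType(t, &roachpb.RangeNotFoundError{}, pErr.GetDetail(),
			"unexpected error %v", pErr)
	}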

@tbg force-pushed the TestStoreRangeMergeRHSLeaseExpiration branch from 9aad048 to 54605e1 on April 21, 2020 11:15
@tbg (Member, Author) commented Apr 21, 2020

Done, TFTR

bors r=andreimatei

@blathers-crl (bot) commented Apr 21, 2020

❌ The GitHub CI (Cockroach) build has failed on 54605e1d.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@craig bot (Contributor) commented Apr 21, 2020

Build succeeded

@craig bot merged commit 1eee08a into cockroachdb:master Apr 21, 2020
@tbg deleted the TestStoreRangeMergeRHSLeaseExpiration branch April 21, 2020 13:30
Development

Successfully merging this pull request may close the following issue:

kv: TestStoreRangeMergeRHSLeaseExpiration timeout under testrace

4 participants