kvserver: de-flake TestStoreRangeMergeRHSLeaseExpiration #47508
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)
pkg/kv/kvserver/client_merge_test.go, line 1489 at r1 (raw file):
	// Latch acquired.
case <-time.After(time.Second):
	// Requests may never make it to the latch acquisition if s1 has not
(if I understand correctly) Instead of this timeout, could we wait on a channel that pipes through the NotLeaseholderErrors these requests are rejected with before getting to the latching?
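A rough, self-contained sketch of that shape (all names here, such as errNotLeaseholder and latchAcquired, are illustrative assumptions, not the test's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotLeaseholder stands in for kvserver's NotLeaseHolderError.
var errNotLeaseholder = errors.New("not leaseholder")

func main() {
	latchAcquired := make(chan struct{}) // closed once a request reaches latching
	reqErrs := make(chan error, 3)       // requests report early rejections here

	for i := 0; i < 3; i++ {
		go func(i int) {
			if i == 0 {
				close(latchAcquired) // this request made it to latching
				return
			}
			reqErrs <- errNotLeaseholder // this one bounced on the lease check
		}(i)
	}

	// Instead of a bare time.After timeout, block on a definite signal:
	// either a request reached the latch or it was rejected beforehand.
	select {
	case <-latchAcquired:
		fmt.Println("latch acquired; proceed with the merge race")
	case err := <-reqErrs:
		fmt.Println("request rejected before latching:", err)
	case <-time.After(10 * time.Second):
		panic("no signal at all; something is broken")
	}
}
```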
pkg/kv/kvserver/client_merge_test.go, line 1506 at r1 (raw file):
// Because the merge completed successfully, r2 has ceased to exist. We
// therefore *must* see either a RangeNotFound or a NotLeaseHolderError
Can you please add some words here to explain how it is that either of these errors is expected?
@nvanbenschoten do I remember correctly that we (you) changed something about when we check the lease? I'm starting to think that change is what made this test flaky, but I couldn't remember/find the PR that made it.
The reason I think that is that the test seems to expect requests to acquire latches before hitting the leaseholder error.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
Yes, that's correct. I wonder if that's also the cause of #47231. I'll look into that one.
Force-pushed 12a0d39 to 9aad048
PTAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvserver/client_merge_test.go, line 1489 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
(if I understand correctly) Instead of this timeout, could we wait on a channel that pipes through the NotLeaseholderErrors these requests are rejected with before getting to the latching?
Done, thanks for the suggestion.
pkg/kv/kvserver/client_merge_test.go, line 1506 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Can you please add some words here to explain how it is that either of these errors is expected?
Done. Only RangeNotFound is expected now, since the NotLeaseHolderError case is caught before latching and so never reaches this check.
❌ The GitHub CI (Cockroach) build has failed on 9aad0482. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)
pkg/kv/kvserver/client_merge_test.go, line 1514 at r2 (raw file):
for pErr := range reqErrs {
	_, ok := pErr.GetDetail().(*roachpb.RangeNotFoundError)
	require.True(t, ok, "unexpected error %v", pErr)
nit: require.IsType(*roachpb.RangeNotFoundError) would read a bit cleaner here.
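For reference, the testify form being suggested would look roughly like this (reqErrs and the roachpb types come from the quoted snippet; the function wrapper is just for illustration):

```go
package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/stretchr/testify/require"
)

// checkRangeNotFound asserts that every error drained from reqErrs is a
// RangeNotFoundError, mirroring the require.True/type-assertion pair above.
func checkRangeNotFound(t *testing.T, reqErrs <-chan *roachpb.Error) {
	for pErr := range reqErrs {
		require.IsType(t, &roachpb.RangeNotFoundError{}, pErr.GetDetail(),
			"unexpected error %v", pErr)
	}
}
```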
The test sets up a race where an RHS leaseholder that is about to
get merged away loses the lease while serving reads and writes.
The test set up these reads and writes so that they would all get to at
least the span acquisition stage. However, requests could bounce early
on the leaseholder check (which we've moved around recently), which
would deadlock the test.

Looking at the test, it's a little surprising that the replica sometimes
does not realize it is supposed to get a lease, but this is one of the
wonkier setups (multiTestContext plus a manual clock plus a maze of
interceptors), so perhaps it's not that surprising after all and, as I
decided, not worth looking into. Instead, the fix is to tolerate that
not all requests will make it to the latch, which seems like a sane
assumption; the test does not rely on it, it would just like it to
happen so that the intended regression test takes place.

Before this commit,

> make stress PKG=./pkg/kv/kvserver/ TESTS=TestStoreRangeMergeRHSLeaseExpiration TESTTIMEOUT=1m 2>&1 | tee log

failed in ~1-2 minutes. It now runs just fine for 10+ minutes.

Closes cockroachdb#47503.

Release note: None
Force-pushed 9aad048 to 54605e1
Done, TFTR.
bors r=andreimatei
❌ The GitHub CI (Cockroach) build has failed on 54605e1d. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
Build succeeded
The test sets up a race where an RHS leaseholder that is about to
get merged away loses the lease while serving reads and writes.
The test set up these reads and writes so that they would all get to at
least the span acquisition stage. However, requests could bounce early
on the leaseholder check, which would deadlock the test.

Looking at the test, it's a little surprising that the replica sometimes
does not realize it is supposed to get a lease, but this is one of the
wonkier setups (multiTestContext plus a manual clock plus a maze of
interceptors), so perhaps it's not that surprising after all and, as I
decided, not worth looking into. Instead, the fix is to tolerate that
not all requests will make it to the latch, which seems like a sane
assumption; the test does not rely on it, it would just like it to
happen so that the intended regression test takes place.

Before this commit,

> make stress PKG=./pkg/kv/kvserver/ TESTS=TestStoreRangeMergeRHSLeaseExpiration TESTTIMEOUT=1m 2>&1 | tee log

failed in ~1-2 minutes. It now runs just fine for 10+ minutes.
Closes #47503.
Release note: None
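As an illustration of the "tolerate early bounces" shape the fix describes (names like numReqs, latchAcquired, and reqErrs are assumptions, not the actual test code):

```go
package kvserver_test

import "testing"

// waitForLatchesOrBounces drains one event per request: either the request
// reached latching or it bounced early with an error. The test only needs
// *some* requests to latch; early bounces are tolerated, not fatal.
func waitForLatchesOrBounces(t *testing.T, numReqs int,
	latchAcquired <-chan struct{}, reqErrs <-chan error) (latched int) {
	for i := 0; i < numReqs; i++ {
		select {
		case <-latchAcquired:
			latched++
		case err := <-reqErrs:
			t.Logf("request bounced before latching (tolerated): %v", err)
		}
	}
	return latched
}
```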