
kv: TestStoreRangeMergeRHSLeaseExpiration timeout under testrace #47503

Closed
andreimatei opened this issue Apr 15, 2020 · 3 comments · Fixed by #47508
Comments

andreimatei commented Apr 15, 2020

Something got stuck for 45 minutes.

The log only contains periodic "StorePool missing descriptor for local store" messages.

https://teamcity.cockroachdb.com/viewLog.html?buildId=1876754&buildTypeId=Cockroach_MergeToMaster

Attaching the logs
TestStoreRangeMergeRHSLeaseExpiration.txt

@tbg any chance you can look at this one?

@andreimatei andreimatei added the C-test-failure Broken test (automatically or manually discovered). label Apr 15, 2020

tbg commented Apr 15, 2020

Repros in $TESTTIMEOUT via

make stress PKG=./pkg/kv/kvserver/ TESTS=TestStoreRangeMergeRHSLeaseExpiration TESTTIMEOUT=5m 2>&1 | tee log

(5m is more than needed; setting TESTTIMEOUT=1m will probably reproduce it faster.)


tbg commented Apr 15, 2020

The test starts a merge, invalidates the RHS lease, and sends 10 gets to a magic key. The main goroutine of the test waits for all 10 latches to be acquired, but this isn't happening. I can see that all 10 gets return NotLeaseholderError. I think this might just be the case in which the RHS loses the lease before the gets arrive, in which case they never acquire any latches but just bounce off the leaseholder check. The main goroutine is then deadlocked reading from a channel that will never get written to. Going to confirm and send out a fix.
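
As a rough illustration of the deadlock, here is a hypothetical, heavily simplified sketch in plain Go; it is not CockroachDB's actual test code, and `notLeaseholder`, `latchAcquired`, and `numGets` are made-up names for the example. Each request is supposed to signal on a channel once it acquires its latch, and the main goroutine blocks until it has received all of the signals; requests that bounce off the leaseholder check never signal, so the final receive loop blocks forever.

```go
// Hypothetical sketch of the synchronization pattern described above,
// not CockroachDB's actual test code.
package main

import "fmt"

const numGets = 10

func main() {
	latchAcquired := make(chan struct{})

	for i := 0; i < numGets; i++ {
		go func(i int) {
			// Stand-in for request evaluation: if the replica is not the
			// leaseholder, the request returns NotLeaseholderError early
			// and never reaches the latch acquisition below.
			if notLeaseholder(i) {
				return
			}
			latchAcquired <- struct{}{} // signal: this request acquired its latch
		}(i)
	}

	// Deadlocks if any request bounced early: fewer than numGets signals
	// will ever arrive, so this loop never completes.
	for i := 0; i < numGets; i++ {
		<-latchAcquired
	}
	fmt.Println("all latches acquired")
}

// notLeaseholder stands in for the lease check; in the flaky runs, some
// requests hit it before the replica has acquired the lease.
func notLeaseholder(i int) bool { return i%3 == 0 }
```

In this toy version the Go runtime reports "all goroutines are asleep - deadlock!" once the bounced requests finish; in the real multi-node test the main goroutine simply hangs until the 45-minute timeout.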


tbg commented Apr 15, 2020

Ah, close but I got it wrong. The test wants all of the requests to go to the newly minted leaseholder n1 (the lease on n2 was just expired as part of the test), so that they get stuck in a latch trap there. But sometimes some of the requests arrive at n1 before it has the lease, so it bounces early without acking the trap channel, and the test deadlocks. I suppose I can find a way to fix that.
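
One way to make the synchronization tolerant of early bounces, purely as a hypothetical sketch and not necessarily what the eventual fix in #47508 does: have every request report an outcome, whether it reached the latch or bounced early on the lease check, and have the main goroutine drain all outcomes instead of insisting on a fixed number of latch acquisitions. The names (`outcome`, `notLeaseholder`, `numGets`) are again made up for the example.

```go
// Hypothetical sketch of a bounce-tolerant version of the same pattern.
package main

import "fmt"

const numGets = 10

type outcome int

const (
	reachedLatch outcome = iota
	bouncedEarly
)

func main() {
	outcomes := make(chan outcome, numGets)

	for i := 0; i < numGets; i++ {
		go func(i int) {
			if notLeaseholder(i) {
				outcomes <- bouncedEarly // early NotLeaseholderError, never reaches the latch
				return
			}
			outcomes <- reachedLatch // would sit in the latch trap here
		}(i)
	}

	// Wait for every request to report, whichever path it took.
	trapped := 0
	for i := 0; i < numGets; i++ {
		if <-outcomes == reachedLatch {
			trapped++
		}
	}
	fmt.Printf("%d of %d requests reached the latch\n", trapped, numGets)
}

func notLeaseholder(i int) bool { return i%3 == 0 }
```

The interesting regression scenario is still exercised whenever at least some requests reach the latch, but the test no longer deadlocks when some of them bounce on the lease check first.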

tbg added a commit to tbg/cockroach that referenced this issue Apr 15, 2020
The test sets up a race in which an RHS leaseholder that is about to
get merged away loses its lease while serving reads and writes.

The test set up these reads and writes so that they would all reach at
least the span acquisition stage. However, requests could bounce early
on the leaseholder check, which would deadlock the test.

Looking at the test, it's a little surprising that the replica sometimes
does not realize it is supposed to acquire the lease, but this test is
one of the wonkier combinations of multiTestContext, a manual clock, and
a maze of interceptors, so the behavior isn't too surprising and, I
decided, isn't worth digging into. Instead, the fix is to tolerate that
not all requests will make it to the latch, which seems like a sane
assumption: the test does not rely on it, it would merely like it to
happen so that the intended regression scenario is exercised.

Before this commit,

> make stress PKG=./pkg/kv/kvserver/ TESTS=TestStoreRangeMergeRHSLeaseExpiration TESTTIMEOUT=1m 2>&1 | tee log

failed within ~1-2 minutes; with this change, the test runs fine for 10+ minutes.

Closes cockroachdb#47503.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 20, 2020
craig bot pushed a commit that referenced this issue Apr 21, 2020
47508: kvserver: de-flake TestStoreRangeMergeRHSLeaseExpiration r=andreimatei a=tbg

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
craig bot closed this as completed in 54605e1 on Apr 21, 2020