kv: TestStoreRangeMergeRHSLeaseExpiration timeout under testrace #47503
Repros in $TESTTIMEOUT via `make stress PKG=./pkg/kv/kvserver/ TESTS=TestStoreRangeMergeRHSLeaseExpiration TESTTIMEOUT=5m 2>&1 | tee log` (5m is too high; it can probably be set to 1m to see the failure faster).
The test starts a merge, invalidates the RHS lease, and sends 10 gets to a magic key. The main goroutine of the test waits for all 10 latches to be acquired, but this isn't happening. I can see that all 10 gets return NotLeaseholderError. I think this is just the case in which the RHS loses the lease before the gets arrive, so they never acquire any latches and instead bounce off the leaseholder check. The main goroutine is then deadlocked reading from a channel that will never be written to. Going to confirm and send out a fix.
Ah, close, but I got it wrong. The test wants all of the requests to go to the newly minted leaseholder n1 (the lease on n2 was just expired as part of the test), so that they get stuck in a latch trap there. But sometimes some of the requests arrive at n1 before it has the lease, so they bounce early without acking the trap channel, and the test deadlocks. I suppose I can find a way to fix that.
The test sets up a race where a RHS leaseholder that is about to get merged away loses the lease while serving reads and writes. The test arranged these reads and writes so that they would all reach at least the span acquisition stage. However, requests could bounce early on the leaseholder check (which we've moved around recently), which would deadlock the test.

Looking at the test, it's a little surprising that the replica sometimes does not realize it is supposed to acquire a lease, but this is one of the wonkier setups (multiTestContext plus a manual clock plus a maze of interceptors), so it's not too surprising and, as I decided, not worth looking into. Instead, the fix is to tolerate that not all requests will make it to the latch, which seems like a sane assumption: the test does not rely on it, it would just like it to happen so that the intended regression test takes place.

Before this commit, `make stress PKG=./pkg/kv/kvserver/ TESTS=TestStoreRangeMergeRHSLeaseExpiration TESTTIMEOUT=1m 2>&1 | tee log` failed in ~1-2 minutes. It runs just fine for 10+ minutes now.

Closes cockroachdb#47503.

Release note: None
47508: kvserver: de-flake TestStoreRangeMergeRHSLeaseExpiration r=andreimatei a=tbg

Closes #47503. Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Something got stuck for 45 minutes. The log only has periodic `StorePool missing descriptor for local store` messages.

https://teamcity.cockroachdb.com/viewLog.html?buildId=1876754&buildTypeId=Cockroach_MergeToMaster

Attaching the logs: TestStoreRangeMergeRHSLeaseExpiration.txt
@tbg any chance you can look at this one?