backup: add wait to flaky test #40952
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang)
pkg/ccl/backupccl/backup_test.go, line 1898 at r1 (raw file):
// TODO(pbardea): Without this wait we see this test falke under stress.
// See #40951 for the tracking issue.
for i := 0; i < 5; i++ {
https://godoc.org/github.com/cockroachdb/cockroach/pkg/testutils#SucceedsSoon ?
Force-pushed from 891b407 to a5e6f9e
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang)
pkg/ccl/backupccl/backup_test.go, line 1898 at r1 (raw file):
Previously, ajwerner wrote…
https://godoc.org/github.com/cockroachdb/cockroach/pkg/testutils#SucceedsSoon ?
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang and @pbardea)
pkg/ccl/backupccl/backup_test.go, line 1896 at r2 (raw file):
atomic.StoreInt32(&allowErrors, 0)
// TODO(pbardea): Without this wait we see this test falke under stress.
nit: falke -> flake
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang and @pbardea)
pkg/ccl/backupccl/backup_test.go, line 1896 at r2 (raw file):
atomic.StoreInt32(&allowErrors, 0)
// TODO(pbardea): Without this wait we see this test falke under stress.
Did you experiment with waiting for there to be a single lease and making sure that that lease was on the newest version? If that doesn’t work it indicates a bug. I guess in theory this already indicates a bug. IIUC we shouldn’t be able to lease an offline table descriptor. Perhaps it has something to do with joining the single flight at an older timestamp when reading the descriptor that isn’t cached.
This test has been flaky because the TableDescriptor being read is being seen as OFFLINE. This issue seems to disappear after waiting a bit. This also happens during IMPORT and this issue was avoided in a similar way (see cockroachdb#36832). This should be removed and cockroachdb#40951 is tracking this issue.

Release justification: Work-around for a flaky test.

Release note: None
Force-pushed from a5e6f9e to e083fbf
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @lucy-zhang)
pkg/ccl/backupccl/backup_test.go, line 1896 at r2 (raw file):
Previously, ajwerner wrote…
Did you experiment with waiting for there to be a single lease and making sure that that lease was on the newest version? If that doesn’t work it indicates a bug. I guess in theory this already indicates a bug. IIUC we shouldn’t be able to lease an offline table descriptor. Perhaps it has something to do with joining the single flight at an older timestamp when reading the descriptor that isn’t cached.
So it looks like WaitForSingleVersion is supposed to do what we want here; however, when I was experimenting with calling it on every table, the test would block and never have a successful run. Thinking about it now, it may be because the concurrent updates are constantly refreshing their leases (I'm not sure whether we allow lease renewals once a new table descriptor has been written). I can check whether turning off the concurrent reads unblocks the test with that change.
Branch I was experimenting on: https://github.com/pbardea/cockroach/tree/td-version-waiting
I'm not sure that's all that useful. It might be nice to know which leases you're blocked on. It seems to me that refreshed leases are okay and shouldn't violate the single-version property. Maybe we want to just wait for the old version to no longer be leased. That doesn't guarantee that the node on which you're issuing the new query will have a valid lease for the new version, but it should soon. Maybe we should wait until the gateway node issuing the query has a valid lease on the new version and no lease on the old version? This is a thread I think should eventually be pulled on; it'd be good to understand just how sane or insane our caching and leasing code is.
Looks like there was an issue with interactions between OFFLINE tables and the leasing logic. Closing this PR in favour of fixing the underlying issue in #40996.