Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backup: add wait to flaky test #40952

Closed

Conversation

pbardea
Copy link
Contributor

@pbardea pbardea commented Sep 20, 2019

This test has been flaky because the TableDescriptor being read is being
seen as OFFLINE. This issue seems to disappear after waiting a bit. This
also happens during IMPORT and this issue was avoided in a similar way
(see #36832).

This should be removed and #40951 is tracking this issue.

Release justification: Work-around for a flaky test.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang)


pkg/ccl/backupccl/backup_test.go, line 1898 at r1 (raw file):

	// TODO(pbardea): Without this wait we see this test falke under stress.
	// See #40951 for the tracking issue.
	for i := 0; i < 5; i++ {

https://godoc.org/github.com/cockroachdb/cockroach/pkg/testutils#SucceedsSoon ?

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang and @pbardea)


pkg/ccl/backupccl/backup_test.go, line 1896 at r2 (raw file):

	atomic.StoreInt32(&allowErrors, 0)

	// TODO(pbardea): Without this wait we see this test falke under stress.

nit: falke -> flake

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang and @pbardea)


pkg/ccl/backupccl/backup_test.go, line 1896 at r2 (raw file):

	atomic.StoreInt32(&allowErrors, 0)

	// TODO(pbardea): Without this wait we see this test falke under stress.

Did you experiment with waiting for there to be a single lease and making sure that that lease was on the newest version? If that doesn’t work it indicates a bug. I guess in theory this already indicates a bug. IIUC we shouldn’t be able to lease an offline table descriptor. Perhaps it has something to do with joining the single flight at an older timestamp when reading the descriptor that isn’t cached.

This test has been flaky because the TableDescriptor being read is being
seen as OFFLINE. This issue seems to disappear after waiting a bit. This
also happens during IMPORT and this issue was avoided in a similar way
(see cockroachdb#36832).

This should be removed and cockroachdb#40951 is tracking this issue.

Release justification: Work-around for a flaky test.

Release note: None
Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @lucy-zhang)


pkg/ccl/backupccl/backup_test.go, line 1896 at r2 (raw file):

Previously, ajwerner wrote…

Did you experiment with waiting for there to be a single lease and making sure that that lease was on the newest version? If that doesn’t work it indicates a bug. I guess in theory this already indicates a bug. IIUC we shouldn’t be able to lease an offline table descriptor. Perhaps it has something to do with joining the single flight at an older timestamp when reading the descriptor that isn’t cached.

So it looks like WaitForSingleVersion is supposed to do what we want here, however when I was experimenting with calling this on every table, the test would block and not have any successful runs. Thinking about it now it may be because of the concurrent updates constantly refreshing their leases (I'm not sure if we allow lease renewals when a new table descriptor has been written)? I can see if turning off the concurrent reads unblocks the test with that change.

Branch I was experimenting on: https://github.com/pbardea/cockroach/tree/td-version-waiting

@ajwerner
Copy link
Contributor

I can see if turning off the concurrent reads unblocks the test with that change.

I'm not sure that's all that useful. It might be nice to know what leases you're blocked on. It seems to me that refreshed leases are okay and shouldn't violate the single version property.

Maybe we want to just wait for the old version to no longer be leased. That doesn't guarantee that the node on which you're issuing the new query will have a valid lease for the new version but it should soon. Maybe we should watch until the gw node issuing the query has a valid lease on the new version and no lease on the old version? This is a thread I think should eventually be pulled out. It'd be good to understand just how sane or insane our caching and leasing code is.

@pbardea
Copy link
Contributor Author

pbardea commented Sep 23, 2019

Looks like there was an issue with interactions between OFFLINE tables and the leasing logic. Closing this PR in favour of fixing the underlying issue in #40996.

@pbardea pbardea closed this Sep 23, 2019
@pbardea pbardea deleted the test-concurrent-write-wait branch November 22, 2019 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants