Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix TestCreateStatisticsCanBeCancelled txn retry hang #109029

Merged
merged 1 commit into from
Aug 21, 2023

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented Aug 18, 2023

Previously, this test could hang if there was an automatic
stats came in concurrently with a manual stats collection,
where the request filter would end up hanging and being called twice.
To address this patch will disable automatic stats collections
on the table.

Fixes: #109007

Release note: None

@fqazi fqazi requested a review from a team August 18, 2023 18:22
@fqazi fqazi requested a review from a team as a code owner August 18, 2023 18:22
@fqazi fqazi requested review from mgartner and removed request for a team August 18, 2023 18:22
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@michae2 michae2 requested review from michae2 and removed request for mgartner August 18, 2023 18:24
@fqazi fqazi force-pushed the createStatsRetryFix branch 2 times, most recently from 3541485 to 3668aa4 Compare August 18, 2023 18:25
@rafiss rafiss added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 18, 2023
@fqazi fqazi force-pushed the createStatsRetryFix branch from 3668aa4 to ccfad8a Compare August 18, 2023 18:26
@rafiss
Copy link
Collaborator

rafiss commented Aug 18, 2023

#109007 is an issue for TestCreateStatisticsCanBeCancelled, but this PR title is talking about TestAtMostOneRunningCreateStats. do we have multiple flakes?

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 18, 2023

@rafiss This was a Stupid mistake on my part. I accidentally moved the code around before pushing this. Let me validate one more time, turning to draft temporarily.

@fqazi fqazi marked this pull request as draft August 18, 2023 18:29
@fqazi fqazi marked this pull request as ready for review August 18, 2023 18:45
@fqazi
Copy link
Collaborator Author

fqazi commented Aug 18, 2023

RFAL

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)


pkg/sql/stats/create_stats_job_test.go line 148 at r2 (raw file):

	sqlDB.Exec(t, `CREATE DATABASE d`)
	sqlDB.Exec(t, `CREATE TABLE d.t (x INT PRIMARY KEY)`)

Let's create this table with WITH (sql_stats_automatic_collection_enabled = false). That will prevent auto stats from running.


pkg/sql/stats/create_stats_job_test.go line 164 at r2 (raw file):

	// If we end up ever retrying the txn, we should not block on
	// the allow request channel again.
	setTableID(descpb.InvalidID)

Which txn is retrying? I'm not sure it's possible for the statistics collection txn to retry if we're waiting in the store testing knob. But maybe there's a race between automatic stats collection for s1 and manual stats collection?

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 18, 2023

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)

pkg/sql/stats/create_stats_job_test.go line 148 at r2 (raw file):

	sqlDB.Exec(t, `CREATE DATABASE d`)
	sqlDB.Exec(t, `CREATE TABLE d.t (x INT PRIMARY KEY)`)

Let's create this table with WITH (sql_stats_automatic_collection_enabled = false). That will prevent auto stats from running.

pkg/sql/stats/create_stats_job_test.go line 164 at r2 (raw file):

	// If we end up ever retrying the txn, we should not block on
	// the allow request channel again.
	setTableID(descpb.InvalidID)

Which txn is retrying? I'm not sure it's possible for the statistics collection txn to retry if we're waiting in the store testing knob. But maybe there's a race between automatic stats collection for s1 and manual stats collection?

Looking at the stack again, it is definitely auto stats collection fighting the manual one. Oh good point, yeah the txn probably will be AOST in that case too. So, that makes a lot more sense here (disabling the knob is probably a good idea to be paranoid still).

@fqazi fqazi changed the title sql: fix TestAtMostOneRunningCreateStats txn retry hang sql: fix TestCreateStatisticsCanBeCancelled txn retry hang Aug 18, 2023
@fqazi fqazi force-pushed the createStatsRetryFix branch 2 times, most recently from f8b9d82 to 4cef84b Compare August 18, 2023 20:10
@fqazi fqazi requested a review from michae2 August 18, 2023 20:10
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thanks for the quick fix! 🙂

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi)

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 18, 2023

@michae2 TFTR!

@rafiss
Copy link
Collaborator

rafiss commented Aug 18, 2023

hm seems like CI failed on the test again

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi)


pkg/sql/stats/create_stats_job_test.go line 131 at r3 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	skip.UnderStress(t, "test can be slow to quiesce because of filter")

Let's unskip this and make sure it passes under stress.


pkg/sql/stats/create_stats_job_test.go line 163 at r3 (raw file):

	// If we end up ever retrying the txn, we should not block on
	// the allow request channel again.
	setTableID(descpb.InvalidID)

Statistics collection runs in parallel. Maybe multiple scans hit the knob at once?


pkg/sql/stats/create_stats_job_test.go line 174 at r3 (raw file):

	})
	err := <-errCh
	allowRequest <- struct{}{}

Instead of just waiting for cancellation at this point, maybe we need another SucceedsSoon loop that is repeatedly sending to allowRequest and checking errCh in a select.

@fqazi fqazi force-pushed the createStatsRetryFix branch from 4cef84b to 9071d57 Compare August 21, 2023 15:17
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)


pkg/sql/stats/create_stats_job_test.go line 131 at r3 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Let's unskip this and make sure it passes under stress.

@michae2 I did that and switched to post that channel in a repeatedly succeeds soon. Let's see if CI is happy under stress.

@fqazi fqazi force-pushed the createStatsRetryFix branch from 9071d57 to a8bb526 Compare August 21, 2023 16:53
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, it passed! Some comments but :lgtm:

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi)


pkg/sql/stats/create_stats_job_test.go line 152 at r5 (raw file):

	setTableID(tID)

	// Run CREATE STATISTICS and wait for to create the job.

nit: maybe "Run CREATE STATISTICS and wait for it to create the job."


pkg/sql/stats/create_stats_job_test.go line 170 at r5 (raw file):

		return err
	})
	// Allow the filter to pass everything an error is received.

nit: maybe "Allow the filter to pass everything until an error is received."


pkg/sql/stats/create_stats_job_test.go line 180 at r5 (raw file):

				return nil
			case allowRequest <- struct{}{}:
			}

Seems like there should be a default case so that we return control to the SucceedsSoon if neither channel is ready?

Previously, this test could hang if there was an automatic
stats came in concurrently with a manual stats collection,
where the request filter would end up hanging and being called twice.
To address this patch will disable automatic stats collections
on the table.

Fixes: cockroachdb#109007
Release note: None
@fqazi fqazi force-pushed the createStatsRetryFix branch from a8bb526 to 0d7ad67 Compare August 21, 2023 19:08
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 21, 2023

@michae2 TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 21, 2023

Build succeeded:

@blathers-crl
Copy link

blathers-crl bot commented Aug 21, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-109029 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/109187/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 0d7ad67 to blathers/backport-release-23.1.9-rc-109029: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.9-rc failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 22, 2023

blathers backport release-23.1.9-rc

@blathers-crl
Copy link

blathers-crl bot commented Aug 22, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1.9-rc-109029 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/109231/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch release-23.1.9-rc failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql/stats: TestCreateStatisticsCanBeCancelled failed
4 participants