Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: de-flake TestReplicateQueueUpAndDownReplicateNonVoters #75212

Conversation

irfansharif
Copy link
Contributor

Fixes #75135. This test asserted on span configs applying to a scratch
range. When stressing, it appeared that some time we were not seeing the
scratch range adopt the prescribed number of voters/non-voters. Staring
at the test itself, we were only nudging the replication queues for the
first node in the three node test. It's possible for the scratch range
to have been housed on a node other than the first; this commit makes it
so that the test nudges queues on all nodes. For good measure, lets also
ensure that the split queues process everything, ditto for the snapshot
queues.

To repro:

dev test pkg/kv/kvserver \
  -f TestReplicateQueueUpAndDownReplicateNonVoters \
  -v --show-logs --timeout 2m --stress

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @arulajmani, and @irfansharif)


pkg/kv/kvserver/replicate_queue_test.go, line 322 at r1 (raw file):

		scratchRange.GetRangeID())
	if err := row.Scan(&numNonVoters); err != nil {
		if testutils.IsError(err, "converting NULL to int is unsupported") {

nit: SELECT coalesce(max(array_length(non_voting_replicas, 1)),0) FROM crdb_internal.ranges_no_leases WHERE range_id=$1 then you don't have to handle the NULL result.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @irfansharif)

Fixes cockroachdb#75135. This test asserted on span configs applying to a scratch
range. When stressing, it appeared that some time we were not seeing the
scratch range adopt the prescribed number of voters/non-voters. Staring
at the test itself, we were only nudging the replication queues for the
first node in the three node test. It's possible for the scratch range
to have been housed on a node other than the first; this commit makes it
so that the test nudges queues on all nodes. For good measure, lets also
ensure that the split queues process everything, ditto for the snapshot
queues.

To repro:

    dev test pkg/kv/kvserver \
      -f TestReplicateQueueUpAndDownReplicateNonVoters \
      -v --show-logs --timeout 2m --stress

Release note: None
@irfansharif irfansharif force-pushed the 220120.deflake-TestReplicateQueueUpAndDownReplicateNonVoters branch from f85d165 to ee1e9cd Compare January 20, 2022 15:52
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @arulajmani, and @tbg)


pkg/kv/kvserver/replicate_queue_test.go, line 322 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: SELECT coalesce(max(array_length(non_voting_replicas, 1)),0) FROM crdb_internal.ranges_no_leases WHERE range_id=$1 then you don't have to handle the NULL result.

Done.

Copy link
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @irfansharif, and @tbg)


pkg/kv/kvserver/replicate_queue_test.go, line 319 at r2 (raw file):

		`SELECT coalesce(max(array_length(non_voting_replicas, 1)),0) FROM crdb_internal.ranges_no_leases WHERE range_id=$1`,
		scratchRange.GetRangeID())
	require.NoError(t, row.Scan(&numNonVoters))

The log.Warning in the previous version was intentional because this function is being called within a SucceedsSoon.

@aayushshah15
Copy link
Contributor

It's possible for the scratch range
to have been housed on a node other than the first;

Maybe this should've been codified in a comment somewhere but since there's only one voting replica in this test, the leaseholder can only be on that one voter. If this test was flakey before, I'd suspect it to be flakey now.

@aayushshah15
Copy link
Contributor

Do you mind addressing some of these comments before letting this merge? Halting the merge for now.

bors r-

@craig
Copy link
Contributor

craig bot commented Jan 20, 2022

Canceled.

Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible for the scratch range to have been housed on a node other than the first;

since there's only one voting replica in this test, the leaseholder can only be on that one voter

I'm again not following, these two things don't appear to contradict? It was possible for the leaseholder replica (also the voter) to be on a node other than the first; there are three nodes in the test. The test was previously only forcing queue processes on just the first node, but now it does all three. This at least what I was seeing when repro-ing/looking at logs.

If this test was flakey before, I'd suspect it to be flakey now.

Possible, but I'm not sure if it'd be because of the flake that this patch was squashing. FWIW I did run a bunch of test runs before and after this commit to sanity check my claim.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @arulajmani, and @tbg)


pkg/kv/kvserver/replicate_queue_test.go, line 319 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

The log.Warning in the previous version was intentional because this function is being called within a SucceedsSoon.

I'm not following, this current form works just fine with the SucceedsSoon block. Previously if there were no non-voters, this would pollute the test output with "converting NULL to int is unsupported" errors because of the query and still returning numNonVoters == 0. Now it doesn't, and just returns to the caller with zero. It's functionally no different. At the caller you'll still get the SucceedsSoon warnings that num voters are not what the test expects. Were you perhaps getting at something else I missed?

@irfansharif
Copy link
Contributor Author

irfansharif commented Jan 20, 2022

Just ran under stress before and after the commit a few more times on my GCE worker; reliably fails within a minute before (tried 3 more times), still running with success after this commit 15m later.

@aayushshah15
Copy link
Contributor

Discussed offline, my bad for the false alarm. This all LGTM.

@irfansharif
Copy link
Contributor Author

Thanks!

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 20, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 20, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 21, 2022

Build succeeded:

@craig craig bot merged commit 9e22d9d into cockroachdb:master Jan 21, 2022
@irfansharif irfansharif deleted the 220120.deflake-TestReplicateQueueUpAndDownReplicateNonVoters branch January 21, 2022 01:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv/kvserver: TestReplicateQueueUpAndDownReplicateNonVoters failed
5 participants