Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiregionccl: flake in TestColdStartLatency due to timeout during user login #96334

Closed
knz opened this issue Feb 1, 2023 · 13 comments · Fixed by #98368 or #103666
Closed

multiregionccl: flake in TestColdStartLatency due to timeout during user login #96334

knz opened this issue Feb 1, 2023 · 13 comments · Fixed by #98368 or #103666
Assignees
Labels
A-authentication Pertains to authn subsystems branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test T-serverless

Comments

@knz
Copy link
Contributor

knz commented Feb 1, 2023

This is different from previous flakes and refers to the user account retrieval during session authentication.

Found here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/8537864?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

          Error:        Received unexpected error:
                        failed to connect to `host=127.0.0.1 user=foo database=defaultdb`: server error (ERROR: internal error while retrieving user account memberships: operation "get-user-timeout" timed out after 10.002s (given timeout 10s): internal error while retrieving user account: aborted during DistSender.Send: context deadline exceeded (SQLSTATE 28000))

cc @rafiss for triage

Jira issue: CRDB-24073

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) A-authentication Pertains to authn subsystems labels Feb 1, 2023
@rafiss
Copy link
Collaborator

rafiss commented Feb 1, 2023

cc @ajwerner as well, in case it relates to some of the changes made for cold starts

@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Feb 2, 2023
@renatolabs
Copy link
Contributor

ajwerner added a commit to ajwerner/cockroach that referenced this issue Feb 2, 2023
@ajwerner
Copy link
Contributor

ajwerner commented Feb 2, 2023

Let's skip it. It does repro relatively readily and I haven't been able to debug it. Here's a PR #96464

@ajwerner ajwerner added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 GA-blocker labels Feb 2, 2023
@ajwerner ajwerner self-assigned this Feb 2, 2023
craig bot pushed a commit that referenced this issue Feb 3, 2023
96464: multiregionccl: skip TestColdStartLatency r=ajwerner a=ajwerner

See #96334

Epic: None

Release note: None

96466: sql,auth: fix broken NOSQLLOGIN global privilege r=ecwall a=rafiss

fixes #96465

In addition to the bug fix, this cleans up some code so we don't have to construct planners on the fly during authentication.

Tests were added, since the initial PR that added the NOSQLLOGIN global privilege did not have any tests.

Release note (bug fix): The global NOSQLLOGIN privilege had a bug that made CockroachDB ignore it entirely, so it had no effect. This is now fixed. The bug was introduced in v22.2.0-alpha.1. The NOSQLLOGIN role option is unaffected by this bug.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Mar 1, 2023
ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 10, 2023
This test was flakey due to the closed timestamp sometimes not leading far
for global tables due to overload, and due to a cached liveness reader
not being used in distsql. The former was fixed in previous commits. The
latter is fixed here.

Fixes: cockroachdb#96334

Release note: None
craig bot pushed a commit that referenced this issue Mar 14, 2023
98368: multiregionccl,server: use cached sqlliveness.Reader, deflake ColdStartLatencyTest r=ajwerner a=ajwerner

#### multitenantccl: deflake ColdStartLatencyTest
This test was flakey due to the closed timestamp sometimes not leading far
for global tables due to overload, and due to a cached liveness reader
not being used in distsql. The former was fixed in previous commits. The
latter is fixed here.

Fixes: #96334


#### sql: use CachedReader for uses with sqlinstance and the sql builtins
The CachedReader won't block, which in multi-region clusters is good. It will
mean that in some cases, it'll state that a sessions is alive when it most
certainly is not. Currently, nobody needs synchronous semantics.

This is a major part of fixing the TestColdStartLatency as sometimes
distsql planning would block. That's not acceptable -- the idea that
query physical planning can need to wait for a cross-region RPC is
unacceptable.

#### sqlliveness: re-arrange APIs to clarify when the API was blocking

By "default" the implementation of Reader was blocking and there was a method
to get a handle to a non-blocking CachedReader(). This asymmetry does not aid
understandability. The non-blocking reader came later, but it is generally the
more desirable interface.

Release note: None

98519: storage: More CheckSSTConflicts fixes r=erikgrinaker a=itsbilal

A few additional fixes around CheckSSTConflicts, stats calculations, and Next()ing logic, caught by kvnemesis. Hopefully the last of its kind.

Also re-enable kvnemesis testing for range keys in AddSSTable, reverting #98475.

Fixes #94141.
Fixes #98473.
Informs #94876.

Epic: none

Release note: None

98567: backupccl: use correct version gate for restore checkpointing r=adityamaru a=msbutler

PR #97862 introduced a subtle bug which allowed the new restore checkpointing policy to take effect before the 23_1 migrations occured. This patch ensures the new policy only takes effect after all migrations occur.

Release note: None

Epic: None

Co-authored-by: ajwerner <awerner32@gmail.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@craig craig bot closed this as completed in a902d6a Mar 14, 2023
@ajwerner ajwerner reopened this Mar 14, 2023
ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 14, 2023
Epic: none

Informs cockroachdb#96334

Release note: None
@ajwerner
Copy link
Contributor

ajwerner commented Apr 4, 2023

I'm removing the GA-blocker label.

@arulajmani
Copy link
Collaborator

@arulajmani any chance you've been sniped into wanting to look into this?

Enough to atleast give the theory we have above a shot.

@arulajmani arulajmani self-assigned this Apr 5, 2023
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead T-serverless labels May 10, 2023
@rafiss rafiss added T-serverless skipped-test and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels May 10, 2023
@rafiss
Copy link
Collaborator

rafiss commented May 18, 2023

ccing @jeffswenson for help with triaging this.

jeffswenson added a commit to jeffswenson/cockroach that referenced this issue May 19, 2023
I ran the test 2500 times with no flakes. The core source of the flakes
is the test depends on every range having an available replica in every
region. Any amount of replica movement causes the test to fail because
many regions only have a single available replica and temporary
unavailability causes the sql servers to pick a replica that is
available, but unreachable due to a networking hook injected by the
test.

There are a few independent changes in this PR:

1. It includes the settings from cockroachdb#100320. The lease transfer setting
	 improves test performance and disabling load based splitting helps
	 avoid replica migration. The test implicitly depends on every
	 replica being available because regions only contain a single replica
	 of some tables and sql servers are not allowed to communicate across
	 regions.
2. Based on the insight from cockroachdb#96334#issuecomment-1492610753 the test
	 checks the span conformance report for all servers instead of only
	 one of the servers.
3. The test no longer triggers scatters while waiting for the span
	 config bounds to conform. The scatters increased test flakiness and
	 have less performance impact after cockroachdb#100349.
4. WaitForFullReplication is called to ensure replication queues are
	 empty and there are no pending range moves.

Fixes cockroachdb#96334

Release Note: None
jeffswenson added a commit to jeffswenson/cockroach that referenced this issue May 19, 2023
I ran the test 2500 times with no flakes. The core source of the flakes
is the test depends on every range having an available replica in every
region. Any amount of replica movement causes the test to fail because
many regions only have a single replica and temporary unavailability
causes the sql servers to pick a replica that is available, but
unreachable due to a networking hook injected by the test.

There are a few independent changes in this PR:

1. It includes the settings from cockroachdb#100320. The lease transfer setting
   improves test performance and disabling load based splitting helps
   avoid replica migration.
2. Based on the insight from cockroachdb#96334#issuecomment-1492610753 the test
   checks the span conformance report for all servers instead of only
   one of the servers.
3. The test no longer triggers scatters while waiting for the span
   config bounds to conform. The scatters increased test flakiness
   and are not required to achieve conformance as we proactively nudge
   replicas through the replicate queue after cockroachdb#100349.
4. WaitForFullReplication is called to ensure replication queues are
   empty and there are no pending range moves.

Fixes cockroachdb#96334

Release Note: None
craig bot pushed a commit that referenced this issue May 19, 2023
103666: multiregionccl: deflake the multi-region cold start test r=JeffSwenson a=JeffSwenson

I ran the test 2500 times with no flakes. The core source of the flakes
is the test depends on every range having an available replica in every
region. Any amount of replica movement causes the test to fail because
many regions only have a single replica and temporary unavailability
causes the sql servers to pick a replica that is available, but
unreachable due to a networking hook injected by the test.

There are a few independent changes in this PR:

1. It includes the settings from #100320. The lease transfer setting
   improves test performance and disabling load based splitting helps
   avoid replica migration.
2. Based on the insight from #96334 (comment) the test
   checks the span conformance report for all servers instead of only
   one of the servers.
3. The test no longer triggers scatters while waiting for the span
   config bounds to conform. The scatters increased test flakiness
   and are not required to achieve conformance as we proactively nudge
   replicas through the replicate queue after #100349.
4. WaitForFullReplication is called to ensure replication queues are
   empty and there are no pending range moves.

Fixes #96334

Release Note: None

103680: roachtest: fix disk-stalled/* roachtests r=jbowens a=jbowens

In #103198, I mistakenly constructed a context with a timeout outside the deferral, resulting in the defer'd Unstall always timing out if the remainder of test took longer than a minute.

Epic: None
Fixes: #103664.
Fixes: #103663.
Fixes: #103662.
Fixes: #103661.
Release note: none

Co-authored-by: Jeff <swenson@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
@craig craig bot closed this as completed in #103666 May 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-authentication Pertains to authn subsystems branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test T-serverless
Projects
None yet
6 participants