-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multiregionccl: flake in TestColdStartLatency due to timeout during user login #96334
Labels
A-authentication
Pertains to authn subsystems
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
skipped-test
T-serverless
Comments
knz
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-sql-foundations
SQL Foundations Team (formerly SQL Schema + SQL Sessions)
A-authentication
Pertains to authn subsystems
labels
Feb 1, 2023
cc @ajwerner as well, in case it relates to some of the changes made for cold starts |
ajwerner
added a commit
to ajwerner/cockroach
that referenced
this issue
Feb 2, 2023
See cockroachdb#96334 Epic: None Release note: None
Let's skip it. It does repro relatively readily and I haven't been able to debug it. Here's a PR #96464 |
ajwerner
added
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
GA-blocker
labels
Feb 2, 2023
craig bot
pushed a commit
that referenced
this issue
Feb 3, 2023
96464: multiregionccl: skip TestColdStartLatency r=ajwerner a=ajwerner See #96334 Epic: None Release note: None 96466: sql,auth: fix broken NOSQLLOGIN global privilege r=ecwall a=rafiss fixes #96465 In addition to the bug fix, this cleans up some code so we don't have to construct planners on the fly during authentication. Tests were added, since the initial PR that added the NOSQLLOGIN global privilege did not have any tests. Release note (bug fix): The global NOSQLLOGIN privilege had a bug that made CockroachDB ignore it entirely, so it had no effect. This is now fixed. The bug was introduced in v22.2.0-alpha.1. The NOSQLLOGIN role option is unaffected by this bug. Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
rafiss
removed
the
T-sql-foundations
SQL Foundations Team (formerly SQL Schema + SQL Sessions)
label
Mar 1, 2023
ajwerner
added a commit
to ajwerner/cockroach
that referenced
this issue
Mar 10, 2023
This test was flakey due to the closed timestamp sometimes not leading far for global tables due to overload, and due to a cached liveness reader not being used in distsql. The former was fixed in previous commits. The latter is fixed here. Fixes: cockroachdb#96334 Release note: None
craig bot
pushed a commit
that referenced
this issue
Mar 14, 2023
98368: multiregionccl,server: use cached sqlliveness.Reader, deflake ColdStartLatencyTest r=ajwerner a=ajwerner #### multitenantccl: deflake ColdStartLatencyTest This test was flakey due to the closed timestamp sometimes not leading far for global tables due to overload, and due to a cached liveness reader not being used in distsql. The former was fixed in previous commits. The latter is fixed here. Fixes: #96334 #### sql: use CachedReader for uses with sqlinstance and the sql builtins The CachedReader won't block, which in multi-region clusters is good. It will mean that in some cases, it'll state that a sessions is alive when it most certainly is not. Currently, nobody needs synchronous semantics. This is a major part of fixing the TestColdStartLatency as sometimes distsql planning would block. That's not acceptable -- the idea that query physical planning can need to wait for a cross-region RPC is unacceptable. #### sqlliveness: re-arrange APIs to clarify when the API was blocking By "default" the implementation of Reader was blocking and there was a method to get a handle to a non-blocking CachedReader(). This asymmetry does not aid understandability. The non-blocking reader came later, but it is generally the more desirable interface. Release note: None 98519: storage: More CheckSSTConflicts fixes r=erikgrinaker a=itsbilal A few additional fixes around CheckSSTConflicts, stats calculations, and Next()ing logic, caught by kvnemesis. Hopefully the last of its kind. Also re-enable kvnemesis testing for range keys in AddSSTable, reverting #98475. Fixes #94141. Fixes #98473. Informs #94876. Epic: none Release note: None 98567: backupccl: use correct version gate for restore checkpointing r=adityamaru a=msbutler PR #97862 introduced a subtle bug which allowed the new restore checkpointing policy to take effect before the 23_1 migrations occured. This patch ensures the new policy only takes effect after all migrations occur. Release note: None Epic: None Co-authored-by: ajwerner <awerner32@gmail.com> Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com> Co-authored-by: Michael Butler <butler@cockroachlabs.com>
ajwerner
added a commit
to ajwerner/cockroach
that referenced
this issue
Mar 14, 2023
Epic: none Informs cockroachdb#96334 Release note: None
I'm removing the GA-blocker label. |
Enough to atleast give the theory we have above a shot. |
exalate-issue-sync
bot
added
T-sql-foundations
SQL Foundations Team (formerly SQL Schema + SQL Sessions)
and removed
T-sql-schema-deprecated
Use T-sql-foundations instead
T-serverless
labels
May 10, 2023
rafiss
added
T-serverless
skipped-test
and removed
T-sql-foundations
SQL Foundations Team (formerly SQL Schema + SQL Sessions)
labels
May 10, 2023
ccing @jeffswenson for help with triaging this. |
jeffswenson
added a commit
to jeffswenson/cockroach
that referenced
this issue
May 19, 2023
I ran the test 2500 times with no flakes. The core source of the flakes is the test depends on every range having an available replica in every region. Any amount of replica movement causes the test to fail because many regions only have a single available replica and temporary unavailability causes the sql servers to pick a replica that is available, but unreachable due to a networking hook injected by the test. There are a few independent changes in this PR: 1. It includes the settings from cockroachdb#100320. The lease transfer setting improves test performance and disabling load based splitting helps avoid replica migration. The test implicitly depends on every replica being available because regions only contain a single replica of some tables and sql servers are not allowed to communicate across regions. 2. Based on the insight from cockroachdb#96334#issuecomment-1492610753 the test checks the span conformance report for all servers instead of only one of the servers. 3. The test no longer triggers scatters while waiting for the span config bounds to conform. The scatters increased test flakiness and have less performance impact after cockroachdb#100349. 4. WaitForFullReplication is called to ensure replication queues are empty and there are no pending range moves. Fixes cockroachdb#96334 Release Note: None
jeffswenson
added a commit
to jeffswenson/cockroach
that referenced
this issue
May 19, 2023
I ran the test 2500 times with no flakes. The core source of the flakes is the test depends on every range having an available replica in every region. Any amount of replica movement causes the test to fail because many regions only have a single replica and temporary unavailability causes the sql servers to pick a replica that is available, but unreachable due to a networking hook injected by the test. There are a few independent changes in this PR: 1. It includes the settings from cockroachdb#100320. The lease transfer setting improves test performance and disabling load based splitting helps avoid replica migration. 2. Based on the insight from cockroachdb#96334#issuecomment-1492610753 the test checks the span conformance report for all servers instead of only one of the servers. 3. The test no longer triggers scatters while waiting for the span config bounds to conform. The scatters increased test flakiness and are not required to achieve conformance as we proactively nudge replicas through the replicate queue after cockroachdb#100349. 4. WaitForFullReplication is called to ensure replication queues are empty and there are no pending range moves. Fixes cockroachdb#96334 Release Note: None
craig bot
pushed a commit
that referenced
this issue
May 19, 2023
103666: multiregionccl: deflake the multi-region cold start test r=JeffSwenson a=JeffSwenson I ran the test 2500 times with no flakes. The core source of the flakes is the test depends on every range having an available replica in every region. Any amount of replica movement causes the test to fail because many regions only have a single replica and temporary unavailability causes the sql servers to pick a replica that is available, but unreachable due to a networking hook injected by the test. There are a few independent changes in this PR: 1. It includes the settings from #100320. The lease transfer setting improves test performance and disabling load based splitting helps avoid replica migration. 2. Based on the insight from #96334 (comment) the test checks the span conformance report for all servers instead of only one of the servers. 3. The test no longer triggers scatters while waiting for the span config bounds to conform. The scatters increased test flakiness and are not required to achieve conformance as we proactively nudge replicas through the replicate queue after #100349. 4. WaitForFullReplication is called to ensure replication queues are empty and there are no pending range moves. Fixes #96334 Release Note: None 103680: roachtest: fix disk-stalled/* roachtests r=jbowens a=jbowens In #103198, I mistakenly constructed a context with a timeout outside the deferral, resulting in the defer'd Unstall always timing out if the remainder of test took longer than a minute. Epic: None Fixes: #103664. Fixes: #103663. Fixes: #103662. Fixes: #103661. Release note: none Co-authored-by: Jeff <swenson@cockroachlabs.com> Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-authentication
Pertains to authn subsystems
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
skipped-test
T-serverless
This is different from previous flakes and refers to the user account retrieval during session authentication.
Found here: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/8537864?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true
cc @rafiss for triage
Jira issue: CRDB-24073
The text was updated successfully, but these errors were encountered: