Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestSQLStatsDataDriven is flaky #89861

Closed
fantapop opened this issue Oct 12, 2022 · 5 comments · Fixed by #98337
Closed

TestSQLStatsDataDriven is flaky #89861

fantapop opened this issue Oct 12, 2022 · 5 comments · Fixed by #98337
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test

Comments

@fantapop
Copy link
Collaborator

fantapop commented Oct 12, 2022

It looks like TestSQLStatsDataDriven occasionally flakes. 2 out of the last 1000 runs.

Screen Shot 2022-10-12 at 11 19 12 AM

Failed
=== RUN TestSQLStatsDataDriven
test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/83b49f90eb84fbc5d893648fc9fd7294/logTestSQLStatsDataDriven2913649257
test_log_scope.go:79: use -show-logs to present logs inline
=== CONT TestSQLStatsDataDriven
datadriven_test.go:223: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/83b49f90eb84fbc5d893648fc9fd7294/logTestSQLStatsDataDriven2913649257
--- FAIL: TestSQLStatsDataDriven (8.34s)

Its hard to tell whats going on here. But its flaked again recently. I see this at the end of the log:

W221012 19:26:22.304027 2704 kv/kvserver/closedts/sidetransport/receiver.go:139 â‹® [n2] 264 closed timestamps side-transport connection dropped from node: 1
W221012 19:26:22.304076 1929 jobs/registry.go:824 â‹® [n2] 265 canceling all adopted jobs due to stopper quiescing
I221012 19:26:22.305595 2435 sql/stats/automatic_stats.go:509 â‹® [n3] 266 quiescing auto stats refresher
W221012 19:26:22.305700 2441 jobs/registry.go:824 â‹® [n3] 267 canceling all adopted jobs due to stopper quiescing
W221012 19:26:22.307135 2423 sql/sqlliveness/slinstance/slinstance.go:250 â‹® [n3] 268 exiting heartbeat loop
W221012 19:26:22.307801 1049 jobs/registry.go:824 â‹® [n1] 269 canceling all adopted jobs due to stopper quiescing
W221012 19:26:22.307981 1031 sql/sqlliveness/slinstance/slinstance.go:250 â‹® [n1] 270 exiting heartbeat loop
I221012 19:26:22.308208 1226 jobs/registry.go:1217 â‹® [n1] 271 AUTO SPAN CONFIG RECONCILIATION job 804560481633009665: stepping through state succeeded with error:
W221012 19:26:22.308299 1911 sql/sqlliveness/slinstance/slinstance.go:250 â‹® [n2] 272 exiting heartbeat loop
W221012 19:26:22.308306 1226 kv/txn.go:705 ⋮ [n1] 273 failure aborting transaction: ‹node unavailable; try another peer›; abort caused by: context canceled
W221012 19:26:22.308609 2819 kv/kvserver/closedts/sidetransport/receiver.go:139 â‹® [n3] 274 closed timestamps side-transport connection dropped from node: 1

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/6911488?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildDeploymentsSection=true&expandBuildTestsSection=true

Jira issue: CRDB-20469

@fantapop fantapop added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-table-stats Table statistics (and their automatic refresh). labels Oct 12, 2022
@michae2 michae2 added the T-sql-queries SQL Queries Team label Oct 12, 2022
@matthewtodd matthewtodd added A-sql-observability Related to observability of the SQL layer and removed T-sql-queries SQL Queries Team A-sql-table-stats Table statistics (and their automatic refresh). labels Oct 13, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue Oct 27, 2022
Skipping the test since its flaky.

Informs: cockroachdb#89861

Release note: None
@adityamaru
Copy link
Contributor

craig bot pushed a commit that referenced this issue Oct 27, 2022
90744: persistedsqlstats: skip logical_plan_sampling_for_explicit_txn r=matthewtodd a=adityamaru

Skipping the test since its flaky.

Informs: #89861

Release note: None

Co-authored-by: adityamaru <adityamaru@gmail.com>
blathers-crl bot pushed a commit that referenced this issue Nov 10, 2022
Skipping the test since its flaky.

Informs: #89861

Release note: None
craig bot pushed a commit that referenced this issue Mar 15, 2023
98337: sql: enable TestSQLStatsDataDriven r=j82w a=j82w

Adding retry logic to the execute command, and
enabling the test. All previous builds with failures 
are no longer available. No failures on testing
locally with stress.

Epic: none
Closes: #89861

Release note: none

Co-authored-by: j82w <jwilley@cockroachlabs.com>
@craig craig bot closed this as completed in 6286ba1 Mar 15, 2023
@andrewbaptist
Copy link
Collaborator

I see this was just unskipped - but it just failed again for me:

Looks like several others had the same failure: https://teamcity.cockroachdb.com/test/6015532732023785721?currentProjectId=Cockroach_Ci_TestsGcpLinuxX8664BigVm&expandTestHistoryChartSection=true

@andrewbaptist andrewbaptist reopened this Mar 18, 2023
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Mar 21, 2023
This commit skips TestSQLStatsDataDriven. This test is flaky.

Part of cockroachdb#89861.

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Mar 21, 2023
…99123

98712: storage: expose separate WriteBatch interface r=sumeerbhola a=jbowens

Adapt NewUnindexedBatch to no longer take a writeOnly parameter. Instead, a new NewWriteBatch method is exposed that returns a WriteBatch type that does not provide Reader facilities. Future work may remove UnindexedBatch altogether, updating callers to explicitly maintain separate Readers and WriteBatches.

Epic: None
Release note: None

98807: spanconfig: add TestBlockedKVSubscriberDisablesQueues r=irfansharif a=irfansharif

Regression test for #98422 which disables {split,replicate,mvccGC} queues until subscribed to span configs. We're only slightly modifying the existing merge-queue specific TestBlockedKVSubscriberDisablesMerges.

Release note: None

98823: sql: address a couple of old TODOs r=yuzefovich a=yuzefovich

This commit addresses a couple of old TODOs in the `connExecutor.dispatchToExecutionEngine` method.

First, it simply removes the now-stale TODO about needing to `Step()` the txn before executing the query. The TODO mentioned things about the old way FK checks were performed, but we now always run checks in a deferred fashion, and perform the `Step`s correctly there for cascades and checks, thus, the TODO can be removed. (The TODO itself explicitly mentions that it can be removed once we do checks and cascades in the deferred fashion.)

Second, it addresses a TODO about how some plan information is saved. Up until 961e66f the cleanup of `planNode` tree and `flow` infra was intertwined, so in 6ae4881 in order to "sample" the plan with correct info (some of which is only available _after_ the execution is done) we delegated that sampling to `planTop.close`. However, with `planNode` tree and `flow` cleanups refactored and de-coupled, we no longer have to close the plan early on. As a result, we now close the plan in a defer right after we create it (now it's the only place we do so), and this commit makes it so that we record the plan info explicitly right after returning from the execution engine, thus, addressing the second TODO.

Epic: None

Release note: None

98896: tracing: fix WithEventListeners option with verbose parent r=yuzefovich a=yuzefovich

Previously, when using `WithEventListeners` span option and not explicitly specifying `WithRecording` option, we could incorrectly use the structured recording. In particular, this was the case when the parent has verbose recording, and this is now fixed. At the moment, this span option is used only by the backup and restore processors, so the impact would be that their verbose recording wouldn't be included into the session recording (perhaps we don't actually expose that anyway).

Epic: None

Release note: None

99079: sql: fix hint for unknown udfs in views and funcs r=rharding6373 a=rharding6373

Fixes a hint for some instances of an "unknown function" error that was not properly applied previously.

Informs: #99002
Epic: None

Release note: None

99095: ui: fix inflight request replacement, cleanup database details api r=THardy98 a=THardy98

Changes to allow in-flight request replacement in this PR: #98331 caused breaking changes to `KeyedCachedDataReducer`s. Despite the added `allowReplacementRequests` flag being `false`, in-flight requests would be replaced (specifically, when fetching for initial state). This would prevent all requests but the last one from being stored in the reducer state and leave the remaining requests in a permanent `inFlight` status. Consequently, our pages would break as the permanent `inFlight` status on these requests would prevent us from ever fetching data, putting these pages in a permanent loading state.

This PR adds checks to ensure `allowReplacementRequests` is enabled when we receive a response, before deciding whether to omit it from the reducer state.

It's important to note that this is still not an ideal solution for `KeyedCachedDataReducer`s, should we ever want to replace inflight requests for one. The checks to replace an in-flight request still occur at the reducer-level, whereas in a `KeyedCachedDataReducer` we'd like them to occur at the state-level (i.e. instead of replacement checks for all requests across the reducer, we should check for requests that impact a specific key's state). This way we scope replacements to each key in the reducer, allowing us to fetch data for each key independently (which is the intended functionality). Enabling this for a `KeyedCachedDataReducer` without making this change I believe will cause the breaking behaviour above.

This PR also includes some cleanup to the database details API, namely:
- camel casing response fields
- encapsulating the database span stats response into its own object (instead of across multiple objects), this helps scope any errors deriving from the query to the single stats object

**BEFORE**
https://www.loom.com/share/57c9f278376a4551977398316d6ed7b6

**AFTER**
https://www.loom.com/share/b2d4e074afca4b42a4f711171de3fd72

Release note (bug fix): Fixed replacement of in-flight requests for `KeyedCachedDataReducer`s to prevent permanent loading on requests stuck on an `inFlight` status.

99097: upgrades: fix backfilling of special roles in user ID migrations r=rafiss a=andyyang890

upgrades: fix system.privileges user ID migration

This patch adds a missing case for the public role in the backfilling
portion of the system.privileges user ID migration.

Release note: None

---

upgrades: fix system.database_role_settings user ID migration

This patch adds a missing case for the empty role in the backfilling
portion of the system.database_role_settings user ID migration.

Release note: None

---

Part of #87079

99113: server: use no-op cost controller for shared-process tenants r=knz a=stevendanna

In the long run, all rate limiting of tenants should be controlled by
a tenant capability.

However, at the moment, we do not have the infrastructure to read
tenant capabilities from the tenant process.

Since, the main user of shared-process tenants is the new,
experimental unified-architecture were we do not anticipate needing
the cost controller for most users, this PR injects a no-op cost
controller for shared-process tenants.

Epic: CRDB-23559

Release note: None

99120: tracing: allow collection of stack history for Structured spans as well r=dt a=adityamaru

Previously, we mandated that the span must be Verbose to be able to collect and emit a Structured event containing stack history. This change permits Structured spans to also collect stack history.

Release note: None
Epic: none

99123: sql: skip flaky TestSQLStatsDataDriven r=ericharmeling a=ericharmeling

This commit skips TestSQLStatsDataDriven. This test is flaky.

Part of #89861.

Epic: none

Release note: None

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com>
Co-authored-by: Thomas Hardy <thardy@cockroachlabs.com>
Co-authored-by: Andy Yang <yang@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
@maryliag maryliag added T-cluster-observability and removed A-sql-observability Related to observability of the SQL layer T-sql-observability labels Jul 5, 2023
@j82w
Copy link
Contributor

j82w commented Jul 6, 2023

Duplicate: #89861

@j82w j82w closed this as completed Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants