sql: TestTelemetry failed #75190

cockroach-teamcity · 2022-01-20T03:27:09Z

sql.TestTelemetry failed with artifacts on master @ 54c5a2c645756c8cbd3cf7dff1fd354c290b013f:

=== RUN   TestTelemetry
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestTelemetry2553740697
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestTelemetry
    telemetry_test.go:45: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestTelemetry2553740697
--- FAIL: TestTelemetry (56.18s)
=== RUN   TestTelemetry/planning/server
    telemetry.go:90: 
        testdata/telemetry/planning:191: SELECT x FROM (VALUES (1)) AS b(x) WHERE EXISTS(SELECT * FROM (VALUES (1)) AS a(x) WHERE a.x = b.x)
        expected:
        sql.plan.subquery.correlated
        
        found:
        sql.plan.subquery
        sql.plan.subquery.correlated
        --- FAIL: TestTelemetry/planning/server (2.11s)
=== RUN   TestTelemetry/planning
    --- FAIL: TestTelemetry/planning (3.50s)

Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

GOFLAGS=-json

/cc @cockroachdb/sql-queries _{This test on roachdash | Improve this report!}

The text was updated successfully, but these errors were encountered:

cockroach-teamcity · 2022-01-20T09:47:14Z

sql.TestTelemetry failed with artifacts on master @ 506412ffdb6d0f6d4952d2599b81cb9294a123a1:

=== RUN   TestTelemetry
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestTelemetry2017684981
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestTelemetry
    telemetry_test.go:45: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestTelemetry2017684981
--- FAIL: TestTelemetry (70.18s)
=== RUN   TestTelemetry/planning/server
    telemetry.go:90: 
        testdata/telemetry/planning:191: SELECT x FROM (VALUES (1)) AS b(x) WHERE EXISTS(SELECT * FROM (VALUES (1)) AS a(x) WHERE a.x = b.x)
        expected:
        sql.plan.subquery.correlated
        
        found:
        sql.plan.subquery
        sql.plan.subquery.correlated
        --- FAIL: TestTelemetry/planning/server (2.15s)
=== RUN   TestTelemetry/planning
    --- FAIL: TestTelemetry/planning (3.93s)

Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

TAGS=deadlock
GOFLAGS=-parallel=4

_{This test on roachdash | Improve this report!}

irfansharif · 2022-01-20T13:15:25Z

+cc @rytaft (are we expecting this after #75145?)

rytaft · 2022-01-20T14:14:54Z

+cc @rytaft (are we expecting this after #75145?)

Thanks, @irfansharif. I'm not sure why this test just suddenly started to flake (we haven't changed anything here recently), but I've opened another PR which should hopefully help.

irfansharif · 2022-01-20T14:27:58Z

Could it be due to #73876? Here's an e-mail sent out to KV members, but perhaps should be disseminated more widely.

TL;DR: We're merging a PR with a wide blast radius. Failure modes to look out
for and tell me/Arul/#zone-configs about:
- Panics/assertions/test flakes in pkg/spanconfig/... and pkg/ccl/spanconfigccl
- Tests where range counts don't converge as expected, or as quickly as
  expected.
- Tests where the right range is not found, possibly because the right range
  boundaries were not split upon (yet).
- "Trace analysis" test failures that check round trip time for various things
  (pkg/ccl/multiregionccl), finding a different number of RTTs than the test
  expects.
- Version upgrade failures relating to pkg/spanconfig and friends.
- Tests generally taking a lot longer to run than before, possibly timing out
  test packages/etc.

We're about to land *: enable span configs by default, which switches over all
our tests to stop using the system config span in favor of the span configs
infrastructure. This has a pretty wide blast radius, and while we've
future-proofed our test suite as much as we could during review, it's possible
we missed something. This email talks through a few failure modes we can expect
in CI in the coming weeks, why they happen, and what to do about them.

A lot of tests we've written in CRDB implicitly rely on synchrony assumptions
that no longer hold. They're typically around how quickly splits are induced
for new tables, or how quickly configs are applied. To see why, consider unit
tests that make use of a single node or do a lot of heavy lifting through the
first node in the test cluster (our testing libraries default to the first node
unless opted out explicitly). In either form, when we were using gossip, the
effects of a config change appeared basically instantaneously. There was always
delay with the split/merge queue applying these changes, but that too happened
near instantaneously (single digit milliseconds). The span configs
infrastructure uses rangefeeds internally, and propagation of updates is
subject to the closed timestamp target interval, which today defaults to 3s.
This is fine for production use, but can surface synchrony assumptions made in
tests that aren't so easily masked over. In addition to causing tests to flake,
the PR might cause tests to slow down. For a bunch of these tests we've gone
and lowered the closed timestamp target duration, but it's easy to miss things
and no good way to survey all the tests that now take a while longer to
complete (is there?). Another example are tests that use scratch ranges and
apply configs to it. Previously RANGE DEFAULT applied to the entire keyspace,
but now it does only to keyspaces that the SQL layer knows how to encode. For
tests that care about configuring scratch ranges, there's a testing knob, but
it's an opt-in one.

For most of these tests, making them work with the span configs infra was/is
just a matter of slapping a SucceedsSoon block selectively. But still, this is
hard to see looking at opaque timeouts and slower-than-before tests. If you see
anything similar to the above, tell me/Arul/#zone-configs about it and we'll
happily investigate. Given the blast radius, we'd be lucky if we didn't trip up
anything. The PR enabling was written to be easily revertable, and it's
possible we'll have to. Let us know.

rytaft · 2022-01-20T14:41:37Z

Thanks for the info, @irfansharif. I'm not too familiar with how TestTelemetry works, but it seems to rely on this diagnostics reporter:

cockroach/pkg/server/diagnostics/reporter.go

Line 68 in b08695e

type Reporter struct {

Perhaps that would be affected by #73876? At any rate, the fixes I'm making to deflake this test should work, but it might fail again further down in the file. This failure is just a bit below the last failure that I fixed. I can't trigger this failure with make stress, so it's hard to know what will cause issues.

irfansharif · 2022-01-20T14:43:01Z

Should we be liberal with our allow-lists throughout the file then? Would it be an improper test if we did so? I'll try looking through the diagnostic report to see if anything appears obvious.

rytaft · 2022-01-20T15:03:07Z

I don't think that would be a problem. But since I've already posted #75210, let's just see if it flakes again, in which case I'll update the rest of the file with more specific allow-lists.

75191: cli/demo: fix connection string for shared DB server r=otan,knz a=rafiss fixes #74162 This is needed since the workload fixtures are not installed on the shared server, so it only has the default databases. Release note: None 75210: sql: deflake TestTelemetry r=rytaft a=rytaft This commit deflakes TestTelemetry by adding a more precise feature-allowlist. Fixes #75190 Release note: None Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com> Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>

This commit deflakes TestTelemetry by adding a more precise feature-allowlist. Fixes cockroachdb#75190 Release note: None

cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 20, 2022

blathers-crl bot added the T-sql-queries SQL Queries Team label Jan 20, 2022

irfansharif assigned rytaft Jan 20, 2022

rytaft mentioned this issue Jan 20, 2022

sql: deflake TestTelemetry #75210

Merged

craig bot closed this as completed in 4af0f77 Jan 20, 2022

gtr pushed a commit to gtr/cockroach that referenced this issue Jan 24, 2022

sql: deflake TestTelemetry

813d9a6

This commit deflakes TestTelemetry by adding a more precise feature-allowlist. Fixes cockroachdb#75190 Release note: None

irfansharif mentioned this issue Jan 28, 2022

spanconfig: address fallout after enabling-by-default #75639

Closed

23 tasks

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: TestTelemetry failed #75190

sql: TestTelemetry failed #75190

cockroach-teamcity commented Jan 20, 2022

cockroach-teamcity commented Jan 20, 2022

irfansharif commented Jan 20, 2022

rytaft commented Jan 20, 2022

irfansharif commented Jan 20, 2022

rytaft commented Jan 20, 2022

irfansharif commented Jan 20, 2022

rytaft commented Jan 20, 2022

sql: TestTelemetry failed #75190

sql: TestTelemetry failed #75190

Comments

cockroach-teamcity commented Jan 20, 2022

cockroach-teamcity commented Jan 20, 2022

irfansharif commented Jan 20, 2022

rytaft commented Jan 20, 2022

irfansharif commented Jan 20, 2022

rytaft commented Jan 20, 2022

irfansharif commented Jan 20, 2022

rytaft commented Jan 20, 2022