Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/multiregionccl: TestMultiRegionDataDriven failed #74783

Closed
cockroach-teamcity opened this issue Jan 13, 2022 · 6 comments
Closed

ccl/multiregionccl: TestMultiRegionDataDriven failed #74783

cockroach-teamcity opened this issue Jan 13, 2022 · 6 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) T-kv KV Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jan 13, 2022

ccl/multiregionccl.TestMultiRegionDataDriven failed with artifacts on master @ c3d71ac887844bef174abb6dab2a4e1ce9270ab7:

=== RUN   TestMultiRegionDataDriven
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven853361253
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestMultiRegionDataDriven
    datadriven_test.go:390: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven853361253
--- FAIL: TestMultiRegionDataDriven (42.01s)
=== RUN   TestMultiRegionDataDriven/regional_by_table
    datadriven_test.go:122: 
        testdata/regional_by_table:226: SELECT * FROM db.rbt AS OF SYSTEM TIME follower_read_timestamp() WHERE k = 1
        expected:
        served locally: true
        served via follower read: true
        
        found:
        served locally: true
        served via follower read: false
    --- FAIL: TestMultiRegionDataDriven/regional_by_table (11.54s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-parallel=4

Same failure on other branches

/cc @cockroachdb/multiregion

This test on roachdash | Improve this report!

Jira issue: CRDB-12258

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 13, 2022
@cockroach-teamcity
Copy link
Member Author

ccl/multiregionccl.TestMultiRegionDataDriven failed with artifacts on master @ 912964e02ddd951c77d4f71981ae18b3894e9084:

=== RUN   TestMultiRegionDataDriven
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven3462472544
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestMultiRegionDataDriven
    datadriven_test.go:390: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven3462472544
--- FAIL: TestMultiRegionDataDriven (97.05s)
=== RUN   TestMultiRegionDataDriven/regional_by_table
    datadriven_test.go:122: 
        testdata/regional_by_table:101: SELECT * FROM db.rbt WHERE k = 1
        expected:
        served locally: true
        served via follower read: false
        
        found:
        served locally: false
    --- FAIL: TestMultiRegionDataDriven/regional_by_table (11.06s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-parallel=4

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

ccl/multiregionccl.TestMultiRegionDataDriven failed with artifacts on master @ 9b0065ca2450205a34d4237be7b317c2d895658c:

=== RUN   TestMultiRegionDataDriven
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven2898876028
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestMultiRegionDataDriven
    datadriven_test.go:390: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestMultiRegionDataDriven2898876028
--- FAIL: TestMultiRegionDataDriven (41.73s)
=== RUN   TestMultiRegionDataDriven/regional_by_table
    datadriven_test.go:122: 
        testdata/regional_by_table:154: SELECT * FROM db.rbt AS OF SYSTEM TIME follower_read_timestamp() WHERE k = 1
        expected:
        served locally: false
        
        found:
        served locally: true
        served via follower read: true
    --- FAIL: TestMultiRegionDataDriven/regional_by_table (12.23s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-parallel=4

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

ccl/multiregionccl.TestMultiRegionDataDriven failed with artifacts on master @ e1068d77afbd39b162978281c9da7cbea49c1c3a:

../../util/stop/stopper.go:327 stop.(*Stopper).RunTask { f(ctx) }
../../server/node.go:863 server.(*Node).writeNodeStatus { err = n.recorder.WriteNodeStatus(ctx, n.storeCfg.DB, *nodeStatus, mustExist) }
../../server/node.go:793 server.(*Node).startWriteNodeStatus { // node decommissioning, so we have to error out if we can't create it. }
../../server/server.go:1802 server.(*Server).PreStart { // Begin recording status summaries. }
../../server/server.go:1253 server.(*Server).Start { func (s *Server) Start(ctx context.Context) error { }
../../server/testserver.go:499 server.(*TestServer).Start { func (ts *TestServer) Start(ctx context.Context) error { }
../../testutils/testcluster/testcluster.go:485 testcluster.(*TestCluster).startServer { server := tc.Servers[idx] }
../../testutils/testcluster/testcluster.go:315 testcluster.(*TestCluster).Start { } else { }
../../testutils/testcluster/testcluster.go:185 testcluster.StartTestCluster { cluster.Start(t) }
datadriven_test.go:164 multiregionccl_test.TestMultiRegionDataDriven.func1.1 { }) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:321 datadriven.runDirective.func1 { actual := f(t, d) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:354 datadriven.runDirective { expectedLines := difflib.SplitLines(d.Expected) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:196 datadriven.runDirectiveOrSubTest { } }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:166 datadriven.runTestInternal { r := newTestDataReader(t, sourceName, reader, rewrite) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:136 datadriven.RunTest { rewriteData := runTestInternal(t, path, file, f, *rewriteTestFiles) }
datadriven_test.go:388 multiregionccl_test.TestMultiRegionDataDriven.func1 { }) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:413 datadriven.Walk { f(t, path) }
../../../vendor/github.com/cockroachdb/datadriven/datadriven.go:426 datadriven.Walk.func1 { Walk(t, filepath.Join(path, file.Name()), f) }

goroutine 38921 lock 0xc0040ecca8
../../util/tracing/crdbspan.go:360 tracing.(*crdbSpan).getVerboseRecording { // grandchildren there are. } <<<<<
../../util/tracing/crdbspan.go:357 tracing.(*crdbSpan).getVerboseRecording {  }
../../util/tracing/crdbspan.go:338 tracing.(*crdbSpan).getRecordingImpl { case RecordingVerbose: }
../../util/tracing/crdbspan.go:779 tracing.(*crdbSpan).childFinished { rec = child.GetRecording(RecordingVerbose, false /* finishing - the child is already finished */) }
../../util/tracing/crdbspan.go:321 tracing.(*crdbSpan).GetRecording { func (s *crdbSpan) GetRecording(recType RecordingType, finishing bool) Recording { }
../../util/tracing/crdbspan.go:242 tracing.(*crdbSpan).finish { parent.Span.i.crdb.childFinished(s) }
../../util/tracing/span_inner.go:91 tracing.(*spanInner).Finish {  }
../../util/tracing/span_inner.go:226 tracing.(*spanInner).Tracer { func (s *spanInner) Tracer() *Tracer { }
../../util/tracing/span.go:246 tracing.(*Span).Finish { sp.finishInternal() }
../../kv/kvserver/replica_range_lease.go:480 kvserver.(*pendingLeaseRequest).requestLeaseAsync.func2 { p.cancelLocked() }
../../util/stop/stopper.go:494 stop.(*Stopper).RunAsyncTaskEx.func2 { f(ctx) }

goroutine 38921 lock 0xc0040edba8
../../util/tracing/crdbspan.go:726 tracing.(*crdbSpan).childFinished { s.mu.Lock() } <<<<<
../../util/tracing/crdbspan.go:725 tracing.(*crdbSpan).childFinished { func (s *crdbSpan) childFinished(child *crdbSpan) { }
../../util/tracing/crdbspan.go:242 tracing.(*crdbSpan).finish { parent.Span.i.crdb.childFinished(s) }
../../util/tracing/span_inner.go:91 tracing.(*spanInner).Finish {  }
../../util/tracing/span_inner.go:226 tracing.(*spanInner).Tracer { func (s *spanInner) Tracer() *Tracer { }
../../util/tracing/span.go:246 tracing.(*Span).Finish { sp.finishInternal() }
../../kv/kvserver/replica_range_lease.go:480 kvserver.(*pendingLeaseRequest).requestLeaseAsync.func2 { p.cancelLocked() }
../../util/stop/stopper.go:494 stop.(*Stopper).RunAsyncTaskEx.func2 { f(ctx) }





ERROR: exit status 2

1 runs completed, 1 failures, over 1s
context canceled
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=deadlock

  • GOFLAGS=-parallel=4

Same failure on other branches

This test on roachdash | Improve this report!

@irfansharif
Copy link
Contributor

(Ignoring the last tracing failure that Andrei fixed)

These tests have been failing for a while, even before #73876 (which certainly couldn't have helped with its additional asynchrony). Here's one from 21.2: #73829. +cc #75543 (a bazel dup?). Looking at the test directives, we seem make use of arbitrary sleeps, perhaps that's why?

case "sleep-for-follower-read":
time.Sleep(time.Second)

In #75304 we observed the need to sometimes to sometimes retry the query itself to accommodate stale distsender caches, our parallel here being this directive, so perhaps that's not as much of a problem here:

case "refresh-range-descriptor-cache":

For the "wait-for-zone-config-changes" primitive, we can transfer the leaseholder by issuing a one-off transfer request, which I don't think guarantees that the lease would stay pinned to the destination. Likely a red-herring for these failures, but worth noting I think.

// "wait-for-zone-config-changes idx=server_number table-name=tbName
// [num-voters=num] [num-non-voters=num] [leaseholder=idx] [voter=idx_list]
// [non-voter=idx_list] [not-present=idx_list]": finds the
// range belonging to the given tbName's prefix key and runs it through the
// split, replicate, and raftsnapshot queues. If the num-voters or
// num-non-voters arguments are provided, it then makes sure that the range
// conforms to those.
// It also takes the arguments leaseholder, voter, nonvoter, and not-present,
// each of which take a comma-separated list of server indexes to check against
// the provided table.
// Note that if a server idx is not listed, it is unconstrained. This is
// necessary for cases where we know how many replicas of a type we want but
// we can't deterministically know their placement.
// If the leaseholder does not match the argument provided, a lease
// transfer is attempted.
// All this is done in a succeeds soon as any of these steps may error out for
// completely legitimate reasons.


I'm not planning on looking at these tests further unless pushed. +cc @arulajmani / @ajstorm / @nvanbenschoten: should we skip these tests on master? Do we think we'll have the time to polish them up? While they do provide good coverage, I do think it'd take time to make them a bit more robust, and in relation to #73876 at least, I'm not expecting any work here catch bugs as much as ensure this test is flake-proof.

@blathers-crl blathers-crl bot added the T-kv KV Team label Apr 11, 2022
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Apr 11, 2022
@AlexTalks AlexTalks added the S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) label Apr 19, 2022
@otan
Copy link
Contributor

otan commented Apr 26, 2022

hasn't triggered for over a month

@otan otan closed this as completed Apr 26, 2022
@irfansharif
Copy link
Contributor

hasn't triggered for over a month

They have on release-22.1 which I guess Arul's currently assigned to (#77908). I suspect this will just fail again but happy connect the history back to this issue when it happens. This is a pretty long running test, 45s+ at times, if it's not flaking in CI it's also possible because it's just not getting stressed enough. I hope we can either rewrite this test to be faster or rip it out entirely.

@exalate-issue-sync exalate-issue-sync bot removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jul 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) T-kv KV Team
Projects
None yet
Development

No branches or pull requests

5 participants