
ccl/streamingccl/streamproducer: TestReplicationStreamTenant failed #75334

Closed
cockroach-teamcity opened this issue Jan 23, 2022 · 3 comments · Fixed by #75518
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity

ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ e1068d77afbd39b162978281c9da7cbea49c1c3a:

      /go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:43 +0x164
  github.com/cockroachdb/cockroach/pkg/sql.startExec()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:519 +0x180
  github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Start()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:146 +0x124
  github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Columnarizer).Init()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/columnarizer.go:158 +0x14b
  github.com/cockroachdb/cockroach/pkg/sql/colexec.(*invariantsChecker).Init()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:68 +0x97
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*batchInfoCollector).Init()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:87 +0x58
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedStatsCollectorImpl).Init()
      <autogenerated>:1 +0x4b
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).init.func1()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:235 +0x68
  github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91 +0x7b
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).init()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:234 +0x70
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).Run()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:275 +0x330
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlow).Run()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:259 +0x328
  github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:563 +0xf94
  github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1401 +0x249
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1408 +0xceb
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1085 +0x12c4
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:722 +0x342a
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:135 +0x93c
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execPortal()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:222 +0x1dc
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func2()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1820 +0xa1c
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1822 +0x6f8
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1676 +0x5fe
  github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:681 +0x14b
  github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:669 +0x405
==================
    testing.go:1152: race detected during execution of test
    --- FAIL: TestReplicationStreamTenant/stream-tenant-with-cursor (0.09s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-race -parallel=4

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 23, 2022
@blathers-crl blathers-crl bot added the T-cdc label Jan 23, 2022
@andreimatei andreimatei self-assigned this Jan 23, 2022
@andreimatei

see #75425 (comment)

@andreimatei andreimatei removed the T-cdc label Jan 23, 2022
@cockroach-teamcity

ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ b95c51e44e62a777e068281fed5f3c384f6830e8:

=== RUN   TestReplicationStreamTenant/stream-tenant-with-cursor
Parameters in this failure:

  • TAGS=bazel,gss


@cockroach-teamcity

ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ 1c4b0ddb3c29305db08f9ae848a6c0d0b190f620:

=== RUN   TestReplicationStreamTenant
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReplicationStreamTenant243999275
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestReplicationStreamTenant
    replication_stream_test.go:275: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReplicationStreamTenant243999275
    testing.go:1152: race detected during execution of test
--- FAIL: TestReplicationStreamTenant (3.39s)
Parameters in this failure:

  • GOFLAGS=-race -parallel=4


andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 25, 2022
tracing: fix span use after release when parent and child finish concurrently

Before this patch, it was possible for a span to be released to the
sync.Pool and reused while the tracing code was still holding a
reference to it. This was because a span could maintain unaccounted
references to its children even after those children were released. And
those references could be inadvertently used, causing a data race and
trace corruption.

This race had two parts:
1) s.mu.openChildren could still hold references to children after
   s.finish() terminated. Those references were unaccounted for, so
   those children could be reused if they finish. This would be fine (or
   at least benign) by itself (since a finished span should generally
   not be accessed), if it wasn't for 2).
2) s's recording can be collected after s.Finish() ran, and that would
   call into s's child.

In order to trigger the bad behavior, I believe you need a child and a
parent to finish() concurrently, and a chain of 4 spans in parent-child
relationships. Let's call the spans ggp, gp, p, c (g=grand, p=parent,
c=child). The race scenario goes like this:

1. gp.Finish() is called; gp.mu.finished is set early.
2. p.Finish() is called; p.mu.finished is set early.
3. c.Finish() is called. c calls into p (p.childFinished(c)), asking p
   to remove c from p.mu.openChildren, and to collect its recording if
   it wants to. This call is a no-op because p.mu.finished is already set.
4. p.Finish() terminates. c remains part of p.mu.openChildren, but the
   corresponding reference count on c is decremented. Similarly, p
   remains part of gp.mu.openChildren.
5. c.Finish() terminates. c's reference count drops to zero (because p
   no longer counts a reference to it). c is released to the pool, and
   reused.
6. gp.Finish() resumes and calls into ggp (ggp.childFinished(gp)). ggp
   collects gp's recording, and recursively that calls into p, and then
   c. This is a data race, since c was already reused.

To fix this, p.childFinished(c) is no longer automatically a no-op
when p is already finished. Instead, if c is still part of
p.mu.openChildren, it will be removed. This way, a child will not be
reused while the parent points to it.
Also, defensively, we wipe c.mu.openChildren at the end of c.Finish().

Release note: None

Fixes cockroachdb#75495
Fixes cockroachdb#75414
Fixes cockroachdb#75334
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 25, 2022
tracing: fix span use after release when parent and child finish concurrently

(The description of the race is identical to the commit above.)

To fix this, we have p.childFinished(c) no longer automatically be a
no-op when p is already finished. Instead, if c is still part of
p.mu.openChildren, it will be removed. This way, when c.finish()
completes, we know that c is no longer referenced by the parent. We
also wipe p.mu.openChildren at the end of p.Finish(). This way, if a
child c calls p.childFinished(c) afterwards, that call will be
short-circuited and c's recording will not be needlessly collected.

Release note: None

Fixes cockroachdb#75495
Fixes cockroachdb#75414
Fixes cockroachdb#75334
craig bot pushed a commit that referenced this issue Jan 26, 2022
74803: spanconfigsqltranslator: populate protected_timestamps in SpanConfig r=irfansharif,arulajmani a=adityamaru

This change teaches the SQLTranslator to hydrate the SpanConfigs
for a table with protected timestamps that apply to that table.

Concretely, this change initializes a spanconfig.ProtectedTimestampStateReader
in the txn in which the translation is taking place, thereby
providing a transactional view of the system.protected_ts_records
table. After generating the span configurations based on the zone
configurations that apply to the table, we hydrate the newly
introduced protected_timestamps field on each span configuration
with all the protected timestamps that apply to this table. This
includes protected timestamp records that directly target this
table, as well as records targeting the table's parent database.
This information is obtained from the ProtectedTimestampStateReader
mentioned above.

Additionally, this change modifies StartTenant to allow secondary
tenants to interact with the protected timestamp subsystem using a
"real" protectedts.Provider provided the migration
EnableProtectedTimestampsForTenant has run.

For testing purposes, this change teaches the data-driven framework
two additional commands, protect and release.

Informs: #73727

Release note: None

75458: ui: move loading and error pages to below page settings r=xinhaoz a=xinhaoz

Fixes: #75247

Previously, the loading and error pages in the statement and txn
pages took up the entire page, including the settings header
used to issue new stats requests. This resulted in the user
being unable to re-issue a query after a query results in
a timeout, as we save the query parameters for re-use on
each request.

Release note (ui change): Loading and error pages are now below
page config in txns and stmts pages.

---------------
https://www.loom.com/share/e903ec74441140be98969c02ebbc7923

75475: distsender: fix ResumeSpan for Gets r=RaduBerinde a=RaduBerinde

The distsender is not fully implementing the ResumeSpan contract for
Get; the promise is that a nil span is set when the operation has
successfully completed.

A Get that reaches a kvserver but that is not executed has the
ResumeSpan set on the server side. But if the requests span multiple
ranges, a subset of the requests will not make it to any kvserver.

This change fills in the missing case and improves the
TestMultiRangeScanWithPagination test to allow running mixed
operations.

Fixes #74736.

Release note (bug fix): In particular cases, some queries that involve
a scan which returns many results and which includes lookups for
individual keys were not returning all results from the table.

75515: sql: de-flake inverted_filter_geospatial_dist r=cucaroach a=cucaroach

Fixes: #75497

Add a retry to a test flaking due to distsender span config cache issues.

Release note: None

75518: tracing: fix span use after release when parent and child finish concurrently r=andreimatei a=andreimatei

(Commit message identical to the commit above.)

Fixes #75495
Fixes #75414
Fixes #75334

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in e135d09 Jan 26, 2022