
ccl/streamingccl/streamproducer: TestReplicationStreamTenant failed #75334

Closed
cockroach-teamcity opened this issue Jan 23, 2022 · 3 comments · Fixed by #75518
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity

ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ e1068d77afbd39b162978281c9da7cbea49c1c3a:

      /go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:43 +0x164
  github.com/cockroachdb/cockroach/pkg/sql.startExec()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:519 +0x180
  github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Start()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:146 +0x124
  github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Columnarizer).Init()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/columnarizer.go:158 +0x14b
  github.com/cockroachdb/cockroach/pkg/sql/colexec.(*invariantsChecker).Init()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:68 +0x97
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*batchInfoCollector).Init()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/stats.go:87 +0x58
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedStatsCollectorImpl).Init()
      <autogenerated>:1 +0x4b
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).init.func1()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:235 +0x68
  github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91 +0x7b
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).init()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:234 +0x70
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*BatchFlowCoordinator).Run()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:275 +0x330
  github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlow).Run()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:259 +0x328
  github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:563 +0xf94
  github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1401 +0x249
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1408 +0xceb
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1085 +0x12c4
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:722 +0x342a
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:135 +0x93c
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execPortal()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:222 +0x1dc
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func2()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1820 +0xa1c
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1822 +0x6f8
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1676 +0x5fe
  github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:681 +0x14b
  github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1()
      /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:669 +0x405
==================
    testing.go:1152: race detected during execution of test
    --- FAIL: TestReplicationStreamTenant/stream-tenant-with-cursor (0.09s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-race -parallel=4

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 23, 2022
@blathers-crl blathers-crl bot added the T-cdc label Jan 23, 2022
@andreimatei andreimatei self-assigned this Jan 23, 2022
@andreimatei

see #75425 (comment)

@andreimatei andreimatei removed the T-cdc label Jan 23, 2022
@cockroach-teamcity

ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ b95c51e44e62a777e068281fed5f3c384f6830e8:

=== RUN   TestReplicationStreamTenant/stream-tenant-with-cursor
Parameters in this failure:

  • TAGS=bazel,gss


@cockroach-teamcity

ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ 1c4b0ddb3c29305db08f9ae848a6c0d0b190f620:

=== RUN   TestReplicationStreamTenant
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReplicationStreamTenant243999275
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestReplicationStreamTenant
    replication_stream_test.go:275: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReplicationStreamTenant243999275
    testing.go:1152: race detected during execution of test
--- FAIL: TestReplicationStreamTenant (3.39s)
Parameters in this failure:

  • GOFLAGS=-race -parallel=4


andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 25, 2022
tracing: fix span use after release when parent and child finish concurrently

Before this patch, it was possible for a span to be released to the
sync.Pool and reused while the tracing code was still holding a
reference to it. This was because a span could maintain unaccounted
references to its children even after those children were released. And
those references could be inadvertently used, causing a data race and
trace corruption.

This race had two parts:
1) s.mu.openChildren could still hold references to children after
   s.finish() terminated. Those references were unaccounted for, so
   those children could be reused if they finish. This would be fine (or
   at least benign) by itself (since a finished span should generally
   not be accessed), if it wasn't for 2).
2) s's recording can be collected after s.Finish() ran, and that would
   call into s's child.

In order to trigger the bad behavior, I believe you need a child and a
parent to finish() concurrently, and a chain of 4 spans in parent-child
relationships. Let's call the spans ggp, gp, p, c (g=grand, p=parent,
c=child). The race scenario goes like this:

1. gp.Finish() is called; gp.mu.finished is set early.
2. p.Finish() is called; p.mu.finished is set early.
3. c.Finish() is called. c calls into p (p.childFinished(c)), asking p
   to remove c from p.mu.openChildren, and to collect its recording if
   it wants to. This call is a no-op because p.mu.finished is already set.
4. p.Finish() terminates. c remains part of p.mu.openChildren, but the
   corresponding reference count on c is decremented. Similarly, p
   remains part of gp.mu.openChildren.
5. c.Finish() terminates. c's reference count drops to zero (because p
   no longer counts a reference to it). c is released to the pool, and
   reused.
6. gp.Finish() resumes and calls into ggp (ggp.childFinished(gp)). ggp
   collects gp's recording, and recursively that calls into p, and then
   c. This is a data race, since c was already reused.

To fix this, p.childFinished(c) is no longer automatically a no-op
when p is already finished. Instead, if c is still part of
p.mu.openChildren, it will be removed. This way, a child will not be
reused while the parent points to it.
Also, defensively, we wipe c.mu.openChildren at the end of c.Finish().

Release note: None

Fixes cockroachdb#75495
Fixes cockroachdb#75414
Fixes cockroachdb#75334
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 25, 2022
tracing: fix span use after release when parent and child finish concurrently

(The description of the race is identical to the commit above.)

To fix this, we have p.childFinished(c) no longer automatically be a
no-op when p is already finished. Instead, if c is still part of
p.mu.openChildren, it will be removed. This way, when c.finish()
completes, we know that c is no longer referenced by the parent. We
also wipe p.mu.openChildren at the end of p.Finish(). This way, if a
child c calls p.childFinished(c) afterwards, that call will be
short-circuited and c's recording will not be needlessly collected.

Release note: None

Fixes cockroachdb#75495
Fixes cockroachdb#75414
Fixes cockroachdb#75334
craig bot pushed a commit that referenced this issue Jan 26, 2022
74803: spanconfigsqltranslator: populate protected_timestamps in SpanConfig r=irfansharif,arulajmani a=adityamaru

This change teaches the SQLTranslator to hydrate the SpanConfigs
for a table with protected timestamps that apply to that table.

Concretely, this change initializes a spanconfig.ProtectedTimestampStateReader
in the txn in which the translation is taking place, thereby
providing a transactional view of the system.protected_ts_records
table. After generating the span configurations based on the zone
configurations that apply to the table, we hydrate the newly
introduced protected_timestamps field on each span configuration
with all the protected timestamps that apply to this table. This
includes protected timestamp records that directly target this
table, as well as records targeting the table's parent database.
This information is obtained from the ProtectedTimestampStateReader
mentioned above.

Additionally, this change modifies StartTenant to allow secondary
tenants to interact with the protected timestamp subsystem using a
"real" protectedts.Provider provided the migration
EnableProtectedTimestampsForTenant has run.

For testing purposes, this change teaches the data-driven framework
two additional commands, protect and release.

Informs: #73727

Release note: None

75458: ui: move loading and error pages to below page settings r=xinhaoz a=xinhaoz

Fixes: #75247

Previously, the loading and error pages in the statement and txn
pages took up the entire page, including the settings header
used to issue new stats requests. This resulted in the user
being unable to re-issue a query after a query results in
a timeout, as we save the query parameters for re-use on
each request.

Release note (ui change): Loading and error pages are now below
page config in txns and stmts pages.

---------------
https://www.loom.com/share/e903ec74441140be98969c02ebbc7923

75475: distsender: fix ResumeSpan for Gets r=RaduBerinde a=RaduBerinde

The distsender is not fully implementing the ResumeSpan contract for
Get; the promise is that a nil span is set when the operation has
successfully completed.

A Get that reaches a kvserver but that is not executed has the
ResumeSpan set on the server side. But if the requests span multiple
ranges, a subset of the requests will not make it to any kvserver.

This change fills in the missing case and improves the
TestMultiRangeScanWithPagination test to allow running mixed
operations.

Fixes #74736.

Release note (bug fix): In particular cases, some queries that involve
a scan which returns many results and which includes lookups for
individual keys were not returning all results from the table.

75515: sql: de-flake inverted_filter_geospatial_dist r=cucaroach a=cucaroach

Fixes: #75497

Add a retry to a test flaking due to distsender span config cache issues.

Release note: None

75518: tracing: fix span use after release when parent and child finish concurrently r=andreimatei a=andreimatei

(Commit message identical to the commit above.)

Fixes #75495
Fixes #75414
Fixes #75334

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in e135d09 Jan 26, 2022