-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ccl/streamingccl/streamproducer: TestReplicationStreamTenant failed #75334
Comments
see #75425 (comment) |
ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ b95c51e44e62a777e068281fed5f3c384f6830e8:
Help
See also: How To Investigate a Go Test Failure (internal)
|
ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ 1c4b0ddb3c29305db08f9ae848a6c0d0b190f620:
Help
See also: How To Investigate a Go Test Failure (internal)
|
…urrently Before this patch, it was possible for a span to be released to the sync.Pool and reused while the tracing code was still holding a reference to it. This was because a span could maintain unaccounted references to its children even after those children were released. And those reference could be inadvertently used, causing a data race and trace corruption. This race had two parts: 1) s.mu.openChildren could still hold references to children after s.finish() terminated. Those references were unaccounted for, so those children could be reused if they finish. This would be fine (or at least benign) by itself (since a finished span should generally not be accessed), if it wasn't for 2). 2) s' recording can be collected after s.Finish() ran, and that would call into s' child. In order to trigger the bad behavior, I believe you need a child and a parent to finish() concurrently, and a chain of 4 spans in parent-child relationships. Lets call the spans ggp, gp, p, c (g=grand, p=parent, c=child). The race scenario goes like this: 1. gp.Finish() is called; gp.mu.finished is set early. 2. p.Finish() is called; p.mu.finished is set early. 3. c.Finish() is called. c calls into p (p.childFinished(c)), asking p to remove c from c.mu.openChildren, and to collect its recording if it wants to. This call is a no-op because p.mu.finished is already set. 4. p.Finish() terminates. c remains part of c.mu.openChildren, but the corresponding reference count on c is decremented. Similarly, p remains part of gp.mu.openChildren. 5. c.Finish() terminates. c's reference count drops to zero (because p no longer counts a reference to it). c is release to the pool, and reused. 6. gp.Finish() resumes and calls into ggp (ggp.childFinished(gp)). ggp collects gp's recording, and recursively that calls into p, and then c. This is a data race, since c was already reused. To fix this, we have p.childFinished(c) no longer automatically be a no-op when p is already finished. Instead, if c is still part of c.mu.openChildren, it will be removed. This way, a child will not be reused while the parent points to it. Also, defensively, we wipe c.mu.openChildren at the end of c.Finish(). Release note: None Fixes cockroachdb#75495 Fixes cockroachdb#75414 Fixes cockroachdb#75334
…urrently Before this patch, it was possible for a span to be released to the sync.Pool and reused while the tracing code was still holding a reference to it. This was because a span could maintain unaccounted references to its children even after those children were released. And those reference could be inadvertently used, causing a data race and trace corruption. This race had two parts: 1) s.mu.openChildren could still hold references to children after s.finish() terminated. Those references were unaccounted for, so those children could be reused if they finish. This would be fine (or at least benign) by itself (since a finished span should generally not be accessed), if it wasn't for 2). 2) s' recording can be collected after s.Finish() ran, and that would call into s' child. In order to trigger the bad behavior, I believe you need a child and a parent to finish() concurrently, and a chain of 4 spans in parent-child relationships. Lets call the spans ggp, gp, p, c (g=grand, p=parent, c=child). The race scenario goes like this: 1. gp.Finish() is called; gp.mu.finished is set early. 2. p.Finish() is called; p.mu.finished is set early. 3. c.Finish() is called. c calls into p (p.childFinished(c)), asking p to remove c from c.mu.openChildren, and to collect its recording if it wants to. This call is a no-op because p.mu.finished is already set. 4. p.Finish() terminates. c remains part of c.mu.openChildren, but the corresponding reference count on c is decremented. Similarly, p remains part of gp.mu.openChildren. 5. c.Finish() terminates. c's reference count drops to zero (because p no longer counts a reference to it). c is release to the pool, and reused. 6. gp.Finish() resumes and calls into ggp (ggp.childFinished(gp)). ggp collects gp's recording, and recursively that calls into p, and then c. This is a data race, since c was already reused. To fix this, we have p.childFinished(c) no longer automatically be a no-op when p is already finished. Instead, if c is still part of c.mu.openChildren, it will be removed. This way, when c.finish() completes, we know that c is not reference any more by the parent. We also wipe p.mu.openChildren at the end of p.Finish(). This way, if a child c calls p.childFinish(c) afterwards, that call will be short-circuited and c's recording will not be needlessly collected. Release note: None Fixes cockroachdb#75495 Fixes cockroachdb#75414 Fixes cockroachdb#75334
74803: spanconfigsqltranslator: populate protected_timestamps in SpanConfig r=irfansharif,arulajmani a=adityamaru This change teaches the SQLTranslator to hydrate the SpanConfigs for a table with protected timestamps that apply to that table. Concretely, this change initializes a spanconfig.ProtectedTimestampStateReader in the txn in which the translation is taking place, thereby providing a transactional view of the system.protected_ts_records table. After generating the span configurations based on the zone configurations that apply to the table, we hydrate the newly introduced protected_timestamps field on each span configuration with all the protected timestamps that apply to this table. This includes protected timestamp records that directly target this table, as well as records targetting the table's parent database. This information is obtained from the ProtectedTimestampStateReader mentioned above. Additionally, this change modifies StartTenant to allow secondary tenants to interact with the protected timestamp subsystem using a "real" protectedts.Provider provided the migration EnableProtectedTimestampsForTenant has run. For testing purposes, this change teaches the data driven framework of two additional commands protect and release. Informs: #73727 Release note: None 75458: ui: move loading and error pages to below page settings r=xinhaoz a=xinhaoz Fixes: #75247 Previously, the loading and error pages in the statement and txn pages took up the entire page, including the settings header used to issue new stats requests. This resulted in the user being unable to re-issue a query after a query results in a timeout, as we save the query parameters for re-use on each request. Release note (ui change): Loading and error pages are now below page config in txns and stmts pages. --------------- https://www.loom.com/share/e903ec74441140be98969c02ebbc7923 75475: distsender: fix ResumeSpan for Gets r=RaduBerinde a=RaduBerinde The distsender is not fully implementing the ResumeSpan contract for Get; the promise is that a nil span is set when the operation has successfully completed. A Get that reaches a kvserver but that is not executed has the ResumeSpan set on the server side. But if the requests span multiple ranges, a subset of the requests will not make it to any kvserver. This change fills in the missing case and improves the TestMultiRangeScanWithPagination test to allow running mixed operations. Fixes #74736. Release note (bug fix): In particular cases, some queries that involve a scan which returns many results and which includes lookups for individual keys were not returning all results from the table. 75515: sql: de-flake inverted_filter_geospatial_dist r=cucaroach a=cucaroach Fixes: #75497 Add a retry to a test flaking due to distsender span config cache issues. Release note: None 75518: tracing: fix span use after release when parent and child finish concurrently r=andreimatei a=andreimatei Before this patch, it was possible for a span to be released to the sync.Pool and reused while the tracing code was still holding a reference to it. This was because a span could maintain unaccounted references to its children even after those children were released. And those reference could be inadvertently used, causing a data race and trace corruption. This race had two parts: 1) s.mu.openChildren could still hold references to children after s.finish() terminated. Those references were unaccounted for, so those children could be reused if they finish. This would be fine (or at least benign) by itself (since a finished span should generally not be accessed), if it wasn't for 2). 2) s' recording can be collected after s.Finish() ran, and that would call into s' child. In order to trigger the bad behavior, I believe you need a child and a parent to finish() concurrently, and a chain of 4 spans in parent-child relationships. Lets call the spans ggp, gp, p, c (g=grand, p=parent, c=child). The race scenario goes like this: 1. gp.Finish() is called; gp.mu.finished is set early. 2. p.Finish() is called; p.mu.finished is set early. 3. c.Finish() is called. c calls into p (p.childFinished(c)), asking p to remove c from c.mu.openChildren, and to collect its recording if it wants to. This call is a no-op because p.mu.finished is already set. 4. p.Finish() terminates. c remains part of c.mu.openChildren, but the corresponding reference count on c is decremented. Similarly, p remains part of gp.mu.openChildren. 5. c.Finish() terminates. c's reference count drops to zero (because p no longer counts a reference to it). c is release to the pool, and reused. 6. gp.Finish() resumes and calls into ggp (ggp.childFinished(gp)). ggp collects gp's recording, and recursively that calls into p, and then c. This is a data race, since c was already reused. To fix this, we have p.childFinished(c) no longer automatically be a no-op when p is already finished. Instead, if c is still part of c.mu.openChildren, it will be removed. This way, when c.finish() completes, we know that c is not reference any more by the parent. We also wipe p.mu.openChildren at the end of p.Finish(). This way, if a child c calls p.childFinish(c) afterwards, that call will be short-circuited and c's recording will not be needlessly collected. Release note: None Fixes #75495 Fixes #75414 Fixes #75334 Co-authored-by: Aditya Maru <adityamaru@gmail.com> Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com> Co-authored-by: Radu Berinde <radu@cockroachlabs.com> Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
ccl/streamingccl/streamproducer.TestReplicationStreamTenant failed with artifacts on master @ e1068d77afbd39b162978281c9da7cbea49c1c3a:
Help
See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:
This test on roachdash | Improve this report!
The text was updated successfully, but these errors were encountered: