sql/opt/exec/execbuilder: TestExecBuild failed #74933
sql/opt/exec/execbuilder.TestExecBuild failed with artifacts on master @ 365b4da8bd02c06ee59d2130a56dec74ffc9ce21:
See also: How To Investigate a Go Test Failure (internal)
The one thing that's different between the expected and actual is cut off above. Expected is …
Both failures above occurred in …
Reproduced pretty readily like this:
Trying to run just one without stress using the correct rand seed doesn't, though. Something stress-specific about it?
sql/opt/exec/execbuilder.TestExecBuild failed with artifacts on master @ 5aefc070d1e7f5bfd860eae24dea2e7726dc6d8c:
sql/opt/exec/execbuilder.TestExecBuild failed with artifacts on master @ d1eafdf56b42ca09a169c199916917a144c2c412:
Dropping in to say that I think this flaked in CI more frequently after #73876 (going off the test history alone).
Well, I've chased it down to the span resolver sometimes putting the table reader on the gateway node, causing the plan to be local instead of full. It's very flaky though, and it only seems to happen when the machine is overloaded. I haven't figured out more than that, and don't know if it has anything to do with the span config changes.
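To make the mechanism concrete, here's a minimal, self-contained sketch of the decision being described (the type and function names are assumptions for illustration, not CockroachDB's actual planner API):

```go
package main

import "fmt"

// SpanPartition is a stand-in for the planner's span-to-node assignment.
// The real CockroachDB types differ; this is illustration only.
type SpanPartition struct {
	NodeID int // node on which the partition's table reader is placed
}

// distribution mirrors the decision described above: if the span
// resolver puts every table reader on the gateway node, the plan is
// "local"; otherwise it is distributed ("full" in the test's output).
func distribution(parts []SpanPartition, gatewayNodeID int) string {
	for _, p := range parts {
		if p.NodeID != gatewayNodeID {
			return "full"
		}
	}
	return "local"
}

func main() {
	// A stale range cache can resolve every span back to the gateway
	// (node 1 here), flipping an expected "full" plan to "local".
	fmt.Println(distribution([]SpanPartition{{NodeID: 1}, {NodeID: 1}}, 1)) // local
	fmt.Println(distribution([]SpanPartition{{NodeID: 1}, {NodeID: 3}}, 1)) // full
}
```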
I think @irfansharif is right; I can't get a failure if I do this:
That said, I will create a new bug for looking at these tests; they are obviously confused, i.e.:
@irfansharif let me know if there's anything I can do to help with this; I can repro it in ~1m.
Repro steps? I probably won't get to it until next week. I should clarify that a lot of the fallout from enabling span configs has been around synchrony assumptions made in test code that never held, but were more easily masked over previously. I'm not sure if that's what's happening here. If you have any working theories, I'll happily take them.
If I bump up -p from 4 to 12, it usually hits on the first run. It doesn't repro unless you run all of the opt tests; just the 5node or just the inverted_filter_geospatial_dist tests don't fail. It's a pretty simple test, and it confirms the index has ranges on the right nodes. If you have a suspicion on where the randomness might be coming from, I'd be happy to dig deeper.
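For reference, that bumped-up invocation would presumably look like this, reusing the stress command quoted later in the thread with only the -p value changed:

```
dev test pkg/sql/opt/exec/execbuilder \
  -f 'TestExecBuild/5node' -v --show-logs \
  --stress --stress-args '-p 12'
```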
sql/opt/exec/execbuilder.TestExecBuild failed with artifacts on master @ 506412ffdb6d0f6d4952d2599b81cb9294a123a1:
sql/opt/exec/execbuilder.TestExecBuild failed with artifacts on master @ 03556f98c29c65e022d2c29cb03c99d0dc1b31b0:
Still poking blindly, but #75208 on the surface has similar symptoms (also possibly my fault).
Mind pointing me to the code where this happens? Or what goes into it? Maybe this guy? cockroach/pkg/sql/distsql_physical_planner.go, lines 1419 to 1421 in 059d5d8
Can confirm it easily repros on my GCE worker:
For posterity, here's the physical plan that the test expects, and here's the one it actually gets.
Fixes cockroachdb#74933. This test asserts on physical plans of statements after moving ranges around, i.e. whether distsql execution is "local" or "distributed". This is determined by:
- where the distsql planner places processor cores,
- which in turn is a function of the span resolver,
- which delegates to the embedded distsender,
- which operates off of a range cache.

The range cache, like the name suggests, can hold stale data. This test moved replicas of indexes around to induce distributed execution, but was not accounting for the caches holding onto stale data. In my test runs, every so often I was seeing stale range descriptors from before the relocate trip up the planner to generate a local execution, flaking the test. Fortunately for us, all we need to do is slap on a retry. To repro:

    # Took under a minute on my gceworker. Helped to trim down the test
    # file slightly.
    dev test pkg/sql/opt/exec/execbuilder \
      -f 'TestExecBuild/5node' -v --show-logs \
      --stress --stress-args '-p 4'

Release note: None
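A minimal sketch of the "slap on a retry" idea, assuming CockroachDB's testutils.SucceedsSoon retry helper (the helper name and assertion body below are illustrative, not the actual test change):

```go
package execbuilder_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/errors"
)

// retryUntilFullyDistributed is an illustrative helper: instead of
// asserting once on the plan (and flaking whenever the range cache is
// stale), retry until the cache refreshes and the planner produces the
// expected distributed plan.
func retryUntilFullyDistributed(t *testing.T, getPlanDistribution func() string) {
	testutils.SucceedsSoon(t, func() error {
		if d := getPlanDistribution(); d != "full" {
			return errors.Errorf("expected fully distributed plan, got %q", d)
		}
		return nil
	})
}
```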
75304: execbuilder: deflake TestExecBuild r=irfansharif a=irfansharif

Fixes #74933 (commit message above).

75436: kvserver: de-flake TestProtectedTimestamps r=irfansharif a=irfansharif

Fixes #75020. This test makes use of an "upsert-until-backpressure" primitive, where it continually writes blobs to a scratch table configured with a lower max range size (1<<18, down from the default of 512<<20) until it experiences replica backpressure. Before #73876 (when using the system config span), the splitting off of the scratch table's ranges happened near instantly after the creation of the table itself. This changed slightly with the span configs infrastructure, where there's more asynchrony built in and ranges may appear a while later.

The test previously started upserting as soon as it created the table, implicitly relying on the tight synchrony of already having the table's ranges carved out. This comes up when deciding to record a replica's largest previous max range size -- something we only do if the replica's current size is larger than the new limit (see (*Replica).SetSpanConfig). When we weren't upserting until the range applied the latest config, this was not possible. With span configs and their additional asynchrony, this assumption no longer held. It was then possible for the range to accept writes larger than the newest max range size, which unintentionally (for this test at least) triggers an escape hatch where we avoid backpressure when lowering a range's max size below its current size (#46863).

The failure is easy to repro. This test runs in ~8s, and with the following we were reliably seeing timeouts where the test was waiting for backpressure to kick in (but it never did because of the escape hatch):

    dev test pkg/kv/kvserver \
      -f TestProtectedTimestamps -v --show-logs \
      --timeout 300s --ignore-cache --count 10

De-flaking this test is simple -- we just wait for the table's replicas to apply the latest config before issuing writes to it.

Release note: None

75443: ui: removed formatting to statements on the details pages r=THardy98 a=THardy98

Previously, statements displayed on the statement/transaction/index details pages were formatted (formatting was added to allow for better readability of statements on those pages). However, statement queries from the frontend were noticeably slower due to this implementation. This change reverts the changes to statement formatting (updating the fixtures to show the non-formatted statements), but keeps the change that uses the statement ID as an identifier in the URL instead of the raw statement.

Release note (ui change): removed formatting of statements on the statement, transaction, and index details pages; the change replacing the raw statement with the statement ID in the URL remains.

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Thomas Hardy <thardy@cockroachlabs.com>
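For the TestProtectedTimestamps fix, a minimal sketch of "wait for the replicas to apply the latest config before writing" (again assuming testutils.SucceedsSoon; getMaxBytes is a hypothetical stand-in for looking up the replica and reading its configured max range size):

```go
package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/errors"
)

// waitForAppliedConfig blocks until the scratch table's replica reports
// the lowered max range size, i.e. until the latest span config has
// actually applied, so subsequent upserts can't slip in oversized
// writes that trigger the backpressure escape hatch.
func waitForAppliedConfig(t *testing.T, getMaxBytes func() int64) {
	const expMaxBytes = 1 << 18 // the test's lowered max range size
	testutils.SucceedsSoon(t, func() error {
		if got := getMaxBytes(); got != expMaxBytes {
			return errors.Errorf("expected max range size %d, got %d", expMaxBytes, got)
		}
		return nil
	})
}
```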
sql/opt/exec/execbuilder.TestExecBuild failed with artifacts on master @ 365b4da8bd02c06ee59d2130a56dec74ffc9ce21:
Parameters in this failure: