roachtest: opt some tests into aggressive stats checks #87106

tbg · 2022-08-30T12:58:01Z

See full commit history. This PR teaches roachtest to pull an optional overriding Owner from errors encountered during tests, and then allows tests to opt into full consistency checks which, when failing, get assigned to the KV team.

Addresses #86649.

Release note: None

cockroach-teamcity · 2022-08-30T12:58:13Z

This change is

tbg · 2022-08-30T13:09:04Z

Running

GCE_PROJECT=andrei-jepsen ./pkg/cmd/roachtest/roachstress.sh -c 1 -u 'clearrange/checks=false|schemachange/during/tpcc' -- --debug

to see what we get. We know we have some stats issues, so ideally one of them fires at will.

Here's the SQL for the check, helpful to try against clusters we have lying around anyway:

SELECT * FROM crdb_internal.check_consistency(false, '', '') as t
WHERE t.status NOT IN ('RANGE_CONSISTENT')

tbg · 2022-08-31T13:45:47Z

Hmm clearrange passed this time, and from the looks of it did run the check. Trying schema now. Maybe the crash I saw was #87167 and I didn't look closely enough.

tbg · 2022-08-31T13:49:37Z

No, it was an OOM (just different node). Definitely consistency check related ;-)

Before we dig too deeply into that one, it's probably worth rebasing on top of #86883. It's possible that we're just failing each check here, abandoning the follower computations, moving on, and thus overloading things.

tbg · 2022-08-31T13:51:35Z

^--- logs dir from that node https://drive.google.com/file/d/1P0Nm_ypZBLHW0tgLbHxP7yNLSdJNaZ2G/view?usp=sharing

tbg · 2022-08-31T16:02:12Z

Added a commit that turns off requesting diffs (not sure why I turned that on; also not sure if it really does anything, but either way we want that off) and lets the SQL generator stream results back. With that, I get some rows from the schema test (the test failed due to OOM on the previous SHA, but now I'm using the dataset cold to run the tests). It does get through a few ranges but then deterministically hits

ERROR: computing own checksum: rpc error: code = Unknown desc = no checksum found (ID = 7bcdae69-5e65-453a-92d6-8f5760498adf)

Tomorrow I'll adjust the generator to print those errors in the result set, so that it can continue to iterate. We already have the RANGE_INDETERMINATE state for that.

`roachtest` predates `cockroachdb/errors` and as a result so far hasn't capitalized on many improvements. Today, roachtest errors are often noisy. This commit adopts `cockroachdb/errors` and reaps some of the immediate rewards, while making related improvements that would also have been possible without the adoption.

The main user-facing change is that error messages in the output are now a lot more concise; these would previously - sometimes - include the entire stack trace. Now they contain only the topmost stack record of the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in the artifacts, where they can be inspected just in case. This now happens unconditionally for all errors, whereas the old code only logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified roachtest. Stack frame offset tracking has been removed, since `cockroachdb/errors` already handles it. Custom rendering code was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create loggers backed by real files. This was previously elided in tests, but this would have caused extra conditionals. It's better to standardize on the way in which `roachtest` is always used in actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example, cockroachdb#87106 wants consistency check failures to always go to the KV team, regardless of which test's cluster was being checked. Failures during an IMPORT/RESTORE (common across all tests) could be routed to the Disaster Recovery team by default (assuming we provide a wrapper around these operations that all tests use and which does this wrapping).
- Similarly, SSH failures can be special cased via a marking error and can be directed away from the owning team, which can't do anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature". That is, errors which have a "comparable" chain of errors (e.g. same types, and within formatted errors the same format string). Issues could then be reused only for compatible error signatures.

Release note: None

88556: roachtest: use structured errors r=smg260 a=tbg

`roachtest` predates `cockroachdb/errors` and as a result so far hasn't capitalized on many improvements. Today, roachtest errors are often noisy. This commit adopts `cockroachdb/errors` and reaps some of the immediate rewards, while making related improvements that would also have been possible without the adoption.

The main user-facing change is that error messages in the output are now a lot more concise; these would previously - sometimes - include the entire stack trace. Now they contain only the topmost stack record of the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in the artifacts, where they can be inspected just in case. This now happens unconditionally for all errors, whereas the old code only logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified roachtest. Stack frame offset tracking has been removed, since `cockroachdb/errors` already handles it. Custom rendering code was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create loggers backed by real files. This was previously elided in tests, but this would have caused extra conditionals. It's better to standardize on the way in which `roachtest` is always used in actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example, #87106 wants consistency check failures to always go to the KV team, regardless of which test's cluster was being checked. Failures during an IMPORT/RESTORE (common across all tests) could be routed to the Disaster Recovery team by default (assuming we provide a wrapper around these operations that all tests use and which does this wrapping).
- Similarly, SSH failures can be special cased via a marking error and can be directed away from the owning team, which can't do anything about them anyway (#82398).
- We can conceivably start grouping failure modes by "error signature". That is, errors which have a "comparable" chain of errors (e.g. same types, and within formatted errors the same format string). Issues could then be reused only for compatible error signatures.

Release note: None

88614: sql/catalog/descs: remove allocations from hot path r=postamar a=ajwerner

The lookup by ID path gets called constantly. This was over 1% of objects allocated in some workloads. Here's a microbenchmark:

```
name                                                                     old time/op    new time/op    delta
ResolveExistingObject/CREATE_SCHEMA_sc;CREATE_TABLE_sc.foo_()sc.foo-16   2.62µs ± 1%    2.18µs ± 1%    -16.63%  (p=0.000 n=10+8)

name                                                                     old alloc/op   new alloc/op   delta
ResolveExistingObject/CREATE_SCHEMA_sc;CREATE_TABLE_sc.foo_()sc.foo-16   150B ± 0%      4B ± 0%        -97.33%  (p=0.001 n=8+9)

name                                                                     old allocs/op  new allocs/op  delta
ResolveExistingObject/CREATE_SCHEMA_sc;CREATE_TABLE_sc.foo_()sc.foo-16   12.0 ± 0%      0.0            -100.00%  (p=0.000 n=10+10)
```

Release note: None

89226: kvserver: remove leftover code from the RaftAppliedIndexTerm migration r=erikgrinaker,nvanbenschoten a=sumeerbhola

The migration itself was already removed, but various supporting code still existed.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>

tbg mentioned this pull request Aug 30, 2022

roachtest: enable MVCC stats assertions more aggressively #86649

Open

tbg force-pushed the roachtest-stats-checks branch from 501aeb9 to 1740f4b Compare August 31, 2022 08:51

tbg mentioned this pull request Aug 31, 2022

kvserver: DistSender stack overflow crash #87167

Closed

tbg force-pushed the roachtest-stats-checks branch from 1740f4b to 092fc88 Compare August 31, 2022 15:51

tbg force-pushed the roachtest-stats-checks branch 3 times, most recently from 6a2041f to 3a72170 Compare August 31, 2022 22:11

pav-kv mentioned this pull request Sep 6, 2022

kvserver: remove consistency checker diffs #21128

Closed

tbg force-pushed the roachtest-stats-checks branch from 3a72170 to ff43619 Compare September 20, 2022 07:25

tbg mentioned this pull request Sep 23, 2022

roachtest: use structured errors #88556

Merged

tbg added 5 commits November 2, 2022 05:35

registry: allow tagging errors with Owners

6a20c9b

This allows associating an owning team to an error. In the future we can use this to guide test failure assignment. Release note: None

roachtest: allow failures to declare owning team

7196ceb

Release note: None

roachtest: allow tests to opt into full consistency checks

b04fa15

Release note: None

roachtest: opt into full consistency checks in schemachange/during/tp…

527bca6

…cc, clearrange Release note: None

wip

0d335de

tbg force-pushed the roachtest-stats-checks branch from ff43619 to 0d335de Compare November 2, 2022 05:08

tbg closed this Dec 7, 2022

smg260 mentioned this pull request Dec 9, 2022

roachtest: use structured errors #93328

Merged
