Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: opt some tests into aggressive stats checks #87106

Closed
wants to merge 5 commits into from

Conversation

tbg
Copy link
Member

@tbg tbg commented Aug 30, 2022

See full commit history. This PR teaches roachtest to pull an optional overriding Owner from errors encountered during tests, and then allows tests to opt into full consistency checks which, when failing, get assigned to the KV team.

Addresses #86649.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg
Copy link
Member Author

tbg commented Aug 30, 2022

Running

GCE_PROJECT=andrei-jepsen ./pkg/cmd/roachtest/roachstress.sh -c 1 -u 'clearrange/checks=false|schemachange/during/tpcc' -- --debug

to see what we get. We know we have some stats issues, so ideally one of them fires at will.

Here's the SQL for the check, helpful to try against clusters we have laying around anyway:

SELECT * FROM crdb_internal.check_consistency(false, '', '') as t
WHERE t.status NOT IN ('RANGE_CONSISTENT')

@tbg
Copy link
Member Author

tbg commented Aug 31, 2022

Hmm clearrange passed this time, and from the looks of it did run the check. Trying schema now. Maybe the crash I saw was #87167 and I didn't look closely enough.

@tbg
Copy link
Member Author

tbg commented Aug 31, 2022

No, it was an OOM (just different node). Definitely consistency check related ;-)

image

Before we dig too deeply into that one probably worth rebasing on top of #86883. It's possible that we're just failing each check here, abandoning the follower computations, moving on, and thus overloading things.

@tbg
Copy link
Member Author

tbg commented Aug 31, 2022

@tbg
Copy link
Member Author

tbg commented Aug 31, 2022

Added a commit that turns off requesting diffs (not sure why I turned that on; also not sure if it really does anything but wither way we want that off) and lets the SQL generator stream results back. With that, I get some rows from the schema test (the test failed due to OOM on previous SHA, but now I'm using the dataset cold to run the tests). It does get through a few ranges but then deterministically hits

ERROR: computing own checksum: rpc error: code = Unknown desc = no checksum found (ID = 7bcdae69-5e65-453a-92d6-8f5760498adf)

Tomorrow I'll adjust the generator to print those errors in the result set, so that it can continue to iterate. We already have the RANGE_INDETERMINATE state for that.

@tbg tbg force-pushed the roachtest-stats-checks branch 3 times, most recently from 6a2041f to 3a72170 Compare August 31, 2022 22:11
tbg added a commit to tbg/cockroach that referenced this pull request Sep 23, 2022
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  cockroachdb#87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Oct 4, 2022
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  cockroachdb#87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None
craig bot pushed a commit that referenced this pull request Oct 4, 2022
88556: roachtest: use structured errors r=smg260 a=tbg

`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized on many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  #87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None


88614: sql/catalog/descs: remove allocations from hot path r=postamar a=ajwerner

The lookup by ID path gets called constantly. This was over 1% of objects allocated in some workloads. Here's a microbenchmark:

```
name                                                                    old time/op    new time/op    delta
ResolveExistingObject/CREATE_SCHEMA_sc;CREATE_TABLE_sc.foo_()sc.foo-16    2.62µs ± 1%    2.18µs ± 1%   -16.63%  (p=0.000 n=10+8)

name                                                                    old alloc/op   new alloc/op   delta
ResolveExistingObject/CREATE_SCHEMA_sc;CREATE_TABLE_sc.foo_()sc.foo-16      150B ± 0%        4B ± 0%   -97.33%  (p=0.001 n=8+9)

name                                                                    old allocs/op  new allocs/op  delta
ResolveExistingObject/CREATE_SCHEMA_sc;CREATE_TABLE_sc.foo_()sc.foo-16      12.0 ± 0%       0.0       -100.00%  (p=0.000 n=10+10)
```

Release note: None

89226: kvserver: remove leftover code from the RaftAppliedIndexTerm migration r=erikgrinaker,nvanbenschoten a=sumeerbhola

The migration itself was already removed, but various supporting code still existed.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
This allows associating an owning team to an error. In the future
we can use this to guide test failure assignment.

Release note: None
@tbg tbg closed this Dec 7, 2022
smg260 pushed a commit to smg260/cockroach that referenced this pull request Dec 9, 2022
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  cockroachdb#87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None
smg260 pushed a commit to smg260/cockroach that referenced this pull request Jan 18, 2023
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  cockroachdb#87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None
smg260 pushed a commit to smg260/cockroach that referenced this pull request Jan 18, 2023
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  cockroachdb#87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants