Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.1: roachtest structured errors and ssh/scp retries #95448

Conversation

smg260
Copy link
Contributor

@smg260 smg260 commented Jan 18, 2023

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: test only change to make roachtests more resilient and easier to debug

renatolabs and others added 11 commits January 18, 2023 17:17
In 2019, TeamCity released a new "experimental UI" that changed not
only the look and feel of the app, but also the URLs where pages are
located. Notably, information about a build used to be under the
/viewLog.html page in the old UI, but was moved to
/buildConfiguration/{buildTypeID}/{buildID} in the new UI.

If the user enabled the new UI locally, TeamCity will automatically
redirect requests from the old URLs to the new ones; however, the URL
fragments are not preserved. This is quite bad for roachtest, since
the links generated in failed test reports relies on the fragments in
order to expand the relevant folders in a potentially long list of
artifacts.

This commit updates TeamCity links generated by roachtest when
reporting failed tests to always use the new UI. It has the drawback
of forcing the new UI on everyone, but the benefit of always expanding
the correct folder.

A new environment variable, `TC_BUILDTYPE_ID`, was added to TeamCity
builds to support this use case; it contains the ID of the
corresponding build type (also known as "build configuration").

Resolves: cockroachdb#80852.

Release note: None.
This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: cockroachdb#80799.

Release note: None.
…orted.

Some runtime values are probabilistically set or otherwise have
defaults which are not evident at the time we report issues.

This change returns the vm options associated with a created cluster,
from which we can extract the final computed values for such fields
including the type of filesystem and whether encryption is enabled.

Release justification: test-only change
Release note: none
…test.

No previous coverage exists for posting issues to github. This change
moves the relevant code into a separate source file with associated test.
External functions dealing with actual issue posting to github, and loading
the TEAMS.yaml are now injected to facilitate easier testing.

Release justification: test-only change
Release note: none
When a cluster is started with the `--skip-init` option, the caller
can run `roachprod init` at any time to initialize the
cluster. Unfortunately, the code used to initialize the cluster was
duplicated: one copy existed in the `start` path, and another in the
`init` path. Since the latter is used far less frequently, it had a
bug that went unnoticed: it hardcoded the first node index as `0`,
when node indices start at 1.

This commit fixes the issue by updating the constant and sharing code
between `init `and `start`.

Fixes cockroachdb#88226.

Release note: None
In cockroachdb#88514, the cluster start logic was refactored to reuse the same
code across `init` and `start`, fixing a bug in the former. However,
the refactoring overlooked the fact that we previously always set the
default cluster settings when there's more than one node in the
cluster.

This fixes that by setting the default cluster settings in that case;
one particularly important cluster setting is the license key,
necessary for some roachtests.

Fixes cockroachdb#88660.
Fixes cockroachdb#88665.
Fixes cockroachdb#88666.
Fixes cockroachdb#88710.

Release note: None
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  cockroachdb#87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None
`roachprod.Get` and `roachprod.Put` behaved differently when it comes
to logging of errors: the former would not log errors found while
running `scp` if there was a file backing the logger, while the latter
would.

However, there is no reason for this difference, and it was probably
an oversight introduced in an old, really large PR (cockroachdb#74223). This
commit makes behavior consistent in both calls and should allow us to
see `Get` errors in logs, especially important in `roachtest`
failures.

Epic: None.

Release note: None.
 SSH flakes to the test-eng team.

 Also abstracts away how issue name and team is overriden based on
 an issue category, which can be ClusterCreation, SSH, or Other

 Resolves issue cockroachdb#82398

 Release justification: test-only change
 Release note: None
This commit utilises cockroachdb/errors markers api to mark an SSH
error with exit code of 255. The test runner, when it is posting an
issue to github, can act according to which marker may be present
in t.mu.failures[0]. In this case, we override the owning team to
protect them from having to investigate what is likely a transient
SSH issue.

A test now has the concept of a failure[] instead of error[]. Each
failure contains the original `squashedErr` and an errors[],
which are all the errors passed via t.Failf/Errorf.

We can then preserve the relationship of multiple
errors to any particular failure within a test, and match on any given
reference error (like SSH flakes from above)

This moves us away from substring matching on error messages.

The test name is prepended to both cluster creation and ssh errors
to ease troubleshooting as all those issues will be bucketed
under either category respectively.

Release justification: test-only change
Release note: None
is used. This prevents OSX from prompting to allow incoming
network connections when running a roachtest from an ide or
 shell.

Release justification: test-only change
Release note: None
@blathers-crl
Copy link

blathers-crl bot commented Jan 18, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@smg260 smg260 force-pushed the backport22.1-81103-81845-87304-88514-88716-88556-89850-88492-91119-90451-92082-94361 branch from 9ef35bd to 1a01005 Compare January 18, 2023 17:50
@smg260 smg260 changed the title release-22.1: TODO release-22.1: roachtest structured errors and ssh/scp retries Jan 18, 2023
@smg260 smg260 force-pushed the backport22.1-81103-81845-87304-88514-88716-88556-89850-88492-91119-90451-92082-94361 branch from 1a01005 to b44b6be Compare January 18, 2023 19:08
Miral Gadani and others added 5 commits January 18, 2023 15:47
TestLint/TestRoachVet is a unit test which runs roachvet and greps various
messages from its output - the reverted message being one of them.

Release note: none.
Epic: none
Prior to this commit, the error's stack trace did not link back
to the caller of `ParallelE`. Now it does.

Release note: None
…try when

encountering an error of 255 (SSH). The retry could be exposed to
callers fairly simply, but this PR does not introduce that.

Today there appears the concept of a "roachprod" and a "command error",
the former of which denotes an issue in roachprod handling code, the
latter representing an error executing a command over SSH. This PR
preserves this behaviour and introduces an updated function signature
from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error).

RunResultDetails has been expanded to capture information about the
execution of a command, such as stderr/our, exit code, error, attempt
number. Any roachprod error will result in an error being returned, and set
in RunResultDetails.Err. Any command error would be only set in
RunResultDetails.Err with the returned error being nil. This allows callers to
differentiate between a roachprod and a command error, which existing code
depends on.

Retry handling code can use a function's returned RunResultDetails to
determine whether a retry should take place. The default retry
handling occurs on `RunResultDetails.RemoteExitStatus == 255`

Release note: None
Epic: CRDB-21386
Any error returned is assumed to be retryable (in contrast to SSH where
by default we look for an exit code of 255)

Release note: none
Epic: CRDB-21386
Errors were reported like this:

> (test_impl.go:291).Fatal: pq: invalid syntax @1

but that line is the impl of `Fatal`. We want to walk up one more frame.

Epic: none
Release note: None
@smg260 smg260 force-pushed the backport22.1-81103-81845-87304-88514-88716-88556-89850-88492-91119-90451-92082-94361 branch from b44b6be to f7c1cce Compare January 18, 2023 20:47
@smg260 smg260 marked this pull request as ready for review January 18, 2023 20:55
@smg260 smg260 requested review from a team as code owners January 18, 2023 20:55
@smg260 smg260 requested review from herkolategan, renatolabs and srosenberg and removed request for a team January 18, 2023 20:55
Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the backport! This will keep 22.1 ssh-related flakes under control.

@smg260
Copy link
Contributor Author

smg260 commented Jan 20, 2023

Nightly roachtest on this branch

@smg260 smg260 merged commit 59f807e into cockroachdb:release-22.1 Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants