Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.1: roachtest structured errors and ssh/scp retries #95448

Commits on Jan 18, 2023

  1. roachtest: update TeamCity links to the new UI.

    In 2019, TeamCity released a new "experimental UI" that changed not
    only the look and feel of the app, but also the URLs where pages are
    located. Notably, information about a build used to be under the
    /viewLog.html page in the old UI, but was moved to
    /buildConfiguration/{buildTypeID}/{buildID} in the new UI.
    
    If the user enabled the new UI locally, TeamCity will automatically
    redirect requests from the old URLs to the new ones; however, the URL
    fragments are not preserved. This is quite bad for roachtest, since
    the links generated in failed test reports relies on the fragments in
    order to expand the relevant folders in a potentially long list of
    artifacts.
    
    This commit updates TeamCity links generated by roachtest when
    reporting failed tests to always use the new UI. It has the drawback
    of forcing the new UI on everyone, but the benefit of always expanding
    the correct folder.
    
    A new environment variable, `TC_BUILDTYPE_ID`, was added to TeamCity
    builds to support this use case; it contains the ID of the
    corresponding build type (also known as "build configuration").
    
    Resolves: cockroachdb#80852.
    
    Release note: None.
    renatolabs authored and Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    d562115 View commit details
    Browse the repository at this point in the history
  2. roachtest: include some cluster configs in test failure reports

    This commit changes the 'Parameters' section of a test
    failure. Previously, it would only include the contents of the `TAGS`
    and `GOFLAGS` environment variables. It now adds cluster configuration
    values obtained from the `ClusterSpec`. These parameters are prefixed
    with `ROACHTEST` (e.g., `ROACHTEST_cloud`).
    
    For now, just the cloud name, number of CPUs and SSDs are included. In
    the future, we can expand this work to support other relevant cluster
    configuration values as needed to debug test failures.
    
    Resolves: cockroachdb#80799.
    
    Release note: None.
    renatolabs authored and Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    4469a5b View commit details
    Browse the repository at this point in the history
  3. roachtest: Return vm createoptions to allow computed values to be rep…

    …orted.
    
    Some runtime values are probabilistically set or otherwise have
    defaults which are not evident at the time we report issues.
    
    This change returns the vm options associated with a created cluster,
    from which we can extract the final computed values for such fields
    including the type of filesystem and whether encryption is enabled.
    
    Release justification: test-only change
    Release note: none
    Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    14dde64 View commit details
    Browse the repository at this point in the history
  4. roachtest: Refactor github posting to separate source file, add unit …

    …test.
    
    No previous coverage exists for posting issues to github. This change
    moves the relevant code into a separate source file with associated test.
    External functions dealing with actual issue posting to github, and loading
    the TEAMS.yaml are now injected to facilitate easier testing.
    
    Release justification: test-only change
    Release note: none
    Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    fa8af54 View commit details
    Browse the repository at this point in the history
  5. roachprod: fix roachprod init.

    When a cluster is started with the `--skip-init` option, the caller
    can run `roachprod init` at any time to initialize the
    cluster. Unfortunately, the code used to initialize the cluster was
    duplicated: one copy existed in the `start` path, and another in the
    `init` path. Since the latter is used far less frequently, it had a
    bug that went unnoticed: it hardcoded the first node index as `0`,
    when node indices start at 1.
    
    This commit fixes the issue by updating the constant and sharing code
    between `init `and `start`.
    
    Fixes cockroachdb#88226.
    
    Release note: None
    renatolabs authored and Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    2ecd18f View commit details
    Browse the repository at this point in the history
  6. roachprod: set default cluster settings when starting

    In cockroachdb#88514, the cluster start logic was refactored to reuse the same
    code across `init` and `start`, fixing a bug in the former. However,
    the refactoring overlooked the fact that we previously always set the
    default cluster settings when there's more than one node in the
    cluster.
    
    This fixes that by setting the default cluster settings in that case;
    one particularly important cluster setting is the license key,
    necessary for some roachtests.
    
    Fixes cockroachdb#88660.
    Fixes cockroachdb#88665.
    Fixes cockroachdb#88666.
    Fixes cockroachdb#88710.
    
    Release note: None
    renatolabs authored and Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    9c0b17d View commit details
    Browse the repository at this point in the history
  7. roachtest: use structured errors

    `roachtest` predates `cockroachdb/errors` and as a result so far hasn't
    capitalized many improvements. Today, roachtest errors are often noisy.
    
    This commit adopts `cockroachdb/errors` and reaps some of the immediate
    rewards, while making related improvements that would also have been
    possible without the adoption.
    
    The main user-facing change is that error messages in the output are now
    a lot more concise; these would previously - sometimes - include the
    entire stack trace. Now they contain only the topmost stack record of
    the innermost error:
    
    ```
    (test_test.go:129).func3: first error
    (test_test.go:130).func3: second error
    ```
    
    The full error continues to be logged, but we write it to files in
    the artifacts, where they can be inspected just in case. This now
    happens unconditionally for all errors, whereas the old code only
    logged the stacks if the error was reported in a certain way.
    
    Internally, the reorganization has also considerably simplified
    roachtest. Stack frame offset tracking has been removed, since
    `cockroachdb/errors` already handles it. Custom rendering code
    was similarly significantly trimmed down.
    
    In making this change, I opted to always have `roachtest` create
    loggers backed by real files. This was previously elided in tests,
    but this would have caused extra conditionals. It's better to
    standardize on the way in which `roachtest` is always used in
    actual invocations.
    
    Looking ahead, structured errors open a few avenues:
    
    - can assign the owner based on the type of failure. For example,
      cockroachdb#87106 wants consistency check failures to always go to the KV
      team, regardless of which test's cluster was being checked.
    
      Failures during an IMPORT/RESTORE (common across all tests) could be
      routed to the Disaster Recovery team by default (assuming we provide a
      wrapper around these operations that all tests use and which does this
      wrapping).
    - Similarly, SSH failures can be special cased via a marking error
      and can be directed away from the owning team, which can't do
      anything about them anyway (cockroachdb#82398).
    - We can conceivably start grouping failure modes by "error signature".
      That is, errors which have a "comparable" chain of errors (e.g. same
      types, and within formatted errors the same format string). Issues
      could then be reused only for compatible error signatures.
    
    Release note: None
    tbg authored and Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    6e271d9 View commit details
    Browse the repository at this point in the history
  8. roachprod: log Get failures in roachtests

    `roachprod.Get` and `roachprod.Put` behaved differently when it comes
    to logging of errors: the former would not log errors found while
    running `scp` if there was a file backing the logger, while the latter
    would.
    
    However, there is no reason for this difference, and it was probably
    an oversight introduced in an old, really large PR (cockroachdb#74223). This
    commit makes behavior consistent in both calls and should allow us to
    see `Get` errors in logs, especially important in `roachtest`
    failures.
    
    Epic: None.
    
    Release note: None.
    renatolabs authored and Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    9920a93 View commit details
    Browse the repository at this point in the history
  9. roachtest: Reassign roachtest failures which may have been caused by

     SSH flakes to the test-eng team.
    
     Also abstracts away how issue name and team is overriden based on
     an issue category, which can be ClusterCreation, SSH, or Other
    
     Resolves issue cockroachdb#82398
    
     Release justification: test-only change
     Release note: None
    Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    3badefb View commit details
    Browse the repository at this point in the history
  10. roachtest: Utilise structured errors for detection of SSH flakes.

    This commit utilises cockroachdb/errors markers api to mark an SSH
    error with exit code of 255. The test runner, when it is posting an
    issue to github, can act according to which marker may be present
    in t.mu.failures[0]. In this case, we override the owning team to
    protect them from having to investigate what is likely a transient
    SSH issue.
    
    A test now has the concept of a failure[] instead of error[]. Each
    failure contains the original `squashedErr` and an errors[],
    which are all the errors passed via t.Failf/Errorf.
    
    We can then preserve the relationship of multiple
    errors to any particular failure within a test, and match on any given
    reference error (like SSH flakes from above)
    
    This moves us away from substring matching on error messages.
    
    The test name is prepended to both cluster creation and ssh errors
    to ease troubleshooting as all those issues will be bucketed
    under either category respectively.
    
    Release justification: test-only change
    Release note: None
    Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    4628ef8 View commit details
    Browse the repository at this point in the history
  11. roachtest: Set localhost as the http listener when --local flag

    is used. This prevents OSX from prompting to allow incoming
    network connections when running a roachtest from an ide or
     shell.
    
    Release justification: test-only change
    Release note: None
    Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    e165d3c View commit details
    Browse the repository at this point in the history
  12. This change reverts an update to the fmtsafe.go linter output message.

    TestLint/TestRoachVet is a unit test which runs roachvet and greps various
    messages from its output - the reverted message being one of them.
    
    Release note: none.
    Epic: none
    Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    d974dc9 View commit details
    Browse the repository at this point in the history
  13. roachprod: improve error in ParallelE

    Prior to this commit, the error's stack trace did not link back
    to the caller of `ParallelE`. Now it does.
    
    Release note: None
    tbg authored and Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    98f72f9 View commit details
    Browse the repository at this point in the history
  14. roachtest: This change introduces a transparent to caller, default re…

    …try when
    
    encountering an error of 255 (SSH). The retry could be exposed to
    callers fairly simply, but this PR does not introduce that.
    
    Today there appears the concept of a "roachprod" and a "command error",
    the former of which denotes an issue in roachprod handling code, the
    latter representing an error executing a command over SSH. This PR
    preserves this behaviour and introduces an updated function signature
    from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error).
    
    RunResultDetails has been expanded to capture information about the
    execution of a command, such as stderr/our, exit code, error, attempt
    number. Any roachprod error will result in an error being returned, and set
    in RunResultDetails.Err. Any command error would be only set in
    RunResultDetails.Err with the returned error being nil. This allows callers to
    differentiate between a roachprod and a command error, which existing code
    depends on.
    
    Retry handling code can use a function's returned RunResultDetails to
    determine whether a retry should take place. The default retry
    handling occurs on `RunResultDetails.RemoteExitStatus == 255`
    
    Release note: None
    Epic: CRDB-21386
    Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    958e79f View commit details
    Browse the repository at this point in the history
  15. roachtest: Retry failed SCP attempts in the same way as SSH commands.

    Any error returned is assumed to be retryable (in contrast to SSH where
    by default we look for an exit code of 255)
    
    Release note: none
    Epic: CRDB-21386
    Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    ddbfcd3 View commit details
    Browse the repository at this point in the history
  16. roachtest: fix error depth

    Errors were reported like this:
    
    > (test_impl.go:291).Fatal: pq: invalid syntax @1
    
    but that line is the impl of `Fatal`. We want to walk up one more frame.
    
    Epic: none
    Release note: None
    tbg authored and Miral Gadani committed Jan 18, 2023
    Configuration menu
    Copy the full SHA
    f7c1cce View commit details
    Browse the repository at this point in the history