Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roachtest redirect SSH flakes to test-eng #88492

Merged
merged 3 commits into from
Oct 25, 2022

Conversation

smg260
Copy link
Contributor

@smg260 smg260 commented Sep 22, 2022

See second commit note at the bottom

This PR inspects the failure output of a roachtest, and if it sees an SSH_PROBLEM, overrides the owning team to test-eng when reporting the github issue.

Currently errors are classified as an SSH error by roachprod if the exit code is 255 with an accompanying message prefixed with SSH_PROBLEM [1]. The errors are stringified and saved into t.mu.output|failureMsg. Thus in the test_runner at the call site of issue posting, we can check t.mu.output for SSH_PROBLEM and override the team and issue name accordingly.

Resolves: #82398

Release justification: test-only change
Release note: none

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch 3 times, most recently from bebfd0c to e135b80 Compare September 22, 2022 19:35
@smg260 smg260 requested a review from tbg September 22, 2022 19:36
@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch from e135b80 to ba42dc5 Compare September 22, 2022 20:18
@smg260 smg260 marked this pull request as ready for review September 23, 2022 01:47
@smg260 smg260 requested a review from a team as a code owner September 23, 2022 01:47
@smg260 smg260 requested review from renatolabs and removed request for a team September 23, 2022 01:47
@tbg
Copy link
Member

tbg commented Sep 23, 2022

We can also build on top of #88556 for this, I have a DM going with Miral.

@tbg tbg removed their request for review September 26, 2022 07:25
@tbg
Copy link
Member

tbg commented Sep 26, 2022

@smg260 (correct me if I'm wrong please) is getting #88556 in first and then re-working this to build on top of that PR.

I'm unassigning for review so that it doesn't show up in my queue in the meantime.

@smg260
Copy link
Contributor Author

smg260 commented Sep 27, 2022

@tbg Yep. Marking as draft for now.

@smg260 smg260 marked this pull request as draft September 27, 2022 13:55
@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch from ba42dc5 to e7e4289 Compare October 4, 2022 14:29
@smg260
Copy link
Contributor Author

smg260 commented Oct 4, 2022

This is rebased on top of #88556 which introduces structured errors.

Detecting the SSH flake via substring still works and keeps this PR simple.

This might be a good place to introduce some retry logic for running the SSH commands

@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch from 09d6d85 to 70db102 Compare October 5, 2022 17:38
@srosenberg
Copy link
Member

Detecting the SSH flake via substring still works and keeps this PR simple.

A chance of a false positive (via substring) is probably very low, but structured errors should make it cleaner and pave the way for more cases. I am thinking of a switch statement on something like errCategory with cluster creation and ssh errors injecting the corresponding category at the site?

This might be a good place to introduce some retry logic for running the SSH commands

There is a separate issue to deal with SSH retries: #73542

@srosenberg
Copy link
Member

I recently spotted another issue... seeing this error message quite often during cluster teardown,

teardown: 16:53:25 cluster.go:1149: failed to fetch logs: cluster.Get: get logs failed

The above error message originates in [1]. The problem is that the actual error message is swallowed [2] because l.File is usually non-nil, when invoked from roachtest. (The logger is essentially the root logger and l.File is test.log.) l believe the same issue applies to Put.

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/install/cluster_synced.go#L2034
[2] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/install/cluster_synced.go#L2028

@renatolabs renatolabs marked this pull request as ready for review October 6, 2022 14:02
Copy link
Collaborator

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

structured errors should make it cleaner and pave the way for more cases

I agree with this, I think structured errors open the door for us to do things based on the error types. I think that's cleaner and also wouldn't break if someone changed how errors.SSH is formatted.

As general food for thought for everyone: IMO, errors like cluster creation and SSH problems are better reported in a more concise way other than a GitHub issue. We just want to be notified about them, but there's little (nothing?) we can do about them other than acknowledge and close the issue when it's opened.

One idea is to post something on Slack (in the existing test-eng-ops) channel with a summary of a roachtest nightly run, including how many tests were skipped due to cluster creation errors, SSH flakes, etc.

pkg/cmd/roachtest/github.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/github.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/github.go Show resolved Hide resolved
@renatolabs
Copy link
Collaborator

@smg260 I pointed a couple of misplaced comments on #88556. Maybe we could fix those here?

@smg260
Copy link
Contributor Author

smg260 commented Oct 10, 2022

A chance of a false positive (via substring) is probably very low, but structured errors should make it cleaner and pave the way for more cases. I am thinking of a switch statement on something like errCategory with cluster creation and ssh errors injecting the corresponding category at the site?

That's definitely the goal, and my thoughts were to introduce that when we switch to cockroachdb/errors. We'd then be able to annotate errors using something like errors.Mark(err, SSHerr) or `errors.Mark(err, ClusterCreationErr) at the point we detect it. (tbg's idea).

So we can keep this PR tight in scope and address just the SSH flakes, or expand it to more generically handle different error types. The PR does at least currently constrain the logic to to determine the owning team in githib.go (as it should).

There is a separate issue to deal with SSH retries: #73542

👍 This was getting a bit gnarly to implement in my test branch

@smg260
Copy link
Contributor Author

smg260 commented Oct 10, 2022

As general food for thought for everyone: IMO, errors like cluster creation and SSH problems are better reported in a more concise way other than a GitHub issue. We just want to be notified about them, but there's little (nothing?) we can do about them other than acknowledge and close the issue when it's opened.

One idea is to post something on Slack (in the existing test-eng-ops) channel with a summary of a roachtest nightly run, including how many tests were skipped due to cluster creation errors, SSH flakes, etc.

I agree that some errors are better reported via other mediums such as Slack, but having the same error reported a shared github issue (or similar) provides a better system of record. Ultimately, where an issue is posting should be agnostic to the caller and the posting code could drive which channels are notified via some coniguration

@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch 2 times, most recently from 3ab798e to 7ae89fd Compare October 12, 2022 00:31
@smg260
Copy link
Contributor Author

smg260 commented Oct 12, 2022

The second commit switches to using errors/marker api to mark an SSH flake where it is detected in roachprod. This allows us to look for a marker exception and act on it accordingly.

Summary of changes:

  1. switch from []error to []failure in t.mu where failure is a struct containing all errors which were passed as args. The impetus for this was to preserve secondary+ exceptions which are passed into the t.Error/Fatal methods. Currently when t.Error(someErr) is invoked, a new error is created, moving someErr to a secondary error, out of the primary error chain. This makes is very hard to search for the marker which would have been inserted at someErr but is now not visible to errors.Is(..)
  2. Add ErrSSH255 and errClusterProvisioningFailed reference errors (the marker errors). We control what we post in the issue according to if we find these
  3. remove unused members of t.mu
  4. remove t.argsToErr in favour of errors.NewWithDepthf

@smg260 smg260 requested review from srosenberg and tbg October 12, 2022 00:32
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r2, 7 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/cmd/roachtest/github.go line 137 at r1 (raw file):

Previously, smg260 (Miral Gadani) wrote…

It probably makes sense to keep SSH flakes together, instead of having a new one created for each incidence.

We can prepend the test name to the message body like cluster creation does?

I agree that this isn't excessively pretty, but will it maybe get the job done just fine? For example, the cluster_creation messages mention the test name sufficiently prominently:

#89810

Not sure if this is the case here too (since there won't be a .Skip()) but the message also contains the test name right at the top: #90045 (comment)

We can also go through the refactor and add a title override that we can use instead of the test name. I'm just not sure it's worth the time.


pkg/cmd/roachtest/github.go line 40 at r2 (raw file):

	clusterCreationErr issueCategory = iota
	sshErr
	otherErr

tiny nit: since otherErr is the catch-all, it should be represented by the zero value.


pkg/cmd/roachtest/github.go line 162 at r3 (raw file):

	//TODO: perhaps remove category completely and move this to
	// issuePoster?

The poster has multiple callers and so I think this should continue to live in roachtest.


pkg/cmd/roachtest/test_impl.go line 45 at r3 (raw file):

	// errors are all the errors passed to a single invokation of
	// `addFailure`
	errors []error

Add some color here, why are we stashing these errors? It's because they might be structured and we want to preserve that (which fmt.Sprintf(...) wouldn't do). Maybe given an example that t.Fatalf("foo %s %s %s", "hello", err1, err2) would lead to []error{err1,err2}.


pkg/cmd/roachtest/test_impl.go line 280 at r3 (raw file):

// from a test's closure. The test runner itself should never call this.
func (t *testImpl) Fatal(args ...interface{}) {
	t.Fatalf("%v", args...)

This will print t.Fatal("hi") as ["hi"] which isn't what we want. I think you want something like

t.Fatalf(strings.Join(strings.Repeat(len(args), " %v"))[1:], args...)

pkg/cmd/roachtest/test_impl.go line 287 at r3 (raw file):

	errs := []error{errors.NewWithDepthf(1, format, args...)}
	errs = append(errs, collectErrors(args...)...)
	t.addFailure(newFailure(errs...))

The symmetry in newFailure is a little misleading then since the first error is special, how about

type failure struct {
  squashedErr error
  wrapped []error
}

or something like that?


pkg/cmd/roachtest/test_impl.go line 298 at r3 (raw file):

func (t *testImpl) Error(args ...interface{}) {
	t.Errorf("%v", args...)

ditto


pkg/roachprod/errors/errors.go line 123 at r3 (raw file):

	if exitErr, ok := asExitError(err); ok {
		if exitErr.ExitCode() == 255 {
			return SSH{errors.Mark(err, ErrSSH255)}

Does this work? Does SSH{} implement the necessary Unwrap() method?

@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch from 7beb8a9 to 4d32e19 Compare October 17, 2022 17:37
Copy link
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @srosenberg, and @tbg)


pkg/cmd/roachtest/github.go line 105 at r1 (raw file):

Previously, smg260 (Miral Gadani) wrote…

that's a good catch - it would have been breaking on master too but we only override for cluster creation there

Done


pkg/cmd/roachtest/github.go line 162 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The poster has multiple callers and so I think this should continue to live in roachtest.

Yep sorry - unclear TODO. Was a thought to myself to remove the idea of a category, and instead depend completely on reference errors. issuePoster here refers to the createPostRequest function in roachtest/github.go


pkg/cmd/roachtest/test_impl.go line 280 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This will print t.Fatal("hi") as ["hi"] which isn't what we want. I think you want something like

t.Fatalf(strings.Join(strings.Repeat(len(args), " %v"))[1:], args...)

Done. Thank you. Got lost in the variadic args being passed from func to func.


pkg/cmd/roachtest/test_impl.go line 287 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The symmetry in newFailure is a little misleading then since the first error is special, how about

type failure struct {
  squashedErr error
  wrapped []error
}

or something like that?

Agreed. That is much better.


pkg/roachprod/errors/errors.go line 123 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Does this work? Does SSH{} implement the necessary Unwrap() method?

Yep . In roachprod/errors.go

Code snippet:

func (e SSH) Unwrap() error {
	return e.Err
}

@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch 2 times, most recently from 5c51602 to 921c153 Compare October 18, 2022 17:57
@tbg tbg self-requested a review October 24, 2022 05:39
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)


-- commits line 39 at r6:
💌


pkg/cmd/roachtest/test_impl.go line 48 at r6 (raw file):

// being captured in the squashedErr
type failure struct {
	// this is the single error created from variadic args passed to t.Error(f)/t.Fatal(f)

general comment, our style guide asks for complete sentences incl punctuation and capitalization (except for inline comments where it's the opposite), i.e.

// This is the single error created from variadic args passed to t.{Fatal,Error}{,f}.

I usually don't point out isolated incidents of this but it looks like it's happening throughout here, it will save you lots of nits down the road to get the muscle memory right. :-)


pkg/cmd/roachtest/test_impl.go line 286 at r6 (raw file):

// from a test's closure. The test runner itself should never call this.
func (t *testImpl) Fatal(args ...interface{}) {
	t.Error(args...)

ditto


pkg/cmd/roachtest/test_impl.go line 292 at r6 (raw file):

// Fatalf is like Fatal, but takes a format string.
func (t *testImpl) Fatalf(format string, args ...interface{}) {
	t.Errorf(format, args...)

Yo zneed to call addFailure because you'l


pkg/cmd/roachtest/test_impl.go line 298 at r6 (raw file):

// FailNow implements the TestingT interface.
func (t *testImpl) FailNow() {
	t.addFailure("FailNow called", nil)

passing nil here doesn't seem right, FailNow called doesn't have a verb in it and besides a nil doesn't add any info.


pkg/cmd/roachtest/test_impl.go line 400 at r6 (raw file):

// failureContainsError returns true if any of the errors in a given failure
// matches the reference error
func failureContainsError(refError error, f failure) bool {

nit: switch the order of the arguments, to reflect "failure contains error", not "error contains failure".


pkg/testutils/lint/passes/fmtsafe/functions.go line 96 at r6 (raw file):

	// Both of these signatures need to be included for the linter to not flag
	// roachtest testImpl.addFailure since it is in the main package
	// This could be a bug in nogo

Did you mean to leave this? If so, mind linking an issue?

@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch 2 times, most recently from 77cfc3e to 5b80a31 Compare October 24, 2022 15:18
Copy link
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @srosenberg, and @tbg)


pkg/cmd/roachtest/test_impl.go line 48 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

general comment, our style guide asks for complete sentences incl punctuation and capitalization (except for inline comments where it's the opposite), i.e.

// This is the single error created from variadic args passed to t.{Fatal,Error}{,f}.

I usually don't point out isolated incidents of this but it looks like it's happening throughout here, it will save you lots of nits down the road to get the muscle memory right. :-)

👍 Thanks, missed the capitalisation. I don't think it's happened anywhere else though


pkg/cmd/roachtest/test_impl.go line 286 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto

What is missing here?


pkg/cmd/roachtest/test_impl.go line 292 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yo zneed to call addFailure because you'l

Cut off comment, but addFailure is being called via t.Errorf

@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch 2 times, most recently from 8c0a2a7 to 052a360 Compare October 24, 2022 18:59
@smg260
Copy link
Contributor Author

smg260 commented Oct 24, 2022

Test name will be prepended to issue body in the case of cluster creation AND ssh flakes.

Other conversations resolved. Will merge after TC builds are green.

Miral Gadani added 3 commits October 25, 2022 16:09
 SSH flakes to the test-eng team.

 Also abstracts away how issue name and team is overriden based on
 an issue category, which can be ClusterCreation, SSH, or Other

 Resolves issue cockroachdb#82398

 Release justification: test-only change
 Release note: None
This commit utilises cockroachdb/errors markers api to mark an SSH
error with exit code of 255. The test runner, when it is posting an
issue to github, can act according to which marker may be present
in t.mu.failures[0]. In this case, we override the owning team to
protect them from having to investigate what is likely a transient
SSH issue.

A test now has the concept of a failure[] instead of error[]. Each
failure contains the original `squashedErr` and an errors[],
which are all the errors passed via t.Failf/Errorf.

We can then preserve the relationship of multiple
errors to any particular failure within a test, and match on any given
reference error (like SSH flakes from above)

This moves us away from substring matching on error messages.

The test name is prepended to both cluster creation and ssh errors
to ease troubleshooting as all those issues will be bucketed
under either category respectively.

Release justification: test-only change
Release note: None
is used. This prevents OSX from prompting to allow incoming
network connections when running a roachtest from an ide or
 shell.

Release justification: test-only change
Release note: None
@smg260 smg260 force-pushed the roachtest_redirect_ssh_flakes branch from 052a360 to 741a235 Compare October 25, 2022 16:48
@smg260
Copy link
Contributor Author

smg260 commented Oct 25, 2022

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Oct 25, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 25, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@smg260
Copy link
Contributor Author

smg260 commented Oct 25, 2022

bors retry

@craig
Copy link
Contributor

craig bot commented Oct 25, 2022

Already running a review

@craig
Copy link
Contributor

craig bot commented Oct 25, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: special-case SSH_PROBLEM as infra flakes
5 participants