roachtest: re-evaluate SSH flake detection and prevention #100875

renatolabs · 2023-04-06T20:54:19Z

As of a few months ago, we started using ssh's exit code to identify "SSH flakes": failures that we believe are due to SSH connection as opposed to an actual error in the command itself. As time went on, I believe we started to see how this mechanism is less than ideal, at least the way it's implemented today: there's no way to differentiate a 255 exit code coming from ssh itself (arguably an SSH flake, see below) from one where the command being run returned 255 for its own reasons (which should translate to a legitimate test failure).

The following documents a few recent instances where test failures were marked as SSH flakes when they should have been legitimate test failures:

Jepsen test failures
We recently had to perform an update in our Jepsen fork [1] to fix something that caused the failure of every Jepsen roachtest. The Jepsen failures were marked as SSH flakes [2] because Jepsen will exit with code 255 when an unhandled exception is caught [3].

ruby-install failures
We have seen ruby-install failures being marked as SSH flakes for months as well [4]. Turns out ruby-install will exit with code -1 when it fails [5], which translates to an actual exit code of 255.

GetInternalIP failures
GetInternalIP will return the output of hostname --all-ip-addresses. However, hostname itself will return 255 in some (maybe all?) error conditions (e.g., try hostname --invalid; echo $? on Linux).

More generally, there are a lot of commands that return 255 in error conditions [6], and in those cases we do want the test to fail.

How about when ssh itself returns 255 -- that's gotta be an SSH flake, right?
I'd say depends on how we define "flake". In [7], for example, we see that an authentication failure due to misconfigured SSH keys led to a 255 exit code and therefore (unwanted) retries. In this case, it's not a "flake" (in the sense that the command will fail deterministically), but it would be a failure that, were it to happen on a test, we wouldn't want to notify teams about.

As another example, take this GetInternalIP failure [8]. If we look at the associated SSH logs, we see that the actual reason for the failure is:

debug1: kex_exchange_identification: banner line 0: Exceeded MaxStartups
kex_exchange_identification: Connection closed by remote host

I won't get too much into the details of the semantics of MaxStartups; it's documented in [9]. Suffice it to say that it can cause ssh to randomly drop connections when there are too many unauthenticated SSH connections. It has previously been a source of flakes and back in 2019 (#37001) we bumped that limit. However, no amount of bumping can catch up with port scanners. Indeed, shortly before the failure in [8], we see a port scanner in action:

sshd[18234]: Invalid user ts3 from 171.244.62.120 port 60474
sshd[18235]: Invalid user minecraft from 171.244.62.120 port 60504
sshd[18232]: Invalid user pi from 171.244.62.120 port 60544
sshd[18227]: Invalid user testuser from 171.244.62.120 port 60340
sshd[18242]: Invalid user deploy from 171.244.62.120 port 60624
sshd[18241]: Invalid user test from 171.244.62.120 port 60566
sshd[18244]: Invalid user ansible from 171.244.62.120 port 60594
sshd[18239]: Invalid user bob from 171.244.62.120 port 60450
sshd[18238]: Invalid user test from 171.244.62.120 port 60664
sshd[18257]: Invalid user test from 171.244.62.120 port 60604
sshd[18258]: Invalid user testftp from 171.244.62.120 port 60546
sshd[18259]: Invalid user raspberry from 171.244.62.120 port 60432
sshd[18260]: Invalid user test from 171.244.62.120 port 60390
sshd[18262]: Invalid user hadoop from 171.244.62.120 port 60470
sshd[18261]: Invalid user ubnt from 171.244.62.120 port 60540

Which brings us to prevention; we shouldn't let random port scanners cause a test to fail.

This rambly issue can be summarized in a few points:

checking exit status from ssh to determine whether something was an SSH flake is too brittle. We want to check the exit status of the command being run at the very least.
ideally, we would also not retry commands that fail due to authentication failures
we can probably do a better job at preventing SSH flakes, especially in cases where a port scanner causes sshd to drop connections. We can investigate to what extent we can re-enable sshguard [10], or a similar solution. Even better, the more long term approach would be to get rid of public IPs entirely, as proposed in roachprod: enable provisioning without public ip #77644.

[1] cockroachdb/jepsen#34
[2] Example: #90695 (comment)
[3] https://github.com/cockroachdb/jepsen/blob/3d7c345d6958f067edb097f9b82ab8e7a4a752c7/jepsen/src/jepsen/cli.clj#L278-L280
[4] Example: #99828 (comment)
[5] https://github.com/postmodern/ruby-install/blob/f59dd9cb31a073c2e2c94eb679a348f7ec3ca1b1/share/ruby-install/logging.sh#L71-L75
[6] Others have expressed their confusion on the topic as well, see https://www.redhat.com/sysadmin/exit-codes-demystified for example.
[7] https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1679342613439679 (internal)
[8] #99828 (comment)
[9] https://man7.org/linux/man-pages/man5/sshd_config.5.html
[10] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/vm/gce/utils.go#L134-L136

Jira issue: CRDB-26665

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-04-06T20:54:22Z

cc @cockroachdb/test-eng

smg260 · 2023-04-06T21:37:03Z

Thanks for compiling this issue. I agree with the observations, especially in cases where commands themselves return 255

checking exit status from ssh to determine whether something was an SSH flake is too brittle. We want to check the exit status of the command being run at the very least.

That's essentially what we are doing now. Non-zero exit codes of the command are passed through to the ssh caller.

How else would you propose checking the exit status of the underlying command? If a command did return -1, it will only ever be visible as 255 (two's complement). Is there a way to disambiguate from ssh returning 255, without nasty manipulation of the original command? One possible way is to match on substrings in the debug logs.

The default way in which we determine if a command should be retried is relatively easily modified, as it has access to the stdout/err of the command, as well as the exit status, should we want to look for certain output. [1]

[1] -

cockroach/pkg/roachprod/install/cluster_synced.go

Line 130 in 1ab74fa

    
           var DefaultSSHRetryOpts = newRunRetryOpts(defaultRetryOpt, func(res *RunResultDetails) bool { return errors.Is(res.Err, rperrors.ErrSSH255) })

renatolabs · 2023-04-10T13:56:49Z

Is there a way to disambiguate from ssh returning 255, without nasty manipulation of the original command? One possible way is to match on substrings in the debug logs.

I was intentionally trying to document behavior we probably don't want (mark test as SSH flake if command returned 255) without prescribing a solution. That said, what you mentioned here was one of the things that came to mind: we can likely check SSH logs to determine the source of the 255. That would couple us to a specific logging format and we might need to revisit the logic when we upgrade SSH/Ubuntu, etc, but it might be a worthwhile compromise to make to avoid mislabeling issues.

There might be other solutions that I'm not aware of as well.

renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Apr 6, 2023

renatolabs mentioned this issue Apr 14, 2023

roachprod: only use cluster with positive lifetime #101434

Open

srosenberg mentioned this issue May 31, 2023

roachtest: ssh_problem failed #90695

Open

exalate-issue-sync bot assigned smg260 Sep 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: re-evaluate SSH flake detection and prevention #100875

roachtest: re-evaluate SSH flake detection and prevention #100875

renatolabs commented Apr 6, 2023 •

edited by cockroach-jira-scripts

Loading

blathers-crl bot commented Apr 6, 2023

smg260 commented Apr 6, 2023

renatolabs commented Apr 10, 2023

roachtest: re-evaluate SSH flake detection and prevention #100875

roachtest: re-evaluate SSH flake detection and prevention #100875

Comments

renatolabs commented Apr 6, 2023 • edited by cockroach-jira-scripts Loading

blathers-crl bot commented Apr 6, 2023

smg260 commented Apr 6, 2023

renatolabs commented Apr 10, 2023

renatolabs commented Apr 6, 2023 •

edited by cockroach-jira-scripts

Loading