You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
renatolabs opened this issue
Apr 6, 2023
· 3 comments
Assignees
Labels
A-testingTesting tools and infrastructureC-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.T-testengTestEng Team
As of a few months ago, we started using ssh's exit code to identify "SSH flakes": failures that we believe are due to SSH connection as opposed to an actual error in the command itself. As time went on, I believe we started to see how this mechanism is less than ideal, at least the way it's implemented today: there's no way to differentiate a 255 exit code coming from ssh itself (arguably an SSH flake, see below) from one where the command being run returned 255 for its own reasons (which should translate to a legitimate test failure).
The following documents a few recent instances where test failures were marked as SSH flakes when they should have been legitimate test failures:
Jepsen test failures
We recently had to perform an update in our Jepsen fork [1] to fix something that caused the failure of every Jepsen roachtest. The Jepsen failures were marked as SSH flakes [2] because Jepsen will exit with code 255 when an unhandled exception is caught [3].
ruby-install failures
We have seen ruby-install failures being marked as SSH flakes for months as well [4]. Turns out ruby-install will exit with code -1 when it fails [5], which translates to an actual exit code of 255.
GetInternalIP failures GetInternalIP will return the output of hostname --all-ip-addresses. However, hostname itself will return 255 in some (maybe all?) error conditions (e.g., try hostname --invalid; echo $? on Linux).
More generally, there are a lot of commands that return 255 in error conditions [6], and in those cases we do want the test to fail.
How about when ssh itself returns 255 -- that's gotta be an SSH flake, right?
I'd say depends on how we define "flake". In [7], for example, we see that an authentication failure due to misconfigured SSH keys led to a 255 exit code and therefore (unwanted) retries. In this case, it's not a "flake" (in the sense that the command will fail deterministically), but it would be a failure that, were it to happen on a test, we wouldn't want to notify teams about.
As another example, take this GetInternalIP failure [8]. If we look at the associated SSH logs, we see that the actual reason for the failure is:
debug1: kex_exchange_identification: banner line 0: Exceeded MaxStartups
kex_exchange_identification: Connection closed by remote host
I won't get too much into the details of the semantics of MaxStartups; it's documented in [9]. Suffice it to say that it can cause ssh to randomly drop connections when there are too many unauthenticated SSH connections. It has previously been a source of flakes and back in 2019 (#37001) we bumped that limit. However, no amount of bumping can catch up with port scanners. Indeed, shortly before the failure in [8], we see a port scanner in action:
sshd[18234]: Invalid user ts3 from 171.244.62.120 port 60474
sshd[18235]: Invalid user minecraft from 171.244.62.120 port 60504
sshd[18232]: Invalid user pi from 171.244.62.120 port 60544
sshd[18227]: Invalid user testuser from 171.244.62.120 port 60340
sshd[18242]: Invalid user deploy from 171.244.62.120 port 60624
sshd[18241]: Invalid user test from 171.244.62.120 port 60566
sshd[18244]: Invalid user ansible from 171.244.62.120 port 60594
sshd[18239]: Invalid user bob from 171.244.62.120 port 60450
sshd[18238]: Invalid user test from 171.244.62.120 port 60664
sshd[18257]: Invalid user test from 171.244.62.120 port 60604
sshd[18258]: Invalid user testftp from 171.244.62.120 port 60546
sshd[18259]: Invalid user raspberry from 171.244.62.120 port 60432
sshd[18260]: Invalid user test from 171.244.62.120 port 60390
sshd[18262]: Invalid user hadoop from 171.244.62.120 port 60470
sshd[18261]: Invalid user ubnt from 171.244.62.120 port 60540
Which brings us to prevention; we shouldn't let random port scanners cause a test to fail.
This rambly issue can be summarized in a few points:
checking exit status from ssh to determine whether something was an SSH flake is too brittle. We want to check the exit status of the command being run at the very least.
ideally, we would also not retry commands that fail due to authentication failures
we can probably do a better job at preventing SSH flakes, especially in cases where a port scanner causes sshd to drop connections. We can investigate to what extent we can re-enable sshguard [10], or a similar solution. Even better, the more long term approach would be to get rid of public IPs entirely, as proposed in roachprod: enable provisioning without public ip #77644.
The text was updated successfully, but these errors were encountered:
renatolabs
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
A-testing
Testing tools and infrastructure
T-testeng
TestEng Team
labels
Apr 6, 2023
Thanks for compiling this issue. I agree with the observations, especially in cases where commands themselves return 255
checking exit status from ssh to determine whether something was an SSH flake is too brittle. We want to check the exit status of the command being run at the very least.
That's essentially what we are doing now. Non-zero exit codes of the command are passed through to the ssh caller.
How else would you propose checking the exit status of the underlying command? If a command did return -1, it will only ever be visible as 255 (two's complement). Is there a way to disambiguate from ssh returning 255, without nasty manipulation of the original command? One possible way is to match on substrings in the debug logs.
The default way in which we determine if a command should be retried is relatively easily modified, as it has access to the stdout/err of the command, as well as the exit status, should we want to look for certain output. [1]
Is there a way to disambiguate from ssh returning 255, without nasty manipulation of the original command? One possible way is to match on substrings in the debug logs.
I was intentionally trying to document behavior we probably don't want (mark test as SSH flake if command returned 255) without prescribing a solution. That said, what you mentioned here was one of the things that came to mind: we can likely check SSH logs to determine the source of the 255. That would couple us to a specific logging format and we might need to revisit the logic when we upgrade SSH/Ubuntu, etc, but it might be a worthwhile compromise to make to avoid mislabeling issues.
There might be other solutions that I'm not aware of as well.
A-testingTesting tools and infrastructureC-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.T-testengTestEng Team
As of a few months ago, we started using ssh's exit code to identify "SSH flakes": failures that we believe are due to SSH connection as opposed to an actual error in the command itself. As time went on, I believe we started to see how this mechanism is less than ideal, at least the way it's implemented today: there's no way to differentiate a 255 exit code coming from
ssh
itself (arguably an SSH flake, see below) from one where the command being run returned 255 for its own reasons (which should translate to a legitimate test failure).The following documents a few recent instances where test failures were marked as SSH flakes when they should have been legitimate test failures:
Jepsen test failures
We recently had to perform an update in our Jepsen fork [1] to fix something that caused the failure of every Jepsen roachtest. The Jepsen failures were marked as SSH flakes [2] because Jepsen will exit with code 255 when an unhandled exception is caught [3].
ruby-install failures
We have seen
ruby-install
failures being marked as SSH flakes for months as well [4]. Turns outruby-install
will exit with code-1
when it fails [5], which translates to an actual exit code of 255.GetInternalIP failures
GetInternalIP
will return the output ofhostname --all-ip-addresses
. However,hostname
itself will return 255 in some (maybe all?) error conditions (e.g., tryhostname --invalid; echo $?
on Linux).More generally, there are a lot of commands that return 255 in error conditions [6], and in those cases we do want the test to fail.
How about when
ssh
itself returns 255 -- that's gotta be an SSH flake, right?I'd say depends on how we define "flake". In [7], for example, we see that an authentication failure due to misconfigured SSH keys led to a 255 exit code and therefore (unwanted) retries. In this case, it's not a "flake" (in the sense that the command will fail deterministically), but it would be a failure that, were it to happen on a test, we wouldn't want to notify teams about.
As another example, take this
GetInternalIP
failure [8]. If we look at the associated SSH logs, we see that the actual reason for the failure is:I won't get too much into the details of the semantics of
MaxStartups
; it's documented in [9]. Suffice it to say that it can cause ssh to randomly drop connections when there are too many unauthenticated SSH connections. It has previously been a source of flakes and back in 2019 (#37001) we bumped that limit. However, no amount of bumping can catch up with port scanners. Indeed, shortly before the failure in [8], we see a port scanner in action:Which brings us to prevention; we shouldn't let random port scanners cause a test to fail.
This rambly issue can be summarized in a few points:
ssh
to determine whether something was an SSH flake is too brittle. We want to check the exit status of the command being run at the very least.sshguard
[10], or a similar solution. Even better, the more long term approach would be to get rid of public IPs entirely, as proposed in roachprod: enable provisioning without public ip #77644.[1] cockroachdb/jepsen#34
[2] Example: #90695 (comment)
[3] https://github.com/cockroachdb/jepsen/blob/3d7c345d6958f067edb097f9b82ab8e7a4a752c7/jepsen/src/jepsen/cli.clj#L278-L280
[4] Example: #99828 (comment)
[5] https://github.com/postmodern/ruby-install/blob/f59dd9cb31a073c2e2c94eb679a348f7ec3ca1b1/share/ruby-install/logging.sh#L71-L75
[6] Others have expressed their confusion on the topic as well, see https://www.redhat.com/sysadmin/exit-codes-demystified for example.
[7] https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1679342613439679 (internal)
[8] #99828 (comment)
[9] https://man7.org/linux/man-pages/man5/sshd_config.5.html
[10] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/vm/gce/utils.go#L134-L136
Jira issue: CRDB-26665
The text was updated successfully, but these errors were encountered: