Improve #196 and #211 #214

mmirate · 2023-11-13T22:20:35Z

Rewrite these two features using disjunctive futures in order to cleanly make the timers optional.
This works because of https://docs.rs/futures/latest/futures/future/enum.Either.html#impl-Future-for-Either%3CA,+B%3E.
It theoretically works best because reusing the same futures means Tokio has to do less bookkeeping. Every non-diverging exit from the select! will reset the keepalive timer, and will also reset the inactivity timer if any nontrivial I/O happened.
Add an analogue of OpenSSH's ServerAliveCountMax.
A counter is persisted on the Session; incremented when a keepalive is sent; and reset when a REQUEST_SUCCESS or REQUEST_FAILURE is received (presumably in reply to a keepalive). If it passes a configurable threshold, the connection is cut.
Mutate the Session to pass information back to the main bg loop from the plaintext packet reader, so that only nontrivial data transfer will reset the inactivity timer. (And so that ServerAliveCountMax will be judged correctly.)

mmirate · 2023-11-13T22:21:47Z

This is related to #196 and #211 but I guess that putting the numbers in the title doesn't do anything. @amtelekom, how badly did I misunderstand what you were looking for?

russh/src/client/mod.rs

amtelekom · 2023-11-14T15:10:22Z

how badly did I misunderstand what you were looking for?

First of all thanks for looking out for my needs :)

Your branch works fine for our purposes - we just need to be able to detect if the network connectivity was lost or the Server exit without closing down the connection correctly on long-standing SSH connections (Netconf transport tunnels).

Since our (OpenSSH-based) servers can send keepalives, just copying the inactivity_timeout from the Russh server side was the quickest solution, but actually sending own keepalives and then timing out on non-responsiveness is actually the better solution to our specific situation.

While it works, it feels like the activity tracking seems like it goes farther than needed?
As mentioned, using the inactivity_timeout config just came from the Russh Server side, where it is just used to add a timeout to the select! loop.
In your current PR, the interpretation of the flag differ between the Server and Client side.

It might still be the better way of tracking "inactivity", but then maybe we should also port this to the Server side for consistency, and expand the explanation in the doc comment in the Config struct?

mmirate · 2023-11-14T21:29:34Z

First of all thanks for looking out for my needs :)

Your branch works fine for our purposes - we just need to be able to detect if the network connectivity was lost or the Server exit without closing down the connection correctly on long-standing SSH connections (Netconf transport tunnels).

Well, I would marvel at the coincidence that our respective employers have the same motivation for us to use russh, but I guess there aren't that many other client-side usecases of SSH that can't be done "simpler" by just shelling out to OpenSSH's client program with tokio::process. (Though I'd assume that that would be less scalable since certainly each russh Session has to be using less memory than each entire ssh process...)

Since our (OpenSSH-based) servers can send keepalives, just copying the inactivity_timeout from the Russh server side was the quickest solution, but actually sending own keepalives and then timing out on non-responsiveness is actually the better solution to our specific situation.

Ah, so that feature's use case now makes a lot more sense. I guess that's what I might have ended up doing, if I had any influence over the SSH servers' configuration.

While it works, it feels like the activity tracking seems like it goes farther than needed?

My best guess is that what you meant is: the activity tracking should only say "even though a packet was read, it didn't merit resetting the inactivity timer" if the packet in-question was clearly a keepalive request or reply. Is this correct?

(By contrast, what I wrote will also avoid resetting the inactivity timer in case of XON/XOFF, unhandled global requests, unhandled channel requests and unhandled packets. Upon reflection, I see that's clearly a subjective judgment of mine.)

As mentioned, using the inactivity_timeout config just came from the Russh Server side, where it is just used to add a timeout to the select! loop. In your current PR, the interpretation of the flag differ between the Server and Client side.

It might still be the better way of tracking "inactivity", but then maybe we should also port this to the Server side for consistency, and expand the explanation in the doc comment in the Config struct?

You're right, I have thusfar been paying zero attention to the Server side of russh. I'll take some time to see where this all fits into the Server.

mmirate · 2023-11-20T02:31:26Z

I'll take some time to see where this all fits into the Server.

I have concluded this process, and thus I believe my recent commit brings Client and Server into parity regarding these two features.

Based on what I saw on the Server side - and contrary to the code I wrote in my previous PR - it seems to me that the purpose of the inactivity timer, relative to OpenSSH-style ServerAliveInterval or ClientAliveInterval, is that the inactivity timer on the one hand doesn't let the peer hold the connection open by responding to a ping-pong, but on the other hand does let the local application hold the connection open by sending data to the peer even without expecting any responses whatsoever to that data (not even window size adjustments or rekeyings or etc). So that is what I have assumed when writing this last commit.

amtelekom · 2023-11-22T10:02:21Z

Hmm,

running your updated commit with

let config = Arc::new(sshclient::Config {
  keepalive_interval: Some(Duration::from_secs(60))
  ..sshclient::Config::default()
});
let client = SSHClient {};
let mut ssh = sshclient::connect(config, socket_addr, client).await?;
// ...

against a default-configuration OpenSSH Server (i.e. it doesn't send keepalives on its own) keeps runing into timeouts.

Activating the Debug outputs for russh, it seems like there is no wait time between the keepalive attempts being sent, and thereby also not giving the remote side more than a few ns to respond (the previous russh debugmsg timestamp before this is exactly a minute older):

2023-11-22T09:53:18Z DEBUG russh::cipher: writing, seqn = 558   
2023-11-22T09:53:18Z DEBUG russh::cipher: padding length 4   
2023-11-22T09:53:18Z DEBUG russh::cipher: packet_length 32   
2023-11-22T09:53:18Z DEBUG russh::cipher: writing, seqn = 559   
2023-11-22T09:53:18Z DEBUG russh::cipher: padding length 4   
2023-11-22T09:53:18Z DEBUG russh::cipher: packet_length 32   
2023-11-22T09:53:18Z DEBUG russh::cipher: writing, seqn = 560   
2023-11-22T09:53:18Z DEBUG russh::cipher: padding length 4   
2023-11-22T09:53:18Z DEBUG russh::cipher: packet_length 32   
2023-11-22T09:53:18Z DEBUG russh::cipher: writing, seqn = 561   
2023-11-22T09:53:18Z DEBUG russh::cipher: padding length 4   
2023-11-22T09:53:18Z DEBUG russh::cipher: packet_length 32   
2023-11-22T09:53:18Z DEBUG russh::client: Timeout, server not responding to keepalives   
2023-11-22T09:53:18Z DEBUG russh::client: disconnected   
2023-11-22T09:53:18Z DEBUG russh::client: drop session

I've pushed a fix to https://github.com/amtelekom/russh/tree/keepalive-improvement (this also fixes that an additional keepalive request is sent right before disconnecting [note the 4 seqns in the previous log], but that's more of a nicety than a functional improvement). Not sure how to best collaborate on PRs on Github :)

mmirate · 2023-11-22T18:18:42Z

Aye, that was a nasty little oversight. Apologies. It should work correctly with that simple fix, and since your code was slightly cleaner than what I came up with prior to reading the entirety of your message, I cherry-picked your commit into here.

mmirate · 2023-12-06T17:29:51Z

The semver checker failed when I rebased this onto the latest commit in main; I assume because the version numbers no longer have the prerelease indicator that allows breaking changes (e.g. these).

mmirate · 2023-12-18T15:52:47Z

Heads-up, @Eugeny: it looks like russh-sftp is using features that weren't stable in Rust 1.65.

Eugeny · 2023-12-18T16:39:45Z

fixed!

* Add an analogue of OpenSSH's `ServerAliveCountMax`. * Use disjunctive futures for cleanly making these timers optional. * Use the `Session` to pass information back to the main bg loop from the plaintext packet reader, so that only nontrivial data transfer will reset the inactivity timer. (And so that `ServerAliveCountMax` will be judged correctly.)

Also bring client and server into parity regarding timers. Also, per OpenSSH documentation, only reset keepalive timer when receiving data, not when sending it. Also, always reset the inactivity timer unless the iteration was ended via sending a keepalive request.

mmirate · 2023-12-18T20:02:08Z

Excellent! I took an educated guess at the version-bump, so CI seems to be in order now.

amtelekom · 2024-01-15T16:04:51Z

Is there something that still needs to happen before this can be merged?

In case it helps: I feel this MR now implements things as I would have expected them to work from a user's perspective, and we've been working based on this branch successfully the last weeks. Would be good to be able to go back to the upstream version though :)

Eugeny · 2024-01-18T09:55:13Z

No, sorry, just lost the track of it

Eugeny reviewed Nov 13, 2023

View reviewed changes

russh/src/client/mod.rs Outdated Show resolved Hide resolved

mmirate force-pushed the keepalive-improvement branch from 5ec4f7c to a05b7ee Compare November 20, 2023 02:25

mmirate force-pushed the keepalive-improvement branch from f8d8567 to 9cd45b9 Compare November 22, 2023 18:17

mmirate force-pushed the keepalive-improvement branch from 9cd45b9 to f5b4670 Compare December 6, 2023 16:10

mmirate requested a review from Eugeny December 6, 2023 17:47

mmirate and others added 3 commits December 18, 2023 11:48

Reset keepalive-timer after sending keepalive

7b640e0

mmirate force-pushed the keepalive-improvement branch from 1c0846d to 7b640e0 Compare December 18, 2023 16:49

mmirate added 2 commits December 18, 2023 12:00

Fix formatting

18918f6

v0.41.0-beta.1

0acde18

Eugeny merged commit 2a4b5a0 into Eugeny:main Jan 18, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve #196 and #211 #214

Improve #196 and #211 #214

mmirate commented Nov 13, 2023

mmirate commented Nov 13, 2023 •

edited

Loading

amtelekom commented Nov 14, 2023

mmirate commented Nov 14, 2023 •

edited

Loading

mmirate commented Nov 20, 2023 •

edited

Loading

amtelekom commented Nov 22, 2023 •

edited

Loading

mmirate commented Nov 22, 2023

mmirate commented Dec 6, 2023 •

edited

Loading

mmirate commented Dec 18, 2023

Eugeny commented Dec 18, 2023

mmirate commented Dec 18, 2023

amtelekom commented Jan 15, 2024 •

edited

Loading

Eugeny commented Jan 18, 2024

Improve #196 and #211 #214

Improve #196 and #211 #214

Conversation

mmirate commented Nov 13, 2023

mmirate commented Nov 13, 2023 • edited Loading

amtelekom commented Nov 14, 2023

mmirate commented Nov 14, 2023 • edited Loading

mmirate commented Nov 20, 2023 • edited Loading

amtelekom commented Nov 22, 2023 • edited Loading

mmirate commented Nov 22, 2023

mmirate commented Dec 6, 2023 • edited Loading

mmirate commented Dec 18, 2023

Eugeny commented Dec 18, 2023

mmirate commented Dec 18, 2023

amtelekom commented Jan 15, 2024 • edited Loading

Eugeny commented Jan 18, 2024

mmirate commented Nov 13, 2023 •

edited

Loading

mmirate commented Nov 14, 2023 •

edited

Loading

mmirate commented Nov 20, 2023 •

edited

Loading

amtelekom commented Nov 22, 2023 •

edited

Loading

mmirate commented Dec 6, 2023 •

edited

Loading

amtelekom commented Jan 15, 2024 •

edited

Loading