Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connection closed by remote #636

Closed
ThomasWaldmann opened this issue Feb 6, 2016 · 33 comments
Closed

connection closed by remote #636

ThomasWaldmann opened this issue Feb 6, 2016 · 33 comments

Comments

@ThomasWaldmann
Copy link
Member

sometimes a borg backup gets interrupted with "connection closed by remote".

it is unclear why this happens or whether this is a problem in borg, in the ssh configuration or on the network layer.

Suspicions:

  • ssh / tcp keepalive configuration (this was pointed out on the atticmatic site, but it did not help in a experiment a borg user did), so it either does not help or is sometimes not sufficient (from IRC: Azelphur$ FYI, TCPKeepAlive yes, ServerAliveInterval 60, ServerAliveCountMax 5 didn't work :( )
    See also: http://www.cyberciti.biz/tips/open-ssh-server-connection-drops-out-after-few-or-n-minutes-of-inactivity.html
  • borg: EINTR and in general exception handling maybe could be improved in remote.py (though it is unclear what would cause EINTR in our case and would not terminate the process anyway). See there for details: http://250bpm.com/blog:12
  • there might be longer times with no or only little data transfer (e.g. if borg is reading over a lot of zeros in a sparse file or if all the chunks are already stored in the repo - e.g. when backing up VMs)
  • a router might be configured in its security/DoS settings to do something special in case a lot of packets arrive - could this be triggered by borg?
  • also (trivial), it might be just a unstable network connection that breaks down
@ThomasWaldmann
Copy link
Member Author

Idea for maybe getting more infos from the server side: instead of relying on the log messages that come over the wire (or rather don't come if the connection gets killed), configure the borg on the server to use logging to a local file.

logging.conf: https://paste.thinkmo.de/Nu7rdXvz#logging.conf

export BORG_LOGGING_CONF=/path/to/that/logging.conf
^ that needs to be put into the environment ssh users (like the borg executable) get.
(not sure whether we need a borg wrapper script here or a sshd config change)

@Azelphur
Copy link

Azelphur commented Feb 8, 2016

The borg log sadly doesn't actually contain any useful information, it just says

2016-02-08 19:14:54,477 borg.logger DEBUG using logging configuration read from "/home/azelphur/logging.conf"

Here is the log from the client side, running with BORG_RSH="ssh -v"

# BORG_RSH="ssh -v" /home/azelphur/.borg-backup
Backing up to ssh://azelphur@backup.azelphur.com:23/home/azelphur/darth-vader.borg
Remote: OpenSSH_6.9p1 Ubuntu-2ubuntu0.1, OpenSSL 1.0.2d 9 Jul 2015
Remote: debug1: Reading configuration data /etc/ssh/ssh_config
Remote: debug1: /etc/ssh/ssh_config line 19: Applying options for *
Remote: debug1: Connecting to backup.azelphur.com [176.9.24.240] port 23.
Remote: debug1: Connection established.
Remote: debug1: permanently_set_uid: 0/0
Remote: debug1: identity file /root/.ssh/id_rsa type 1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_rsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_dsa type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_dsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ecdsa type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ecdsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ed25519 type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ed25519-cert type -1
Remote: debug1: Enabling compatibility mode for protocol 2.0
Remote: debug1: Local version string SSH-2.0-OpenSSH_6.9p1 Ubuntu-2ubuntu0.1
Remote: debug1: Remote protocol version 2.0, remote software version OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4
Remote: debug1: match: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 pat OpenSSH_6.6.1* compat 0x04000000
Remote: debug1: Authenticating to backup.azelphur.com:23 as 'azelphur'
Remote: debug1: SSH2_MSG_KEXINIT sent
Remote: debug1: SSH2_MSG_KEXINIT received
Remote: debug1: kex: server->client chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: kex: client->server chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Remote: debug1: Server host key: ecdsa-sha2-nistp256 SHA256:TZNuG4BKyQfl8EO5Jv+Ld7D0CWeS5Jzr5P/16f4svxI
Remote: debug1: Host '[backup.azelphur.com]:23' is known and matches the ECDSA host key.
Remote: debug1: Found key in /root/.ssh/known_hosts:1
Remote: debug1: SSH2_MSG_NEWKEYS sent
Remote: debug1: expecting SSH2_MSG_NEWKEYS
Remote: debug1: SSH2_MSG_NEWKEYS received
Remote: debug1: SSH2_MSG_SERVICE_REQUEST sent
Remote: debug1: SSH2_MSG_SERVICE_ACCEPT received
Remote: debug1: Authentications that can continue: publickey,password
Remote: debug1: Next authentication method: publickey
Remote: debug1: Offering RSA public key: /root/.ssh/id_rsa
Remote: debug1: Server accepts key: pkalg ssh-rsa blen 279
Remote: debug1: Authentication succeeded (publickey).
Remote: Authenticated to backup.azelphur.com ([176.9.24.240]:23).
Remote: debug1: channel 0: new [client-session]
Remote: debug1: Requesting no-more-sessions@openssh.com
Remote: debug1: Entering interactive session.
Remote: debug1: Sending environment.
Remote: debug1: Sending env LANG = en_GB.UTF-8
Remote: debug1: Sending command: borg serve --umask=077 --info
reading files cache
processing files
Remote: debug1: need rekeying
Remote: debug1: SSH2_MSG_KEXINIT sent
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: SSH2_MSG_KEXINIT received
Remote: debug1: kex: server->client chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: kex: client->server chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: Server host key: ecdsa-sha2-nistp256 SHA256:TZNuG4BKyQfl8EO5Jv+Ld7D0CWeS5Jzr5P/16f4svxI
Remote: debug1: set_newkeys: rekeying
Remote: debug1: SSH2_MSG_NEWKEYS sent
Remote: debug1: expecting SSH2_MSG_NEWKEYS
Remote: debug1: set_newkeys: rekeying
Remote: debug1: SSH2_MSG_NEWKEYS received
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host

@ThomasWaldmann
Copy link
Member Author

ok, if that 1 line is all you have in the server log, we at least know it is not the borg process on the server side that encounters some exception or other error - that would be in that log.

so, i guess you have to go through the list of suspicions from my first post, especially the ones relating to ssh itself and the network layer.

another idea is to enhance logging on both sides with a logging.conf and include the timestamps, so we get timing information.

also interesting would be to add another piece of information to the protocol of the experiments:

  • if you do a first backup, there is a lot of traffic going over the network.
  • if you do a 2nd+ backup of same or similar data, there is way less data going over the network
    That difference might be important for debugging this problem. It would be interesting whether the issue happens in both cases or not.

@Azelphur
Copy link

ssh / tcp keepalive configuration (this was pointed out on the atticmatic site, but it did not help in a experiment a borg user did), so it either does not help or is sometimes not sufficient (from IRC: Azelphur$ FYI, TCPKeepAlive yes, ServerAliveInterval 60, ServerAliveCountMax 5 didn't work :( )
See also: http://www.cyberciti.biz/tips/open-ssh-server-connection-drops-out-after-few-or-n-minutes-of-inactivity.html

To confirm, I do indeed have

Host *
    ServerAliveInterval 60
    ServerAliveCountMax 5
    TCPKeepAlive yes

In my /etc/ssh/ssh_config file, however the issue still occurs.

there might be longer times with no or only little data transfer (e.g. if borg is reading over a lot of zeros in a sparse file or if all the chunks are already stored in the repo - e.g. when backing up VMs)

also (trivial), it might be just a unstable network connection that breaks down

if you do a 2nd+ backup of same or similar data, there is way less data going over the network That difference might be important for debugging this problem. It would be interesting whether the issue happens in both cases or not.

To address these questions, I created 10 random files and then tried to backup these files (twice) while running mtr, here's the results

azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample2.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample3.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample4.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample5.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample6.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample7.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample8.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample9.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample0.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ du -sh *
1.1G    sample0.txt
1.1G    sample2.txt
1.1G    sample3.txt
1.1G    sample4.txt
1.1G    sample5.txt
1.1G    sample6.txt
1.1G    sample7.txt
1.1G    sample8.txt
1.1G    sample9.txt
1.1G    sample.txt
azelphur@darth-vader:~/tmp/pointless-file$ borg create --compression lz4 -sv --exclude-caches ssh://azelphur@backup.azelphur.com:23/home/azelphur/test.borg::`hostname`-`date "+%Y-%m-%d_%H:%M:%S"` /home/azelphur/tmp/pointless-file
reading files cache
processing files
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host
azelphur@darth-vader:~/tmp/pointless-file$ borg create --compression lz4 -sv --exclude-caches ssh://azelphur@backup.azelphur.com:23/home/azelphur/test.borg::`hostname`-`date "+%Y-%m-%d_%H:%M:%S"` /home/azelphur/tmp/pointless-file
reading files cache
processing files
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host

and the output from mtr

                             My traceroute  [v0.85]
darth-vader (0.0.0.0)                                  Fri Feb 12 16:40:12 2016
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. router.asus.com                   0.0%   411    0.3   0.3   0.2   0.9   0.0
 2. lns12.the.dsl.enta.net            0.0%   411   10.9  13.3   9.0  58.5   7.0
 3. te3-3.the.dist.dsl.enta.net       0.0%   411   11.3  13.7   9.2  91.9   7.9
 4. te2-2.telehouse-east3.dsl.enta.n  0.0%   411   10.0  13.7   9.3 116.0   8.6
 5. te5-6.telehouse-east2.core.enta.  1.2%   411   11.4  13.9   9.7  56.0   6.9
 6. te0-1-0-1.telehouse-east4.core.e  0.0%   411   11.5  13.8  10.1  38.1   6.2
 7. te5-3.amsterdam.core.enta.net     0.0%   411   16.5  19.3  15.3  54.8   6.7
 8. amsix-gw.hetzner.de               0.0%   411   24.7  28.3  24.1  72.4   7.0
 9. core1.hetzner.de                  0.2%   411   24.5  28.0  23.9  65.2   6.9
10. core22.hetzner.de                 0.0%   411   29.4  32.6  28.9  56.3   6.4
11. juniper1.rz15.hetzner.de          0.0%   411   30.2  32.9  28.7  82.8   7.4
12. hos-tr1.ex3k2.rz15.hetzner.de     0.0%   410   30.2  33.7  29.2  66.2   6.4
13. backup.azelphur.com               0.0%   410   29.1  32.2  28.7  62.9   6.1

a router might be configured in its security/DoS settings to do something special in case a lot of packets arrive - could this be triggered by borg?

My router is an Asus RT-AC87R running AsusWRT-Merlin, DoS protection is disabled and QoS is also off.

@ThomasWaldmann
Copy link
Member Author

@Azelphur just as a weird idea: could you remove these ssh options (both on server and on client side, restart sshd afterwards) and run your random-files test again?

@ThomasWaldmann
Copy link
Member Author

jborg/attic#337
jborg/attic#337 (comment) (increase interval)

http://z9.io/2008/12/10/how-to-fix-ssh-timeout-problems/ (esp. first comment of Kimmo)

@jungle-boogie
Copy link
Contributor

yes, those settings can help but if there's no response, the session can still be terminated. I would also say remove the settings and attempt to leave ssh connected and see what the results are, outside of borg.

@Azelphur
Copy link

Azelphur commented Aug 1, 2016

Ok...here I am, 6 months later. I finally solved it.

There was an issue with my router, I was running an app on the router which does speedtests every 15 minutes. After disabling this app, the problem no longer occurs. Not sure exactly why the issue occurred beyond that, but if anyone else gets this issue and happens to be running some services on their router - disable those services and see if you still get it. :)

@ThomasWaldmann
Copy link
Member Author

@Azelphur thanks for the feedback.

@drzraf
Copy link

drzraf commented Oct 5, 2016

Adding to this topic (unless you prefer a new issue) that killing the ssh/gpg-agent will make borg 1.0.7 create hanging indefinitely (strace shows "select() Timeout") with no warning (at least, in non-verbose mode).
Reproducer: run a borg create then kill the ssh-agent and wait a bit.

@enkore
Copy link
Contributor

enkore commented Oct 5, 2016

That's strange. We tested this with killing the ssh client, which works as expected (ie. terminates with an error message).

@ThomasWaldmann
Copy link
Member Author

@drzraf please give the precise command how you do the kill.

@drzraf
Copy link

drzraf commented Oct 5, 2016

Sorry for the false flag. The timeouts seen were not a sign of hanging (it was indeed uploading): the process was still ongoing. In case these "select() timeout" are of interest anyway here is my command:

BORG_PASSPHRASE=xxx borg create --remote-path=/usr/local/bin/borg1/borg1 --exclude-from xxx --exclude-caches --one-file-system --compression zlib --stats --debug "remote:main-backup::2016-10-04 04:24:03.438923" mydir

(I use such a backup name in order to "resume" from the previous borg create where I initially used {now})

The fact I observed was that remote repository size was not growing after multiple resumed borg create runs.
But the real reason was that it was always interrupted (screenlocker/kill ssh-agent) before the next file (~ 4 GB size) could terminate successfully and since borg can't resume file upload (as of 1.0.7), it was always "removing" this partial file and restarting.
My hypothesis about borg <-> ssh-agent was wrong. Sorry for the noise.

@ThomasWaldmann
Copy link
Member Author

@drzraf you do not need to use a special name to "resume" a past backup. just create a new backup with any name you like and it will use what is already in the repo (so this feels like a resume operation, but technically it is not).

@rugk
Copy link
Contributor

rugk commented Oct 31, 2016

IMHO you cannot prevent this type of error. As one user pointed out it may have to do with the router or - e.g. - with VPNs/proxies whatever...

In any case I would much more like it, if Bork is able to circumvent this issue by automatically retrying the backup (maybe 3 or a configurable number of) times.
This way this error is much less serious and backups should still work.

@acwatkins
Copy link

I am getting this error consistently directly connected to a gigabit switch. Doesn't seem to be happening on the same file though. (I just reran it and it redid a bunch of files and went past the one that it hung on before.

@rugk
Copy link
Contributor

rugk commented Jul 22, 2017

BTW I actually made a small shell wrapper script for borg, which circumvents this issue by just retrying the backup (as I suggested before). See: https://github.com/rugk/borg-cron-helper

@marcelmindemann
Copy link

Thank you @rugk. I am also affected by randomly dropped connections so your script is very helpful. If anyone wants to automate their backups without an additional dependency, the most barebone approach would be:
$ until $(borg create <repository> <files>); do :; done
which restarts the command until borg exits successfully with exit code 0. This makes it impossible to use Ctrl-C to cancel a running backup, so use Ctrl-Z followed by a $ kill %1 (or whatever job id the subshell has).

If you want to control how often to restart the backup or include some exponential backoff time between attempts, look at https://github.com/kadwanev/retry, which is a very small bash retry script that is generally useful.

@mikenye
Copy link
Contributor

mikenye commented Apr 2, 2019

My comment on issue #3988 might also help. Specifically:

In the server's (the machine running borg serve) /etc/ssh/sshd_config file:

ClientAliveInterval 10
ClientAliveCountMax 30

This will cause the server to send a keep alive to the client every 10 seconds. If 30 consecutive keepalives are sent without a response, the connection will be terminated.

If you then run your borg commands with --lock-wait 600, this gives sufficient time for the borg serve processes to exit after the SSH connection is torn down after the 300 second (5 minute) wait for the keepalives to fail.

ThomasWaldmann pushed a commit that referenced this issue Apr 8, 2019
Add "SSH Configuration" section

Add "SSH Configuration" section to "borg serve" documentation, to outline ssh/sshd configuration to prevent borg serve keeping a lock on a repo in the event the ssh connection is abnormally disconnected.

In response to issues #3988, #636 and #4485 (and probably others).
@mikenye
Copy link
Contributor

mikenye commented Apr 10, 2019

The documentation has been updated, I think this issue can now be closed?

@ThomasWaldmann
Copy link
Member Author

yup, thanks!

@akvadrako
Copy link

I am seeing this issue on BorgBase when syncing the indexes of an existing repo - could it be that it isn't configured correctly?

@ThomasWaldmann
Copy link
Member Author

@m3nu ^

@m3nu
Copy link
Contributor

m3nu commented Aug 10, 2019

Thanks for the pointer, @ThomasWaldmann . I did NOT have those values in my configs, but the default, which is:

ClientAliveInterval 0
ClientAliveCountMax 3

Just updated this for current and future setups. Hope this solves the issue @akvadrako is observing.

@akvadrako
Copy link

Thanks @m3nu - it turns out I was seeing this errors while over a VPN. If I see them again I'll let you know.

@abulka
Copy link

abulka commented Jan 31, 2023

I think the actual documentation of the ssh fix referred to here is
https://borgbackup.readthedocs.io/en/stable/usage/serve.html?highlight=lock-wait#examples
which involves

  1. editing the server's /etc/ssh/sshd_config and restarting the ssh service
  2. editing client side’s ~/.ssh/config
  3. modifying the server e.g. borg-backup user's authorized_keys to include the --lock-wait 600 parameter to the borg serve command therein e.g. presumably something like: command="/usr/local/bin/borg serve --lock-wait 600 --restrict-to-path /volume1/BorgBackup/" ssh-ed25519 AAAACxxxxxx...etc

I'm trying out these steps now...

@ThomasWaldmann
Copy link
Member Author

Not sure if enforcing --lock-wait via authorized_keys is a) working and b) is helpful - usually the client shall tell how long it wants to wait for a locked repo to get unlocked.

@abulka
Copy link

abulka commented Feb 1, 2023

Yeah, turns out the ssh tweaks above didn't help me.

Sigh... not sure what to try next - have raised #7313

@panikinator
Copy link

Yeah, turns out the ssh tweaks above didn't help me.

Sigh... not sure what to try next - have raised #7313

Yeah same for me, tried all the above tweaks but raises the "Connection closed by remote host" error after a while no matter what

@m3nu
Copy link
Contributor

m3nu commented Feb 4, 2023

Borgmatic has a retry setting that will just retry the backup. Decrease checkpoint time if you want.

I keep a long FAQ on the issue here: https://docs.borgbase.com/faq/#my-ssh-connection-breaks-after-a-long-backup-or-prune-operation

May give some new ideas.

@ThomasWaldmann
Copy link
Member Author

@m3nu That retry script, does it retry on rc==1? That would be bad...

@m3nu
Copy link
Contributor

m3nu commented Feb 4, 2023

Good point. Borgmatic doesn't, but the general retry script treats it as error and retries 10x. I've added a note on this in the FAQ. Good to mention it.

@panikinator
Copy link

Borgmatic has a retry setting that will just retry the backup. Decrease checkpoint time if you want.

I keep a long FAQ on the issue here: https://docs.borgbase.com/faq/#my-ssh-connection-breaks-after-a-long-backup-or-prune-operation

May give some new ideas.

Thanks a lot! But tbh a bit janky but it's working i guess

I actually ended up setting the tcp keepalive to no coz it gave me better results but my tests were not that precise so i may be wrong

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests