Native parallel client scp_send unexpected timeouts #271

Closed
zephyrj opened this issue Jan 6, 2021 · 3 comments · Fixed by #272
zephyrj commented Jan 6, 2021

When a timeout value is set on a parallel-ssh client, it appears that this timeout is used as the maximum amount of time to wait for scp_send operations to complete on all nodes. When sending a large file to multiple hosts I have started seeing this error:

('Error opening remote file %s for writing on host %s - %s', 'big_file', u'host', Timeout())

I construct the ParallelSSHClient object like this:

ParallelSSHClient(self.m_hosts, user=self.username, password=password, pkey=key, keepalive_seconds=30,
                  timeout=10, 
                  num_retries=1, 
                  retry_delay=1, 
                  pool_size=100 if len(self.m_hosts) < 50 else len(self.m_hosts) * 2)

The timeouts always occur 10 seconds after starting the scp_send operation and joining on it:

transfer_sessions = self.m_sshClient.scp_send(local_file_path, remote_file_path)
gevent.joinall(transfer_sessions, raise_error=True)

I have been using version 2.3.2, but I have verified that the code in this area hasn't changed in the latest version. I would like to try it on the latest parallel-ssh release, but the environment I am working in uses Python 2 (much to my disdain) and I have no control over that, so I am unable to. I'm hoping this report helps with future improvements.

I did a bit of digging into the code and found the block that raises it, in pssh/clients/native/parallel.py:

def _scp_send(self, local_file, remote_file):
    fileinfo = os.stat(local_file)
    try:
        chan = self._eagain(
            self.session.scp_send64,
            remote_file, fileinfo.st_mode & 0o777, fileinfo.st_size,
            fileinfo.st_mtime, fileinfo.st_atime)
    except Exception as ex:
        msg = "Error opening remote file %s for writing on host %s - %s"
        logger.error(msg, remote_file, self.host, ex)
        raise SCPError(msg, remote_file, self.host, ex)

The self._eagain call is doing this:

def _eagain(self, func, *args, **kwargs):
    timeout = kwargs.pop('timeout', self.timeout)
    with GTimeout(seconds=timeout, exception=Timeout):
        ret = func(*args, **kwargs)
        while ret == LIBSSH2_ERROR_EAGAIN:
            self.poll()
            ret = func(*args, **kwargs)
        return ret

It uses self.timeout as the timeout for this operation, which in my case is 10 (from my constructor args). It seemed unlikely that opening the remote file would regularly take 10 seconds, so I added some debug logging to a local copy of the source and found this:

Opened remote file on host0 in 0.0152170658112
Finished sending file data on host0 in 7.17523002625
Opened remote file on host1 in 7.19069314003
Finished sending file data on host1 in 7.42101097107
Opened remote file on host3 in 14.6122598648
Finished sending file data on host3 in 7.85037207603
Opened remote file on host2 in 22.4626410007
Finished sending file data on host2 in 7.75936889648

It appears that the opening of each remote file (in the scp_send64 call) is being queued up behind the actual copying of the files, meaning that this timeout value effectively becomes a "maximum amount of time to wait for the copy to complete on all hosts".
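
This scheduling behaviour can be sketched outside parallel-ssh entirely. In the toy demo below (hypothetical code, assuming only gevent is installed), one greenlet arms a gevent Timeout and yields while "waiting on the socket", while a sibling greenlet blocks the thread the way a large non-yielding write would; the waiter's timer expires even though its own work is near-instant:

import time

import gevent
from gevent import Timeout


def open_remote_file():
    # Stands in for _eagain(scp_send64, ...): arm the timeout, then yield
    # to the hub while "waiting" for the socket to become ready.
    try:
        with Timeout(3):
            gevent.sleep(10)  # stand-in for self.poll()
            print("opened remote file")
    except Timeout:
        print("Timeout() - starved by the writer greenlet")


def send_file_data():
    # Stands in for the chan.write() loop: this time.sleep() is not
    # monkey-patched, so it blocks the whole thread and nothing else runs.
    time.sleep(5)
    print("finished sending file data")


gevent.joinall([gevent.spawn(open_remote_file),
                gevent.spawn(send_file_data)])

With more hosts there are more blocking writes queued ahead of each still-pending open, so the opens drift further and further past the shared 10 second budget - exactly the stair-stepping visible in the timings above.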

I'm not too familiar with greenlets, but my understanding is that these operations should be performed co-operatively, so my guess is that the self.eagain_write(chan.write, data) call that follows never yields while performing the write. For my own learning I added a self.poll() call between the scp_send64 and self.eagain_write(chan.write, data) calls (sketched after the timings below), and this gives the behaviour I would have expected - the copy is allowed to complete, without error, even if it takes more than 10 seconds:

Opened remote file on host1 in 0.0519061088562
Opened remote file on host0 in 0.0521171092987
Opened remote file on host2 in 0.0541279315948
Opened remote file on host3 in 0.0543429851532
Finished sending file data on host1 in 3.092856884
Finished sending file data on host0 in 6.02891302109
Finished sending file data on host2 in 9.43797397614
Finished sending file data on host3 in 12.3629457951
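
For reference, the spirit of that experiment (an illustrative sketch, not the attached patch - it assumes the write loop in _scp_send follows directly after the block quoted earlier, using the self.eagain_write call mentioned above):

chan = self._eagain(
    self.session.scp_send64,
    remote_file, fileinfo.st_mode & 0o777, fileinfo.st_size,
    fileinfo.st_mtime, fileinfo.st_atime)
self.poll()  # added: yield to the hub so other hosts' greenlets can open their files
with open(local_file, 'rb') as local_fh:
    for data in local_fh:
        self.eagain_write(chan.write, data)

The single added yield is enough to let every host's scp_send64 run (and its GTimeout be disarmed) before the writes monopolise the thread.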

To Reproduce
Create a ParallelSSHClient with a small timeout value, connecting to multiple hosts (the more the better):

ParallelSSHClient(<multiple-hosts>,
                  user=username,
                  password=password,
                  timeout=3)

Try to send a large (500MB+) file to all of these hosts - the important part is that sending the file takes longer than the timeout set on the client:

transfer_sessions = client.scp_send(local_file_path, remote_file_path)
gevent.joinall(transfer_sessions, raise_error=True)
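
If a suitably large test file isn't to hand, one can be created first (hypothetical helper; a sparse file still transfers its full logical size over SCP):

size = 600 * 1024 * 1024  # comfortably above 500MB
with open(local_file_path, 'wb') as f:
    f.seek(size - 1)  # seek past the end...
    f.write(b'\0')    # ...one byte at the far end extends the file to `size`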

An error will be raised by the joinall call:

('Error opening remote file %s for writing on host %s - %s', 'big_file', u'host', Timeout())

Expected behavior
The timeout value set on the client should only apply to the opening of the remote file - not to the transfer of the file contents.

Additional info
parallel-ssh version: 2.3.2
Python version: 2.7.5

I'm embarrassed by the old, unsupported environment I'm working with, but I hope this can be addressed in the latest code versions, and that I can convince someone that we should move to Python 3 :)

zephyrj commented Jan 6, 2021

The patch to add the debug logging mentioned above:
Add_debug_logging_patch.txt

The patch adding in the poll call:
allow_greenlet_poll_patch.txt

pkittenis mentioned this issue Jan 9, 2021
pkittenis commented Jan 9, 2021

Hi there,

Thank you for the interest and detailed report.

Yes, it looks like a large enough write can stop other greenlets from executing - and not just for SCP. Have made some changes to resolve this.

Re: Python 2 - later parallel-ssh versions will probably still work on it, though Python 2 is not supported.

pkittenis commented Jan 9, 2021

2.5.3 resolves this.

pkittenis added the bug label Jan 9, 2021