Connecting to windows remote machine with parallelly::makeClusterPSOCK hangs #96

Closed
Tadge-Analytics opened this issue Jan 22, 2023 · 23 comments
Labels
documentation MS_Windows question Further information is requested

Comments

@Tadge-Analytics

Tadge-Analytics commented Jan 22, 2023

When I run the following code...
It hangs and doesn't end up connecting.

This is me connecting to another Windows 11 machine.
I have been able to SSH into it.

Any tips?

Here is a video of me troubleshooting with dryrun = TRUE
https://youtu.be/857DRD-k-DA

ip <- "10.0.44.224"

# Path to private SSH key that matches key uploaded to DigitalOcean
ssh_private_key_file <- "/Users/Julian/.ssh/id_rsa"

# Connect and create a cluster
cl <- parallelly::makeClusterPSOCK(
  ip,
  user = "Julian",
  rshopts = c(
    # "-o", "StrictHostKeyChecking=no",
    # "-o", "IdentitiesOnly=yes",
    "-i", ssh_private_key_file
  ),
  master = "10.0.44.234",
  homogeneous = TRUE,
  dryrun = TRUE
  # dryrun = FALSE
)

@HenrikBengtsson
Collaborator

Hi. As a starter, make sure you can do what's in the section 'Failing to set up remote workers' of https://parallelly.futureverse.org/reference/makeClusterPSOCK.html.

Thanks for the video, but unfortunately, I cannot really see the text, because I've only got a 14-inch screen, so the fonts are too teeny. But, as a follow-up to the above tests, look at the commands that are output when you do dryrun = TRUE. On Linux, it's something like:

'/usr/bin/ssh' -R 11045:10.0.44.234:11045 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224

but you'll get another ssh command if you do this from MS Windows. Make sure you can cut'n'paste the call you see and that it gets you into the remote machine. Does that work, or does it stall there? It could be that the reverse tunneling -R causes problems with MS Windows on the other end (just guessing).

@HenrikBengtsson
Collaborator

Also, you could set options(parallelly.debug = TRUE) to get more output. It should show where it stalls.
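
A minimal sketch of combining the two suggestions so far (debug output plus a dry run). The host, user name, and key path are the values from this thread and serve only as placeholders; the `interactive()` guard is an illustrative choice, not something parallelly requires.

```r
# Turn on parallelly's debug messages, remembering the previous setting
old_opts <- options(parallelly.debug = TRUE)

# Placeholder values taken from this thread; adjust for your own setup
worker <- "10.0.44.224"
key_file <- "/Users/Julian/.ssh/id_rsa"

if (interactive() && requireNamespace("parallelly", quietly = TRUE)) {
  # dryrun = TRUE prints the commands to run by hand instead of
  # setting up a worker
  parallelly::makeClusterPSOCK(
    worker,
    user = "Julian",
    rshopts = c("-i", key_file),
    dryrun = TRUE
  )
}

# Restore the previous debug setting
options(old_opts)
```

The printed ssh command can then be pasted into a terminal to see whether the login itself stalls.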

@Tadge-Analytics
Author

Thanks for these, @HenrikBengtsson. I'll give them a quick try.

@Tadge-Analytics
Author

I did try the 'Failing to set up remote workers' section...
The one-line thing works...

image

I get the following debug output... I'll take a look into it

[15:37:21.214] [local output] makeClusterPSOCK() ...
[15:37:21.214] [local output] Workers: [n = 1] ‘10.0.44.224’
[15:37:21.215] [local output] Base port: 11123
[15:37:21.215] [local output] Getting setup options for 1 cluster nodes ...
[15:37:21.215] [local output]  - Node 1 of 1 ...
[15:37:21.216] [local output] Will search for all 'rshcmd' available

[15:37:21.862] [local output] Found the following available 'rshcmd':
[local output]  1. ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]
[local output]  2. ‘C:\PROGRA~1\PuTTY\plink.exe’, ‘-ssh’ [type=‘putty-plink’, version=‘plink: Release 0.78; Build platform: 64-bit x86 Windows; Compiler: clang 14.0.0 , emulating Visual Studio 2022 (17.2), _MSC_VER=1932, _MSC_FULL_VER=193231329; Source commit: 4ff82ab29a22936b78510c68f544a99e677efed3’]
[local output]  3. ‘C:\PROGRA~1\RStudio\RESOUR~1\app\bin\MSYS-S~1\ssh.exe’ [type=‘rstudio-ssh’, version=‘OpenSSH_5.4p1, OpenSSL 1.0.0 29 Mar 2010’]
[15:37:21.863] [local output] localMachine=FALSE && 'rshcmd' type is "ssh" => revtunnel=TRUE

[15:37:21.864] [local output] Rscript port: 11123 + 0 = 11123

[15:37:21.865] [local output] Using 'rshcmd': ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]
[15:37:21.866] [local output] Getting setup options for 1 cluster nodes ... done
[15:37:21.866] [local output] Creating node 1 of 1 ...
[15:37:21.866] [local output] - setting up node
[15:37:21.866] [local output] - attempt #1 of 3
[15:37:21.867] [local output] Starting worker #1 on ‘10.0.44.224’: "C:\WINDOWS\System32\OpenSSH\ssh.exe" -R 11123:10.0.44.234:11123 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224 "'C:/PROGRA~1/R/R-42~1.2/bin/x64/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=10.0.44.234 PORT=11123 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"
[15:37:22.095] [local output] - Exit code of system() call: 0
[15:37:22.095] [local output] Waiting for worker #1 on ‘10.0.44.224’ to connect back

@Tadge-Analytics
Author

When I try to run the code suggested when I do dry run... Interestingly the quotation marks break the connection process... when I take those quotation marks out it connects fine

image

@Tadge-Analytics
Author

When I try the full command (the first time including the quotation marks, just to show what I get when I copy and paste verbatim; the second time with some quotation-mark cleanup), I get the following:

image

@HenrikBengtsson
Collaborator

Can you paste in as plain text instead of screenshots? Then I can cut'n'paste from it

@Tadge-Analytics
Author

Sure thing. Just let me know which ones you need.
Here is what dryrun instructs me to paste in...

  "C:\WINDOWS\System32\OpenSSH\ssh.exe" -R 11750:10.0.44.234:11750 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224 "'C:/PROGRA~1/R/R-42~1.2/bin/x64/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=10.0.44.234 PORT=11750 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"

@HenrikBengtsson
Collaborator

HenrikBengtsson commented Jan 22, 2023

AFAIU, you said:

C:\WINDOWS\System32\OpenSSH\ssh.exe -R 11750:10.0.44.234:11750 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224

works, but then, in your last screenshot, you got an error:

C:\WINDOWS\System32\OpenSSH\ssh.exe -R 11750:10.0.44.234:11750 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224 'C:/PROGRA~1/R/R-42~1.2/bin/x64/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=10.0.44.234 PORT=11750 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential
The system cannot find file specified.

So, maybe it cannot find 'C:/PROGRA~1/R/R-42~1.2/bin/x64/Rscript' on that other machine. I'm a bit puzzled why it does not use 'Rscript' here. Oh... wait, I see you're using homogeneous = TRUE. Remove that one (or explicitly set it to homogeneous = FALSE).

And now I also see that you're setting master = "10.0.44.234", which explains why that address shows up in the reverse-tunnel option and in MASTER=10.0.44.234. That makes the workers try to connect back via that IP address, which requires setting up port forwarding in the firewall. Without it, the workers connect back through the reverse SSH tunnel, which is always on localhost and circumvents the firewall.

So, retry with:

cl <- parallelly::makeClusterPSOCK("10.0.44.224", user = "Julian",
           rshopts = c("-i", ssh_private_key_file), dryrun = TRUE)

and see if the proposed full call works, i.e. doesn't give that error.
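
To make the difference concrete, here is a small illustration of the ssh `-R <remote port>:<host>:<host port>` option in the two cases. `reverse_tunnel_opt()` is a hypothetical helper written only for this comparison; it is not part of parallelly and does not show how parallelly builds the command internally.

```r
# Hypothetical helper, for illustration only: formats the ssh reverse-tunnel
# option as "-R <remote port>:<host>:<host port>"
reverse_tunnel_opt <- function(port, host = "127.0.0.1") {
  sprintf("-R %d:%s:%d", port, host, port)
}

# With master = "10.0.44.234", as in the first attempt in this thread
with_master <- reverse_tunnel_opt(11750L, host = "10.0.44.234")

# Without 'master': the tunnel ends on the local machine's loopback address
without_master <- reverse_tunnel_opt(11750L)

writeLines(c(with_master, without_master))
```

In the second form, the worker only ever talks to `localhost` on its own side, and ssh carries the traffic back, so no extra port needs to be opened on the machine running the main R session.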

@Tadge-Analytics
Author

giving it a shot now...
As for the options I've chosen (to date)... I'm just piecing it together from what I've seen around a few other blog posts... totally making it up as I go along

@Tadge-Analytics
Author

This is the output I get running your command

> cl <- parallelly::makeClusterPSOCK("10.0.44.224", user = "Julian",
+            rshopts = c("-i", ssh_private_key_file), dryrun = TRUE)
[16:10:22.976] [local output] makeClusterPSOCK() ...
[16:10:22.977] [local output] Workers: [n = 1] ‘10.0.44.224’
[16:10:22.978] [local output] Base port: 11041
[16:10:22.978] [local output] Getting setup options for 1 cluster nodes ...
[16:10:22.978] [local output]  - Node 1 of 1 ...
[16:10:22.979] [local output] Will search for all 'rshcmd' available

[16:10:23.669] [local output] Found the following available 'rshcmd':
[local output]  1. ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]
[local output]  2. ‘C:\PROGRA~1\PuTTY\plink.exe’, ‘-ssh’ [type=‘putty-plink’, version=‘plink: Release 0.78; Build platform: 64-bit x86 Windows; Compiler: clang 14.0.0 , emulating Visual Studio 2022 (17.2), _MSC_VER=1932, _MSC_FULL_VER=193231329; Source commit: 4ff82ab29a22936b78510c68f544a99e677efed3’]
[local output]  3. ‘C:\PROGRA~1\RStudio\RESOUR~1\app\bin\MSYS-S~1\ssh.exe’ [type=‘rstudio-ssh’, version=‘OpenSSH_5.4p1, OpenSSL 1.0.0 29 Mar 2010’]
[16:10:23.670] [local output] localMachine=FALSE && 'rshcmd' type is "ssh" => revtunnel=TRUE

[16:10:23.671] [local output] Rscript port: 11041 + 0 = 11041

[16:10:23.672] [local output] Using 'rshcmd': ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]
[16:10:23.672] [local output] Getting setup options for 1 cluster nodes ... done
[16:10:23.673] [local output] Creating node 1 of 1 ...
[16:10:23.673] [local output] - setting up node
[16:10:23.673] [local output] - attempt #1 of 3
----------------------------------------------------------------------
Manually, (i) login into external machine ‘10.0.44.224’:

  "C:\WINDOWS\System32\OpenSSH\ssh.exe" -R 11041:127.0.0.1:11041 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224

and (ii) start worker #1 from there:

  'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11041 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Alternatively, start worker #1 from the local machine by combining both steps in a single call:

  "C:\WINDOWS\System32\OpenSSH\ssh.exe" -R 11041:127.0.0.1:11041 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224 "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11041 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"

[16:10:23.673] [local output] Creating node 1 of 1 ... done
[16:10:23.674] [local output] Launching of workers completed
[16:10:23.674] [local output] Collecting session information from workers
[16:10:23.674] [local output] makeClusterPSOCK() ... done

@HenrikBengtsson
Collaborator

As for the options I've chosen (to date)... I'm just piecing it together from what I've seen around a few other blog posts... totally making it up as I go along

If any of them use parallelly, and not just parallel, let me know which they are, so I can make sure they're not giving incorrect instructions/suggestions.

@Tadge-Analytics
Author

When I try with dryrun = FALSE

> cl <- parallelly::makeClusterPSOCK("10.0.44.224", user = "Julian",
+            rshopts = c("-i", ssh_private_key_file), dryrun = FALSE)
[16:10:35.743] [local output] makeClusterPSOCK() ...
[16:10:35.743] [local output] Workers: [n = 1] ‘10.0.44.224’
[16:10:35.744] [local output] Base port: 11646
[16:10:35.744] [local output] Getting setup options for 1 cluster nodes ...
[16:10:35.745] [local output]  - Node 1 of 1 ...
[16:10:35.745] [local output] Will search for all 'rshcmd' available

[16:10:36.417] [local output] Found the following available 'rshcmd':
[local output]  1. ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]
[local output]  2. ‘C:\PROGRA~1\PuTTY\plink.exe’, ‘-ssh’ [type=‘putty-plink’, version=‘plink: Release 0.78; Build platform: 64-bit x86 Windows; Compiler: clang 14.0.0 , emulating Visual Studio 2022 (17.2), _MSC_VER=1932, _MSC_FULL_VER=193231329; Source commit: 4ff82ab29a22936b78510c68f544a99e677efed3’]
[local output]  3. ‘C:\PROGRA~1\RStudio\RESOUR~1\app\bin\MSYS-S~1\ssh.exe’ [type=‘rstudio-ssh’, version=‘OpenSSH_5.4p1, OpenSSL 1.0.0 29 Mar 2010’]
[16:10:36.417] [local output] localMachine=FALSE && 'rshcmd' type is "ssh" => revtunnel=TRUE

[16:10:36.419] [local output] Rscript port: 11646 + 0 = 11646

[16:10:36.419] [local output] Using 'rshcmd': ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]
[16:10:36.420] [local output] Getting setup options for 1 cluster nodes ... done
[16:10:36.420] [local output] Creating node 1 of 1 ...
[16:10:36.420] [local output] - setting up node
[16:10:36.420] [local output] - attempt #1 of 3
[16:10:36.420] [local output] Starting worker #1 on ‘10.0.44.224’: "C:\WINDOWS\System32\OpenSSH\ssh.exe" -R 11646:127.0.0.1:11646 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224 "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11646 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"
[16:10:36.641] [local output] - Exit code of system() call: 0
[16:10:36.642] [local output] Waiting for worker #1 on ‘10.0.44.224’ to connect back
Failed to launch and connect to R worker on remote machine ‘10.0.44.224’ from local machine ‘Z13-THINKPAD’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11646 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: "C:\WINDOWS\System32\OpenSSH\ssh.exe" -R 11646:127.0.0.1:11646 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224 "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11646 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential".
 * Troubleshooting suggestions:
   - Suggestion #1: On Windows, output from worker when using 'outfile=NULL' is only visible when running R from a terminal (not a GUI).
   - Suggestion #2: Set 'rshlogfile=TRUE' to enable logging for ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’.
   - Suggestion #3: The 'rshcmd' (‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]) used may not support reverse tunneling (revtunnel = TRUE). See ?parallelly::makeClusterPSOCK for alternatives.


[16:12:36.684] [local output] - waiting 15 seconds before trying again

@Tadge-Analytics
Author

Also, happy to do a quick Zoom session. It might be helpful to see what's going on in real time.

@HenrikBengtsson
Collaborator

Try adding argument rscript_sh = "cmd"

@Tadge-Analytics
Author

oh my! 😁

> cl <- parallelly::makeClusterPSOCK("10.0.44.224", user = "Julian",
+            rshopts = c("-i", ssh_private_key_file), 
+            rscript_sh = "cmd",
+            dryrun = FALSE)
[16:22:28.899] [local output] makeClusterPSOCK() ...
[16:22:28.900] [local output] Workers: [n = 1] ‘10.0.44.224’
[16:22:28.900] [local output] Base port: 11920
[16:22:28.901] [local output] Getting setup options for 1 cluster nodes ...
[16:22:28.901] [local output]  - Node 1 of 1 ...
[16:22:28.902] [local output] Will search for all 'rshcmd' available

[16:22:29.584] [local output] Found the following available 'rshcmd':
[local output]  1. ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]
[local output]  2. ‘C:\PROGRA~1\PuTTY\plink.exe’, ‘-ssh’ [type=‘putty-plink’, version=‘plink: Release 0.78; Build platform: 64-bit x86 Windows; Compiler: clang 14.0.0 , emulating Visual Studio 2022 (17.2), _MSC_VER=1932, _MSC_FULL_VER=193231329; Source commit: 4ff82ab29a22936b78510c68f544a99e677efed3’]
[local output]  3. ‘C:\PROGRA~1\RStudio\RESOUR~1\app\bin\MSYS-S~1\ssh.exe’ [type=‘rstudio-ssh’, version=‘OpenSSH_5.4p1, OpenSSL 1.0.0 29 Mar 2010’]
[16:22:29.585] [local output] localMachine=FALSE && 'rshcmd' type is "ssh" => revtunnel=TRUE

[16:22:29.586] [local output] Rscript port: 11920 + 0 = 11920

[16:22:29.587] [local output] Using 'rshcmd': ‘C:\WINDOWS\System32\OpenSSH\ssh.exe’ [type=‘ssh’, version=‘OpenSSH_for_Windows_8.6p1, LibreSSL 3.4.3’]
[16:22:29.587] [local output] Getting setup options for 1 cluster nodes ... done
[16:22:29.588] [local output] Creating node 1 of 1 ...
[16:22:29.588] [local output] - setting up node
[16:22:29.588] [local output] - attempt #1 of 3
[16:22:29.588] [local output] Starting worker #1 on ‘10.0.44.224’: "C:\WINDOWS\System32\OpenSSH\ssh.exe" -R 11920:127.0.0.1:11920 -l Julian -i /Users/Julian/.ssh/id_rsa 10.0.44.224 "\"Rscript\" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e \"options(socketOptions = \\\"no-delay\\\")\" -e \"workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()\" MASTER=localhost PORT=11920 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"
[16:22:29.809] [local output] - Exit code of system() call: 0
[16:22:29.809] [local output] Waiting for worker #1 on ‘10.0.44.224’ to connect back
[16:22:30.558] [local output] Connection with worker #1 on ‘10.0.44.224’ established
[16:22:30.559] [local output] Creating node 1 of 1 ... done
[16:22:30.559] [local output] Launching of workers completed
[16:22:30.560] [local output] Collecting session information from workers
[16:22:30.567] [local output]  - Worker #1 of 1
[16:22:30.567] [local output] makeClusterPSOCK() ... done
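
The visible difference between the failing and the working launch calls above is the quoting of the Rscript command: single quotes in the failing one, double quotes with escaped inner quotes in the working one. A Windows `cmd.exe` shell does not treat single quotes as quoting characters, which would explain the failure. The sketch below uses base R's `shQuote()` purely to illustrate the two quoting styles; it is an assumption that this mirrors what `rscript_sh` changes, not a statement about parallelly's internals.

```r
# One of the expressions passed to Rscript via -e in the logs above
expr <- 'options(socketOptions = "no-delay")'

# POSIX-shell style: wrapped in single quotes
sh_style <- shQuote(expr, type = "sh")

# Windows cmd.exe style: wrapped in double quotes, inner quotes escaped
cmd_style <- shQuote(expr, type = "cmd")

writeLines(c(sh_style, cmd_style))
```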

@Tadge-Analytics
Author

seems to have worked? Let me give it a quick test with the process I was using...

@Tadge-Analytics
Author

Perfect! Works like a charm!
Thanks so much for your help @HenrikBengtsson
Totally this opens some great doors for a client I'm assisting

HenrikBengtsson added a commit that referenced this issue Jan 22, 2023
… is needed if the remote machines run MS Windows [#96]
@Tadge-Analytics
Author

And just as a follow-up... for setting up the Windows machine to accept SSH connections using SSH keys... I used the following YouTube video... This was really helpful as a fundamental step that I hadn't seen explained anywhere else.
https://www.youtube.com/watch?v=9dhQIa8fAXU

@HenrikBengtsson
Collaborator

Happy to hear. Straight to the poolroom? :p

PS. I've added an example for using remote MS Windows machines to https://parallelly.futureverse.org/reference/makeClusterPSOCK.html.
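
For readers landing here, a minimal sketch of the call that ended up working in this thread, followed by a quick sanity check. This is not the example from the linked documentation; `connect_windows_worker()` is a hypothetical wrapper, and the host, user, and key path are the thread's values used as placeholders.

```r
# Hypothetical wrapper around the call that worked in this thread
connect_windows_worker <- function(host, user, key_file, rscript_sh = "cmd") {
  parallelly::makeClusterPSOCK(
    host,
    user = user,
    rshopts = c("-i", key_file),
    # Quote the Rscript call for a cmd.exe-style shell on the worker
    rscript_sh = rscript_sh
  )
}

# Usage (needs a reachable worker, so it is left commented out here):
# cl <- connect_windows_worker("10.0.44.224", "Julian",
#                              "/Users/Julian/.ssh/id_rsa")
# parallel::clusterEvalQ(cl, Sys.info()[["nodename"]])
# parallel::stopCluster(cl)
```

Note that neither `master` nor `homogeneous` is set, in line with the advice earlier in the thread.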

@Tadge-Analytics
Author

Tadge-Analytics commented Jan 22, 2023 via email

@HenrikBengtsson
Collaborator

for setting up the Windows machine to accept SSH connections using SSH keys ... YouTube video ...

I guess the world is not meant for 14-inch screens anymore 🤷 Oh well... So, getting up and running with SSH keys is tricky the first time, if you've never done it before, but as soon as you know how it should work, it's standard practice. It's been that way on Unix/Linux since ... the 90's. Now, the next step is to learn to set more SSH defaults in ~/.ssh/config; that way you don't have to specify the key option or the user name. See https://wynton.ucsf.edu/hpc/howto/log-in-without-pwd.html#step-4-avoid-having-to-specify-ssh-option--i-on-local-machine for an example.
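
As a rough sketch of what such defaults could look like for the machine in this thread: the entry below is hypothetical (host, user, and key path are the thread's values) and is only printed for review, not written to any file.

```r
# Hypothetical ~/.ssh/config entry, kept as a character vector so it can be
# reviewed before being added to the file by hand
ssh_config_entry <- c(
  "Host 10.0.44.224",
  "  User Julian",
  "  IdentityFile ~/.ssh/id_rsa"
)
writeLines(ssh_config_entry)

# With such an entry in place, 'user' and 'rshopts' could be dropped:
# cl <- parallelly::makeClusterPSOCK("10.0.44.224", rscript_sh = "cmd")
```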

@Tadge-Analytics
Author

Tadge-Analytics commented Feb 11, 2023 via email
