addprocs problem with multiple nodes in 0.6.1 #24722
Comments
Can you build the |
I think the backport of #21818 onto 0.6 may be the cause of this behavior. The workers now listen on a system-selected ephemeral port, which may not be accessible from the master node. Does |
@ararslan Thank you, I will try it. |
Thanks. Please test with all ports open between all nodes in the cluster as the workers connect to each other too. |
@amitmurthy Thank you for your comments.
|
That is fine. The "address in use" error is expected with the way you tested above: 2 workers cannot both bind to 9009 on the same host. Can you test by opening all ports between all nodes of the cluster (and the master) and running a regular addprocs? |
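The two behaviors mentioned so far (a system-selected ephemeral port, and the "address in use" failure when two listeners ask for the same fixed port on one host) can be reproduced outside Julia. The following is a minimal sketch in Python, not code from this thread; the helper names `bind_listener` and `second_bind_fails` are hypothetical, and it only uses the loopback interface.

```python
import errno
import socket


def bind_listener(port, host="127.0.0.1"):
    """Bind and listen on (host, port). Port 0 asks the OS to pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_fails(port, host="127.0.0.1"):
    """Return True if binding a second listener to the same port is refused
    with EADDRINUSE, which is what two workers on one host would run into."""
    try:
        bind_listener(port, host).close()
    except OSError as err:
        return err.errno == errno.EADDRINUSE
    return False


if __name__ == "__main__":
    first = bind_listener(0)            # port 0: the OS selects the port
    port = first.getsockname()[1]
    print("OS-selected port:", port)
    print("second bind refused:", second_bind_fails(port))
    first.close()
```

The OS-selected port differs from run to run, which is why a firewall that only admits a fixed set of ports can block it.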
I don't have permission for the network right now, so I created a cluster on AWS.
...and it worked fine. One of my co-workers said the problem could be FreeIPA in our cluster. |
No idea about FreeIPA. I am wondering if cluster setups usually block connections to the ephemeral port range. If so, we should address the issue on master too. |
https://discourse.julialang.org/t/addprocs-with-ssh-does-not-work-on-0-6-1/7253/3 I use an IPA system, and all of my remote hosts are connected via IPA. |
And I tried
and
Both cases create errors in version 0.6.1. |
It does appear that the cluster environments in question block connections to ports in the ephemeral port range. Can you check with your sysadmin? Or you could try the following with Julia 0.6.1. In one terminal, open an SSH session to
In another (local) terminal, try connecting to the port printed above (in my case it was 53352; it will be different for you)
It should fail. Repeat the same exercise with the listen port changed to 9009. It should work. |
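The connection attempt from the local terminal can also be scripted. The sketch below is an assumption about how one might automate that check and is not the command used in this thread; `port_reachable` is a hypothetical helper, and the host name in the usage comment is a placeholder.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within
    `timeout` seconds. Refused connections, timeouts, and name resolution
    failures all count as unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (placeholder host; substitute the port the worker printed):
#   port_reachable("remotehost", 53352)   # ephemeral port from the example above
#   port_reachable("remotehost", 9009)    # fixed listen port
```

A firewall that silently drops packets shows up as a timeout, not a refusal, so the `timeout` argument keeps the check from hanging.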
See #24722 (comment) for the cause. We are planning to revert this behavior in 0.6.2 |
Amit's fix for this has now been incorporated into my backport branch. It would be great if you could build |
I tried to build |
The error occurs with your code on
So I changed the port number to
Yet |
@alkorang Try this test binary. That's a generic Linux build of my backport branch. Note: That binary is NOT intended for general use. It is for testing purposes ONLY. |
Sorry, the code block should be
i.e., remove the additional |
The same error occurs as when I opened this issue. |
@amitmurthy With a random port,
With port
|
Reverted in 0.6.2 |
I set up a cluster with multiple nodes, and it works perfectly with version 0.6.0 but not with version 0.6.1.
First I tried the default options, and it did not work. Then I tried the tunnel=true option, which makes it possible to connect to one node, but not to multiple nodes at once. I then tried the same with version 0.6.0, and it worked perfectly.