machinefile/ssh improvements #7589
Comments
Yes, these are great suggestions. Thanks. |
x count is a bit weird as x would require whitespace without a port. What about: host[count] which mimics array syntax? |
In Julia that's how you index into an array, not how you declare its size, so that seems weird. This would be an option: |
The current syntax is already defined to be:
so it's ambiguous for bind_addr (though arguably you could check if it's an integer). What about:
|
I added an initial version support This is just a stop-gap commit. My plan is to add a |
Would it also be possible to specify the julia bin directory for the remote machines? I frequently have test builds in various directories on my local machine, but on my remotes they're in different directories (usually This is analogous to the Edit: I've created a PR for this: #9347. |
What's the status on these suggestions? cc @amitmurthy |
When specified with a count, we now launch only one process on a remote node, and then launch additional workers via that initial instance. So for a directly accessible cluster there is only one ssh connection per node. Closing the issue. Please reopen if there are any other ssh specific suggestions. |
I was trying Julia recently, but I'm having a somewhat hard time actually using all the cores I have access to because of the way the workers are spawned.
First:
Is it somehow possible to enhance the machine file to have a:
host[:count]
format to specify the worker count, instead of repeating the same host 64 times? My current machinefile is a slew of repeated lines. It's actually hard to edit.
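To make the request concrete, here is a minimal sketch of how a `host[:count]` line could be expanded into one entry per worker. It is written in Python purely for illustration and is not Julia's actual machinefile parser; the function name `expand_machinefile` is hypothetical. It treats any trailing integer as a worker count, so this form cannot also express a port.

```python
def expand_machinefile(text):
    """Expand `host[:count]` lines into one host entry per worker.

    Hypothetical helper for illustration only: a line without a numeric
    `:count` suffix counts as a single worker on that host.
    """
    workers = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, sep, count = line.rpartition(":")
        if sep and count.isdigit():
            workers.extend([host] * int(count))
        else:
            workers.append(line)
    return workers


example = "node1:3\nnode2\n\nuser@node3:2\n"
expanded = expand_machinefile(example)
```

With this, 64 workers on one host would need a single `node1:64` line rather than 64 repeated lines.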
Second:
Connecting several times to the same host just to create [n] channels is overkill, especially over ssh. In fact, I cannot get more than ~20 workers per host in my case.
I was able to increase the count by using the connection multiplexing in ssh itself (by manually configuring ControlMaster/ControlPath), but I would definitely suggest adding:
-o ControlMaster=auto -o ControlPath=[path] -o ControlPersist=5
when connecting via ssh, where [path] should be a temporary path unique to the current Julia master process being run, in order to avoid ControlMaster sharing among different master processes. This saves at least [N-1] processes on the remote end (besides eliminating the connection handshakes). ControlPersist could also be removed if the ControlMaster is managed by Julia, as opposed to ssh's "auto" feature.
I would also strongly recommend ssh -T, in any case, to avoid a tty allocation (we don't need one anyway), since this is another issue when requesting a large number of connections via ssh.
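As a rough sketch of the two suggestions above, the following Python snippet assembles an ssh command line with multiplexing enabled and tty allocation disabled. The helper name `ssh_command`, the host `node1`, and the directory prefix are assumptions made for illustration; the control socket lives in a private temporary directory created per master process, so different masters never share a ControlMaster. `%r`, `%h`, and `%p` are the standard ssh_config tokens for remote user, host, and port.

```python
import os
import tempfile


def ssh_command(host, remote_cmd, control_dir):
    """Build an ssh argv that multiplexes over one connection per host."""
    # ssh expands %r, %h and %p, giving one control socket per destination.
    control_path = os.path.join(control_dir, "%r@%h:%p")
    return [
        "ssh",
        "-T",  # no tty allocation; the workers do not need one
        "-o", "ControlMaster=auto",
        "-o", f"ControlPath={control_path}",
        "-o", "ControlPersist=5",
        host,
        remote_cmd,
    ]


# One private directory per master process (mkdtemp creates it with mode 0700).
control_dir = tempfile.mkdtemp(prefix=f"julia-ssh-{os.getpid()}-")
cmd = ssh_command("node1", "julia --worker", control_dir)
```

The list can be passed to `subprocess.Popen` as is; every later connection to the same host then reuses the first connection's socket instead of performing a new handshake.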
So far this should be rather trivial to do, but it's still not optimal. It's quite obvious that running N instances of the same process on the same machine should be done by executing a single copy of
julia --worker -p [n]
, then forking [n] times just after bootstrap, so that the workers share the initialization, memory, and setup and use the same communication channel. Given the current non-negligible startup time of Julia, this would make a big difference.
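The pre-fork idea can be illustrated with a small, Unix-only sketch: one process pays the startup cost once, then forks [n] children that inherit the initialized state and report back over a single shared pipe. This is an illustration in Python, not how Julia workers are implemented; `bootstrap` and `spawn_workers` are hypothetical names, and the "expensive" initialization is a stand-in.

```python
import os


def bootstrap():
    """Stand-in for expensive one-time initialization."""
    return {"table": list(range(1000))}


def spawn_workers(n):
    """Initialize once, fork n workers, and collect one line from each."""
    state = bootstrap()  # done once, in the parent only
    read_fd, write_fd = os.pipe()  # single channel shared by all workers
    pids = []
    for worker_id in range(n):
        pid = os.fork()
        if pid == 0:
            # Child: inherits `state` without re-running bootstrap().
            os.close(read_fd)
            message = f"{worker_id}:{sum(state['table'])}\n"
            os.write(write_fd, message.encode())
            os._exit(0)  # leave without running the parent's cleanup
        pids.append(pid)
    os.close(write_fd)  # parent keeps only the read end
    for pid in pids:
        os.waitpid(pid, 0)
    with os.fdopen(read_fd) as reader:
        return sorted(reader.read().splitlines())


results = spawn_workers(3)
```

Each child sees the parent's memory copy-on-write, so the cost of `bootstrap()` is paid once no matter how many workers are started.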