machinefile/ssh improvements #7589

Closed
wavexx opened this issue Jul 13, 2014 · 9 comments
Labels
parallelism Parallel or distributed computation

Comments

@wavexx
Contributor

wavexx commented Jul 13, 2014

I have been trying Julia recently, but I'm having a somewhat hard time actually using all the cores I have access to because of the way the workers are spawned.

First:

Is it somehow possible to enhance the machine file to have a:

host[:count]

format to specify the worker count, instead of repeating the same host 64 times? My current machinefile is a slew of repeated lines. It's actually hard to edit.

Second:

Connecting several times to the same host just to create [n] channels is a bit overkill, especially if we're talking ssh. In fact, I cannot have more than ~20 workers to each host in my case.

I was able to increase the count by using connection multiplexing in ssh itself (by manually configuring ControlMaster/ControlPath), but I would definitely suggest adding:

-o ControlMaster=auto -o ControlPath=[path] -o ControlPersist=5

when connecting via ssh, where [path] should be a temporary path unique to the current Julia master process being run, in order to avoid ControlMaster sharing among different master processes. This saves at least [N-1] processes on the remote end (besides eliminating the connection handshakes). ControlPersist could also be removed if the ControlMaster is managed by Julia, as opposed to ssh's "auto" feature.

I would also strongly recommend ssh -T, in any case, to avoid a tty allocation (we don't need one anyway), since this is another issue when requesting a large number of connections via ssh.
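As a rough illustration of the options suggested above, here is a minimal Python sketch (not Julia's actual launcher code) that assembles such an ssh command line. The function name `ssh_worker_command`, the socket file name `cm-%r@%h:%p`, and the temporary directory prefix are assumptions made for this example; `%r`, `%h` and `%p` are tokens that ssh itself expands to the remote user, host and port.

```python
import os
import tempfile


def ssh_worker_command(host, remote_cmd, control_dir):
    """Build an ssh argument list that shares one multiplexed connection.

    control_dir should be private to the current master process so that
    unrelated masters never share a ControlMaster socket.
    """
    # ssh expands %r, %h and %p to remote user, host and port.
    control_path = os.path.join(control_dir, "cm-%r@%h:%p")
    return [
        "ssh",
        "-T",                                # no tty allocation
        "-o", "ControlMaster=auto",          # first connection becomes the master
        "-o", f"ControlPath={control_path}",
        "-o", "ControlPersist=5",            # keep the master alive 5 s when idle
        host,
        remote_cmd,
    ]


# One private directory per master process (the prefix is illustrative only).
control_dir = tempfile.mkdtemp(prefix=f"julia-ssh-{os.getpid()}-")
cmd = ssh_worker_command("node1", "julia --worker", control_dir)
print(" ".join(cmd))
```

Every further connection to the same host made with the same `control_dir` would reuse the first connection's socket instead of performing a new handshake.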

So far this should be rather trivial to do, but it's still not optimal. It's quite obvious that running N instances of the same process on the same machine should be done by executing a single copy of julia --worker -p [n], then fork [n] times just after bootstrap to share the initialization/memory/setup and use the same communication channel. Given the current non-negligible startup time of Julia, it would make a big difference.
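The fork-after-bootstrap idea can be sketched in a few lines of Python on a Unix system. This only illustrates the general pattern (initialize once, then fork so that children share the initialized state copy-on-write); it is not how Julia implements workers, and `fork_workers`, `init` and `work` are hypothetical names.

```python
import os


def fork_workers(n, init, work):
    """Run init() once, then fork n children that each call work(state, i).

    Each child sends its result back to the parent over its own pipe.
    Returns the results as strings, in worker order. Unix only.
    """
    state = init()  # expensive setup happens exactly once, before forking
    children = []
    for i in range(n):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: inherits `state` without repeating the setup.
            os.close(r)
            os.write(w, str(work(state, i)).encode())
            os.close(w)
            os._exit(0)  # skip the parent's cleanup handlers
        # Parent: keep only the read end of this child's pipe.
        os.close(w)
        children.append((pid, r))

    results = []
    for pid, r in children:
        with os.fdopen(r) as f:
            results.append(f.read())
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    print(fork_workers(3, lambda: {"base": 10}, lambda s, i: s["base"] + i))
```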

@StefanKarpinski
Member

host:port is used to specify the SSH port number so that syntax proposal clashes. You could do host[:port][ x count] or something like that. The rest of these suggestions all seem like good ideas.

@JeffBezanson
Member

Yes, these are great suggestions. Thanks.

@wavexx
Contributor Author

wavexx commented Jul 14, 2014

x count is a bit weird as x would require whitespace without a port. What about:

host[count]
host:port[count]

which mimics array syntax?

@StefanKarpinski
Member

In Julia that's how you index into an array, not how you declare its size, so that seems weird. This would be an option: 16 host:1234 or host:1234 16. I don't really see why whitespace is a problem.

@wavexx
Contributor Author

wavexx commented Jul 15, 2014

The current syntax is already defined to be:

[user@]host[:port] [bind_addr]

so it's ambiguous for bind_addr (though arguably you could check if it's an integer). What about:

[user@]host[:port][*count] [bind_addr]
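To show that this form can be parsed unambiguously, here is a small Python sketch of a parser for it. It is an illustration only, not the code in Julia or in the pull request; `MachineSpec` and `parse_machine_line` are hypothetical names, a missing count is assumed to mean 1, and IPv6 host literals are not handled.

```python
import re
from typing import NamedTuple, Optional

# Assumed grammar: [user@]host[:port][*count] [bind_addr]
_SPEC = re.compile(
    r"^(?:(?P<user>[^@\s]+)@)?"        # optional user@
    r"(?P<host>[^:*\s]+)"              # host name, up to ':' or '*'
    r"(?::(?P<port>\d+))?"             # optional :port
    r"(?:\*(?P<count>\d+))?"           # optional *count
    r"(?:\s+(?P<bind_addr>\S+))?\s*$"  # optional bind address
)


class MachineSpec(NamedTuple):
    user: Optional[str]
    host: str
    port: Optional[int]
    count: int
    bind_addr: Optional[str]


def parse_machine_line(line: str) -> MachineSpec:
    """Parse one machinefile line; raise ValueError if it does not match."""
    m = _SPEC.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized machine spec: {line!r}")
    port, count = m.group("port"), m.group("count")
    return MachineSpec(
        user=m.group("user"),
        host=m.group("host"),
        port=int(port) if port else None,
        count=int(count) if count else 1,  # assumption: default is one worker
        bind_addr=m.group("bind_addr"),
    )


if __name__ == "__main__":
    print(parse_machine_line("alice@node1:2222*16 10.0.0.5"))
```

Because the count is attached to the host with `*` rather than separated by whitespace, an integer-looking bind address cannot be mistaken for it.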

@wavexx
Contributor Author

wavexx commented Jul 15, 2014

I added initial support for *count in the above pull request.

This is just a stop-gap commit. My plan is to add a :n (:count?) argument to Base.addprocs(machines, ...) later, so that I can properly create a shared ssh channel for multiple workers on the same host using ControlPath/ControlMaster, as described above.

@sbromberger
Contributor

Would it also be possible to specify the julia bin directory for the remote machines? I frequently have test builds in various directories on my local machine, but on my remotes they're in different directories (usually /usr/bin).

This is analogous to the dir kwarg for addprocs.

Edit: I've created a PR for this: #9347.

@kshyatt
Contributor

kshyatt commented Sep 15, 2016

What's the status on these suggestions? cc @amitmurthy

@amitmurthy
Contributor

When specified with a count, we now launch only one process on a remote node, and then launch additional workers via that initial instance. So for a directly accessible cluster there is only one ssh connection per node.

Closing the issue. Please reopen if there are any other ssh specific suggestions.
