machinefile/ssh improvements #7589

wavexx · 2014-07-13T22:20:10Z

I've been trying Julia recently, but I'm having a hard time actually using all the cores I have access to because of the way the workers are spawned.

First:

Is it somehow possible to enhance the machine file to have a:

host[:count]

format to specify the worker count, instead of repeating the same host 64 times? My current machinefile is a slew of repeated lines. It's actually hard to edit.

Second:

Connecting several times to the same host just to create [n] channels is overkill, especially over ssh. In fact, I cannot get more than ~20 workers per host in my case.

I was able to increase the count by using connection multiplexing in ssh itself (by manually configuring ControlMaster/ControlPath), but I would definitely suggest adding:

-o ControlMaster=auto -o ControlPath=[path] -o ControlPersist=5

when connecting via ssh, where [path] should be a temporary path unique to the current Julia master process being run, in order to avoid ControlMaster sharing among different master processes. This saves at least [N-1] processes on the remote end (besides eliminating the connection handshakes). ControlPersist could also be removed if the ControlMaster is managed by Julia, as opposed to ssh's "auto" feature.

I would also strongly recommend ssh -T, in any case, to avoid a tty allocation (we don't need one anyway), since this is another issue when requesting a large number of connections via ssh.
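To make the two ssh suggestions above concrete, here is a minimal sketch of how a launcher could assemble the command line. It is written in Python purely for illustration and is not how Julia builds its ssh command; the function name `ssh_command`, the host `node01`, and the socket file name are assumptions. The `%r`, `%h`, and `%p` tokens are standard ssh ControlPath placeholders for remote user, host, and port.

```python
import os
import tempfile


def ssh_command(host, remote_cmd, control_dir, persist_seconds=5):
    """Build an ssh argv that multiplexes over one shared connection per host."""
    # ssh expands %r, %h and %p to the remote user, host and port, so every
    # destination gets its own control socket inside control_dir.
    control_path = os.path.join(control_dir, "cm-%r@%h:%p")
    return [
        "ssh",
        "-T",                                    # do not allocate a tty
        "-o", "ControlMaster=auto",              # first connection becomes the master
        "-o", f"ControlPath={control_path}",
        "-o", f"ControlPersist={persist_seconds}",
        host,
        remote_cmd,
    ]


# A private directory per master process keeps control sockets from being
# shared between unrelated master processes.
control_dir = tempfile.mkdtemp(prefix=f"ssh-mux-{os.getpid()}-")
cmd = ssh_command("node01", "julia --worker", control_dir)
print(" ".join(cmd))
```

Every later call with the same `control_dir` and host reuses the first connection, so only one ssh handshake per host is paid.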

So far this should be rather trivial to do, but it's still not optimal. It's quite obvious that running N instances of the same process on the same machine should be done by executing a single copy of julia --worker -p [n], then fork [n] times just after bootstrap to share the initialization/memory/setup and use the same communication channel. Given the current non-negligible startup time of Julia, it would make a big difference.
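The fork-after-bootstrap idea can be illustrated with a small Unix-only sketch, again in Python rather than Julia. Everything here is hypothetical: `run_forked_workers` and the `shared_state` table stand in for the expensive initialization, and each child reports back over its own pipe only to keep the example short, which is not a model of the shared communication channel described above.

```python
import os


def run_forked_workers(n):
    """Initialise once, then fork n workers that inherit the initialised state."""
    # Stand-in for the expensive bootstrap; it runs once, in the parent.
    shared_state = {"table": [i * i for i in range(1000)]}

    readers = []
    pids = []
    for worker_id in range(n):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: the state is already in memory (copy-on-write),
            # so nothing is re-initialised here.
            os.close(read_fd)
            result = f"{worker_id}:{shared_state['table'][worker_id]}"
            os.write(write_fd, result.encode())
            os.close(write_fd)
            os._exit(0)
        # Parent: keep only the read end of this worker's pipe.
        os.close(write_fd)
        readers.append(read_fd)
        pids.append(pid)

    results = []
    for read_fd, pid in zip(readers, pids):
        with os.fdopen(read_fd) as pipe:
            results.append(pipe.read())
        os.waitpid(pid, 0)
    return results


print(run_forked_workers(3))
```

The point of the sketch is only that the setup cost is paid once, no matter how many workers are forked afterwards.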

StefanKarpinski · 2014-07-13T22:27:40Z

host:port is used to specify the SSH port number so that syntax proposal clashes. You could do host[:port][ x count] or something like that. The rest of these suggestions all seem like good ideas.

JeffBezanson · 2014-07-14T01:27:55Z

Yes, these are great suggestions. Thanks.

wavexx · 2014-07-14T09:15:10Z

x count is a bit weird as x would require whitespace without a port. What about:

host[count]
host:port[count]

which mimics array syntax?

StefanKarpinski · 2014-07-14T23:44:25Z

In Julia that's how you index into an array, not how you declare its size, so that seems weird. This would be an option: 16 host:1234 or host:1234 16. I don't really see why whitespace is a problem.

wavexx · 2014-07-15T10:42:34Z

The current syntax is already defined to be:

[user@]host[:port] [bind_addr]

so it's ambiguous for bind_addr (though arguably you could check if it's an integer). What about:

[user@]host[:port][*count] [bind_addr]
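As a rough check that this form is unambiguous, here is a minimal Python sketch of a parser for it. It follows the syntax exactly as proposed in this comment and is not Julia's actual machinefile parser; `MachineSpec`, `parse_machine_line`, and the sample hosts are hypothetical, and bracketed IPv6 hosts are not handled.

```python
import re
from typing import NamedTuple, Optional

# [user@]host[:port][*count] [bind_addr]
LINE_RE = re.compile(
    r"^(?:(?P<user>[^@\s]+)@)?"
    r"(?P<host>[^:*\s]+)"
    r"(?::(?P<port>\d+))?"
    r"(?:\*(?P<count>\d+))?"
    r"(?:\s+(?P<bind_addr>\S+))?$"
)


class MachineSpec(NamedTuple):
    user: Optional[str]
    host: str
    port: Optional[int]
    count: int
    bind_addr: Optional[str]


def parse_machine_line(line: str) -> MachineSpec:
    """Parse one machinefile line; a missing count means a single worker."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised machinefile line: {line!r}")
    return MachineSpec(
        user=m["user"],
        host=m["host"],
        port=int(m["port"]) if m["port"] else None,
        count=int(m["count"]) if m["count"] else 1,
        bind_addr=m["bind_addr"],
    )


print(parse_machine_line("alice@node01:2222*64 10.0.0.5"))
print(parse_machine_line("node02"))
```

Because the count is attached to the host token with `*`, the optional second field can only be bind_addr, so there is no need to guess whether a trailing integer is a count.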

wavexx · 2014-07-15T13:01:21Z

I added initial support for *count in the above pull request.

This is just a stop-gap commit. My plan is to add a :n (:count?) argument to Base.addprocs(machines, ...) later, so that I can properly create a shared ssh channel for multiple workers on the same host using ControlPath/ControlMaster, as described above.

sbromberger · 2014-12-13T19:24:46Z

Would it also be possible to specify the julia bin directory for the remote machines? I frequently have test builds in various directories on my local machine, but on my remotes they're in different directories (usually /usr/bin).

This is analogous to the dir kwarg for addprocs.

Edit: I've created a PR for this: #9347.

kshyatt · 2016-09-15T02:35:25Z

What's the status on these suggestions? cc @amitmurthy

amitmurthy · 2016-09-15T03:56:27Z

When specified with a count, we now launch only one process on a remote node, and then launch additional workers via that initial instance. So for a directly accessible cluster there is only one ssh connection per node.

Closing the issue. Please reopen if there are any other ssh specific suggestions.

JeffBezanson added performance and removed performance labels Jul 14, 2014

This was referenced Jul 14, 2014

make julia -p N use fork instead of exec #985

Closed

Add -T -a to the default ssh command/s. #7599

Merged

wavexx mentioned this issue Jul 15, 2014

Support host count in machinefile #7616

Merged

amitmurthy mentioned this issue Sep 11, 2014

RFC: reworked cluster manager interface #8306

Merged

sbromberger mentioned this issue Dec 13, 2014

added dirs= option to machinefile parsing #9347

Closed

dahlend mentioned this issue Mar 12, 2015

Machinefile nonuniform install locations JuliaLang/Distributed.jl#23

Closed

amitmurthy closed this as completed Sep 15, 2016
