ClusterManager should use dispatch instead of function pointers #8168
Comments
I don't see a significant need to change this. |
I think the interface here should be revisited in any case. At the moment, the cluster manager's |
+1 for revisiting this. Maybe it's feasible to integrate something I've been asking for on the ClusterManagers.jl repo, i.e. room for a batch version of the cluster manager (or at least some guidance). On most large systems it's infeasible to run a job interactively, so I think it would be desirable to have some kind of batch manager. That means you first get a list of compute node names from a scheduler, then launch a master julia which takes that list and launches on all nodes with |
That is essentially what ClusterManagers does. If you allocate the nodes on your own outside of julia and then pass those to julia at startup, ClusterManagers shouldn't be required. This is fine for batch usage, but for interactive usage to connect the head node to the allocated clusters, some more setup is perhaps required. Cc: @amitmurthy |
Yes. I guess I'm talking about something that does the part
I guess you refer to
If instead I launch one julia master, read and parse |
The above problem could probably be solved by tweaking From the man page:
So, on my machine, with 10 concurrent unauthenticated connections (i.e., ssh setup is in progress), further connections will be refused with a probability of 30%. You could try tweaking the above parameter, or alternatively, limiting each unique hostname in $PBS_NODEFILE to 10. As an example, you could look at |
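The second suggestion above, limiting each unique hostname to 10 entries, could be sketched as follows. This is only an illustration in current Julia syntax: `cap_per_host` is a hypothetical helper, not part of Base or ClusterManagers.jl, and the host list here is made up rather than read from a real node file.

```julia
# Hypothetical helper: keep at most `limit` entries per hostname,
# preserving the order in which hosts first appear.
function cap_per_host(hosts::Vector{String}, limit::Integer=10)
    counts = Dict{String,Int}()
    kept = String[]
    for h in hosts
        n = get(counts, h, 0)
        if n < limit
            push!(kept, h)
            counts[h] = n + 1
        end
    end
    return kept
end

# In a PBS job the list would presumably come from the node file, e.g.
#   hosts = readlines(ENV["PBS_NODEFILE"])
# A made-up list is used here so the sketch runs anywhere.
hosts = vcat(fill("node1", 12), fill("node2", 3))
capped = cap_per_host(hosts)
```

With the made-up list, `node1` is cut from 12 entries to 10 and `node2` keeps all 3. Whether dropping the extra entries is acceptable depends on how many workers per node the job actually needs.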
I am wondering if this should be the default behavior of #7616 should be enhanced to support this arg too. |
Gotcha. That must be the reason why my manual approach of looping over the
On Monday, 1 September 2014, Amit Murthy notifications@github.com wrote:
|
I'm glad this has spurred a useful discussion. Would somebody like to open a new issue for However this is pretty far from the original topic of this issue. @amitmurthy how do you feel about the proposed change? |
Will submit a PR addressing @floswald's issue in the next couple of days. In principle, I agree with revisiting the ClusterManager interface. Other than the multiple dispatch change, separating the "launch" and "create_worker" steps into two concurrent tasks will also address @simonster's issue w.r.t. the delayed "create_worker" causing some workers to exit prematurely. |
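One way to picture the "two concurrent tasks" idea is a producer/consumer pair: one task launches workers and hands each one over as soon as it is started, while a second task sets them up without waiting for the whole batch. The sketch below uses current Julia syntax (`Channel`, `@async`), and the strings are stand-ins for real worker processes. It is an assumption about the general shape, not the implementation that eventually landed.

```julia
# Hypothetical sketch: split launching and worker setup into two tasks.
launched = Channel{String}(32)   # hand-off queue between the two tasks
ready = String[]                 # workers that finished setup

# Task 1: stand-in for the "launch" step.
launcher = @async begin
    for i in 1:4
        put!(launched, "worker-$i")
    end
    close(launched)              # signals that no more workers will arrive
end

# Task 2: stand-in for the "create_worker" step. It picks up each
# worker as soon as it is available instead of after the full launch.
connector = @async begin
    for w in launched
        push!(ready, w * " connected")
    end
end

wait(launcher)
wait(connector)
```

Because the second task starts handling a worker right after it is launched, a slow launch of later workers no longer delays the setup of earlier ones.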
fixed by #8306 |
ClusterManager is defined like this:
This uses function pointers instead of multiple dispatch. There does not seem to be a need for this flexibility -- cluster manager objects do not modify these pointers once they are created.
This should probably read like this instead:
Similarly for SSHManager.
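To illustrate the difference between the two styles, here is a minimal sketch in current Julia syntax. All type and function names (`PointerManager`, `DispatchManager`, `pointer_launch`, and so on) are made up for illustration and are not the actual definitions in Base.

```julia
# Hypothetical names throughout; not the definitions from Base.
abstract type AbstractManager end

# Style 1: behaviour stored in fields (the "function pointer" style).
struct PointerManager <: AbstractManager
    launch::Function
    manage::Function
end

pointer_launch(np) = ["ptr-worker-$i" for i in 1:np]
pointer_manage(id, op) = "ptr: $op on $id"

# Style 2: behaviour attached to the manager type via multiple dispatch.
struct DispatchManager <: AbstractManager end

launch(::DispatchManager, np::Integer) = ["worker-$i" for i in 1:np]
manage(::DispatchManager, id::Integer, op::Symbol) = "$op on worker $id"

pm = PointerManager(pointer_launch, pointer_manage)
dm = DispatchManager()

via_field = pm.launch(2)       # the call goes through a field lookup
via_dispatch = launch(dm, 2)   # the method is selected by the type of `dm`
```

In the dispatch style, a new manager only needs its own type plus `launch` and `manage` methods, and the struct carries no fields that nobody ever reassigns.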