Per the discussion on the telecon, change the -host behavior yet again #1353
Conversation
@rhc54 I'm not sure the right things are happening. I have 2 hosts, mpi009,mpi010 -- with 16 and 24 cores, respectively.
The more I think about the ??? case from my prior comment, the more I think it should be GOOD. Specifically: it makes the … That being said, it does make it weird that …
@jsquyres and I discussed this on the phone and agreed that fixing the oversubscribed issue is the only change that is required.
@jsquyres Okay, I believe this now fits the agreed-upon behavior. Please verify and commit.
Hmm. It's mostly what I expected:

```
$ mpirun -np --host mpi009 hostname
mpi009

# This is a 16 core machine
$ mpirun -np 16 --host mpi009 hostname
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009
mpi009

# There are 16 cores, so this should error
$ mpirun -np 17 --host mpi009 hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 17 slots
that were requested by the application:

hostname

Either request fewer slots for your application, or make more slots available
for use.
$ echo $status
1
```

However, I thought we agreed that …

```
$ mpirun --host mpi009,mpi010 hostname
mpi009
mpi010
$ mpirun --host mpi009,mpi010,mpi011 hostname
mpi009
mpi010
mpi011
```

That's what it did in my test above, at least (run N*num_cores copies, not N copies). It's subjective, I suppose -- either way could be "correct". What did we do in previous versions?
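To make the two interpretations being debated here concrete, the sketch below uses the mpi009 (16 cores) and mpi010 (24 cores) hosts from earlier in this thread; the outputs are illustrative only, not captured from an actual run.

```
# Interpretation A: --host with no -np launches one copy per listed host ("N copies")
$ mpirun --host mpi009,mpi010 hostname
mpi009
mpi010

# Interpretation B: --host with no -np fills every detected core ("N*num_cores copies")
$ mpirun --host mpi009,mpi010 hostname
# ...would print mpi009 16 times and mpi010 24 times (40 lines in total)
```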
I honestly don't remember - how about we declare it "good enough"? All these corner case conditions are already making the code pretty complex and I'd hate to keep adding to it.
@jsquyres okay, I surrender - updated to address your last comment.
Be aware that I made that last comment based on our last phone conversation, where we discussed what OMPI did in prior versions. ...but I think both our memories of prior OMPI versions were faulty (which should probably be no surprise!). I ran tests this morning going back to v1.6.0 to see what we've been doing with -host; the results are in the spreadsheet. It turns out that this PR is currently changing a lot of existing behavior -- probably much more than we want it to. @rhc54 I know you're totally sick of this 😦 but would it be ok if we talk through the spreadsheet and come up with some minimal change compared to v1.10.2? Sorry!!
Oops -- the google spreadsheet is now publicly accessible. |
thx - just to be clear: I am in no way proposing that this come over to the 1.10 series. This is a PR for master only. For 1.10, I'll just backport the change that sets the exit status. I see no reason to change the behavior of -host in that series.
@jsquyres Okay, I looked at the spreadsheet. I honestly don't have time to address all those proposed changes yet again. The logic is getting impenetrable trying to handle all the edge cases, and I'm out of time. So this isn't happening in the near future.
@rhc54 Ok, fair enough. Just to be clear, though, I was trying to say that it might be easier to ditch the PR, because it seems to have gone too far. The actual desired changes (compared to master and v2.x) are actually much smaller. I added 2 columns at the end to show current git master and current v2.x behavior. These hopefully make it clearer that we're actually quite close to the desired behavior. That being said, I'm mostly concerned about the fact that we're basically introducing new flip-flopping of the -host behavior between releases.
which is why I was sighing at the very beginning of this mess... frankly, this entire -host/-hostfile behavior thing has been a nightmare from the beginning, as the community cannot seem to consistently agree on what it should do.
We're going to discuss this at the Dallas developer meeting next week.
Per discussion in Dallas, for 2.0.0: …
@jsquyres I think this follows what we decided - please check.
@jsquyres NOTE: a custom patch will be required for v2.x - this change will not cleanly apply.
It's very, very close. I only found one discrepancy: …
So I assume you have less than 1000 entries in that hostfile, none of which have specified slots, yes? And the totality of the number of cores across all entries is less than 1000? |
Correct: I have 2 nodes listed in my hostfile, and the total core count across them is well under 1000.
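For reference, a minimal sketch of the kind of hostfile being discussed -- entries with no slot counts specified, so each host's slot count falls back to its detected cores. The file name and contents are hypothetical; the host names are the ones used earlier in this thread, and the commented `slots=N` lines show what the entries would look like if slots *were* specified.

```
# myhostfile -- no slots specified, so slots default to the detected core counts
mpi009
mpi010

# for contrast, entries with explicit slot counts would look like:
# mpi009 slots=16
# mpi010 slots=24
```

Such a file would be passed as `mpirun -hostfile myhostfile ...`.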
bot:retest
@rhc54 I'm sorry, but the behavior with these latest commits changed quite a bit compared to my tests yesterday afternoon. 😦
...I didn't test other combinations, because I think these changes mean that some unintended changes crept in...? FYI: I'm going off column J in the google spreadsheet as what we decided in Dallas. Please correct me if that's wrong...
I quit - this will have to wait. I would suggest unblocking 2.0 for it.
Okay. We should document differences from 1.10. I'll open an issue to track.
I revised the PR description to more accurately reflect where we wound up:

- Per the discussion on the telecon, change the -host behavior so we only run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the number of processing elements found.
- Ensure the returned exit status is non-zero if we fail to map.
- If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported.
- If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules.
- If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row 7 on Jeff's spreadsheet is incorrect.

With that one correction, this now passes all the given use-cases on that spreadsheet.
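To illustrate the two ways of setting slots on the -host line that the revised description refers to, here is a small hypothetical sketch; the host name is reused from earlier in the thread and the outputs are illustrative, not from an actual run.

```
# Listing a host multiple times sets its slot count to the number of occurrences,
# so mpi009 gets 3 slots here and -np 3 maps without oversubscription
$ mpirun -np 3 --host mpi009,mpi009,mpi009 hostname
mpi009
mpi009
mpi009

# Equivalently, the slot count can be given explicitly with the host:N syntax
$ mpirun -np 3 --host mpi009:3 hostname
mpi009
mpi009
mpi009
```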
Per conversation on the call today (29 Mar 2016), we updated column J a bit in the spreadsheet. The goal is to make …
Let's bring this in so folks can decide if they like it or not.
Per the discussion on the telecon, change the -host behavior so we only run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the number of processing elements found.
Ensure the returned exit status is non-zero if we fail to map.
If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported.
If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules.
If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row #7 on Jeff's spreadsheet is incorrect.
With that one correction, this now passes all the given use-cases on that spreadsheet.
Make things behave under unmanaged allocations more like their managed cousins - if the #slots is given, then running without -np shall fill all of the slots.
Fixes #1344
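To make that last point (unmanaged allocations behaving like their managed cousins) concrete, here is a small hypothetical sketch using the host:N slot syntax described above; the slot count and output are illustrative, not from an actual run.

```
# Slots are explicitly given (2 on mpi009) and no -np is specified,
# so the job fills all of the given slots, as it would under a managed allocation
$ mpirun --host mpi009:2 hostname
mpi009
mpi009
```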