Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External PMIx server support is broken #1425

Closed
artpol84 opened this issue Mar 4, 2016 · 9 comments
Closed

External PMIx server support is broken #1425

artpol84 opened this issue Mar 4, 2016 · 9 comments

Comments

@artpol84
Copy link
Contributor

artpol84 commented Mar 4, 2016

@ggouaillardet @rhc54 @jladd-mlnx
The following commit:
b55b9e6
Brakes launch through the SLURM plugin the assert:
https://github.com/open-mpi/ompi/blob/master/ompi/proc/proc.h#L399
is triggered because jobid was created from hash function:
https://github.com/open-mpi/ompi/blob/master/opal/mca/pmix/pmix112/pmix1_client.c#L122

@artpol84
Copy link
Contributor Author

artpol84 commented Mar 4, 2016

P.S. if I negate the 0x8000 bit all works fine:

OPAL_HASH_STR(my_proc.nspace, pname.jobid);                                                                                                   
pname.jobid &= ~(0x8000);

@ggouaillardet
Copy link
Contributor

@artpol84 that is the right fix for me.

@artpol84
Copy link
Contributor Author

artpol84 commented Mar 4, 2016

@ggouaillardet Thank you,
Lets see what Ralph will say. And if everybody OK I'll PR.

@artpol84
Copy link
Contributor Author

artpol84 commented Mar 4, 2016

I think that the right solution would be to hash to 32 - 1 = 31 bit and then form the jobid so that bit 0x8000 is not used.
Otherwise we can weaken the hash.

@ggouaillardet
Copy link
Contributor

right,
this is cryptology and I have no such skills ...

@rhc54
Copy link
Contributor

rhc54 commented Mar 4, 2016

I don't think it really matters, to be honest - SLURM doesn't support cross-job connections anyway, and so the jobid is totally arbitrary (we could just set it to '1', I suppose). One alternative we could use would be to have pmix_init check for an envar setting PMIX_JOBID, and then we could atoi that string. Would that help?

@artpol84
Copy link
Contributor Author

artpol84 commented Mar 4, 2016

SLURM may support multi job in future, so it'll be grate to consider that.
I was thinking about using PMIx_Get(PMIX_JOBID) to get the jobid. I think it is more portable way.

@rhc54
Copy link
Contributor

rhc54 commented Mar 4, 2016

We would need to add that to the list of things required to be provided in the environment, but that might be okay - basically, we want an int32_t that corresponds to the nspace string? Also has to be unique across the potential connection space, which could be tricky to guarantee...which is why we went to the string nspace anyway.

I wonder if we should just worry about a temporary fix here? The plan is for orte to move to using the nspace instead of an int jobid anyway, so this may not be a longterm problem.

@jsquyres
Copy link
Member

jsquyres commented Mar 8, 2016

Per the 2016-03-08 call, @rhc54 will fix this. He thinks it's a trivial fix (even if it's a short-term fix).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants