embedded hwloc 2.7.1 is too old for Intel processors with Ponte Vecchio accelerators #11246
Comments
This should also be noted in the docs and/or in the configury test for the minimum hwloc version.
Isn't that going to be a significant issue for the distros? I believe their defaults are quite a bit older, aren't they?
It might be hard to do a configury check for this at the moment. Front-end nodes will likely be vanilla Intel Xe; only the back-end nodes will have the Ponte Vecchio accelerators that seem to cause the issue. Where in the docs would you recommend writing a blurb about this?
Ah, so you want to do a runtime check of the version? I guess we can do that simply enough - it would have to be in PMIx so we can cover both mpirun and direct-launch modes. Would you mind opening an issue over there so we don't forget? @jsquyres Is this going to be an issue regarding default hwloc versions on distros (I'm thinking of Amazon here, so @bwbarrett)? I don't know of any other solution, frankly, though I wonder if it wouldn't segfault if we asked it not to include those devices. We can do that if we pass the appropriate flags - @hppritcha, has that been tried?
I guess a runtime check in PMIx would be nice, although the segfault was happening inside prte, I think. Anyway, an easy workaround at ANL is to use the hwloc they installed via a Spack build.
@hppritcha Ah, you edited the description of this PR -- I think I understand better now: it's a run-time error with older hwloc on Intel with PV accelerators. In this case, is it easy to add a run-time check to see a) if we're on a machine with Intel PV accelerators, and b) the version of hwloc? If we can detect both of these things at run time, then we should probably show_help() a warning that advises the user of the issue and that they might need to re-build Open MPI with an hwloc >= v2.8.
Pretty sure there is a "get_version" function in hwloc - will fiddle with it in PMIx as that is the base layer that provides the topology.
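For reference, a minimal sketch of that kind of runtime check using hwloc's public hwloc_get_api_version(). One caveat: the returned API version encodes the major and minor numbers but not the patch level, so it can test for 2.8-or-newer but cannot distinguish 2.7.1 from 2.7.2.

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    /* API version of the hwloc library actually linked at run time;
     * 0x00020800 corresponds to the hwloc 2.8 API. */
    unsigned version = hwloc_get_api_version();

    if (version < 0x00020800) {
        fprintf(stderr,
                "warning: hwloc %u.%u is older than 2.8 and may segfault "
                "on nodes with Ponte Vecchio accelerators\n",
                version >> 16, (version >> 8) & 0xff);
        return 1;
    }
    return 0;
}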
K. @hppritcha is there an easy way to tell that we're on a platform with Intel PV accelerators? Perhaps the presence of some file in
Good point - we don't want to blanket block hwloc versions less than 2.8 for all platforms.
@rhc54 says he knows a way - see openpmix/openpmix#2893
Let's ask @bgoglin - is there a way for us to detect that Intel PV accelerators are present on a system prior to using HWLOC? Please see above discussion as to why that is an important question. Any guidance would be appreciated!
I don't have access to a PV server, I am going to ask. I'd assume we'd need to look for specific PCI vendor:device IDs.
PCI vendor:device is indeed a good way to identify PV devices, and the list at https://pci-ids.ucw.cz/read/PC/8086 looks correct when it reports device ID = 0x0db[05-9ab] for Ponte Vecchio.

However, the hwloc bug is related to devices with multiple "levelzero subdevices", not to PV specifically, and there are some non-PV devices with multiple subdevices. But those are likely rare for now (discrete GPUs), at least when used in HPC. One solution would be to query Level Zero subdevices instead of looking for PCI IDs. I am not aware of any way to query subdevices from sysfs or anything outside the L0 library.

By the way, the hwloc fix (762fdc4bfb8fc33b304511149a532a122ade395f) is basically the only change in 2.7.x after 2.7.1. I can release a 2.7.2 if needed.
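For illustration, a standalone sketch of the PCI-ID approach on Linux. It assumes the standard sysfs layout, and it reads the device-ID range from the pci-ids list quoted above (0x0db0 and 0x0db5-0x0dbb, interpreting "0x0db[05-9ab]"); that list may not be exhaustive.

#include <dirent.h>
#include <stdio.h>

/* Read a small sysfs attribute such as "0x0db5\n" as a hex value. */
static int read_hex(const char *path, unsigned *val)
{
    FILE *f = fopen(path, "r");
    int rc;
    if (NULL == f) return -1;
    rc = fscanf(f, "%x", val);
    fclose(f);
    return (1 == rc) ? 0 : -1;
}

int main(void)
{
    struct dirent *de;
    DIR *dir = opendir("/sys/bus/pci/devices");
    if (NULL == dir) return 0;
    while (NULL != (de = readdir(dir))) {
        char path[512];
        unsigned vendor, device, nibble;
        if ('.' == de->d_name[0]) continue;
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/vendor", de->d_name);
        if (read_hex(path, &vendor) != 0 || vendor != 0x8086) continue;
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/device", de->d_name);
        if (read_hex(path, &device) != 0) continue;
        /* Ponte Vecchio IDs per the list above: 0x0db0, 0x0db5-0x0db9, 0x0dba, 0x0dbb */
        nibble = device & 0xf;
        if ((device & 0xfff0) == 0x0db0 && (0 == nibble || (nibble >= 5 && nibble <= 0xb))) {
            printf("Ponte Vecchio device at %s (8086:%04x)\n", de->d_name, device);
        }
    }
    closedir(dir);
    return 0;
}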
Hmmm...that sounds like we need to simply block everything below 2.7.2? It's an ugly "fix", but I cannot think of any other way to protect against the segfault - can you? Adding an L0 dependency is possible, I suppose, but given the history we've had with that library, not something I'm wild about doing. If someone wants to contribute it (it would go in the PMIx pgpu/intel component), I'm willing to look at it. Otherwise, the HWLOC cutoff is the only solution I can see.
I released hwloc 2.7.2 yesterday.
Thanks @bgoglin! Have you received any response to the question regarding how we detect that PV is present prior to invoking HWLOC? Requiring hwloc 2.7.2 and above seems pretty onerous - is working through L0 really the only way to reliably do it? If so, what are the chances of someone providing that code?
Borrowing liberally from the hwloc code in 2.8.0, what if we did something like this:

#include <stdio.h>
#include <stdlib.h>
#include <level_zero/ze_api.h>

/* Returns 1 if any Level Zero device reports subdevices (the case that
 * trips the hwloc < 2.7.2 bug), 0 if none are present, -1 on error. */
static int check_for_subdevices(void)
{
    ze_result_t res;
    ze_driver_handle_t *drh;
    uint32_t nbdrivers, i;
    int found = 0;

    res = zeInit(0);
    if (res != ZE_RESULT_SUCCESS) {
        return 0; /* no usable Level Zero runtime - we are okay */
    }
    nbdrivers = 0;
    res = zeDriverGet(&nbdrivers, NULL);
    if (res != ZE_RESULT_SUCCESS || 0 == nbdrivers) {
        return 0; /* no drivers - we are okay */
    }
    drh = malloc(nbdrivers * sizeof(*drh));
    if (NULL == drh) {
        return -1;
    }
    res = zeDriverGet(&nbdrivers, drh);
    if (res != ZE_RESULT_SUCCESS) {
        free(drh);
        return -1;
    }
    for (i = 0; i < nbdrivers && !found; i++) {
        uint32_t nbdevices, j;
        ze_device_handle_t *dvh;

        nbdevices = 0;
        res = zeDeviceGet(drh[i], &nbdevices, NULL);
        if (res != ZE_RESULT_SUCCESS || 0 == nbdevices) {
            continue;
        }
        dvh = malloc(nbdevices * sizeof(*dvh));
        if (NULL == dvh) {
            continue;
        }
        res = zeDeviceGet(drh[i], &nbdevices, dvh);
        if (res != ZE_RESULT_SUCCESS) {
            free(dvh);
            continue;
        }
        for (j = 0; j < nbdevices; j++) {
            uint32_t nr_subdevices = 0;
            res = zeDeviceGetSubDevices(dvh[j], &nr_subdevices, NULL);
            /* returns ZE_RESULT_ERROR_INVALID_ARGUMENT if there are no subdevices */
            if (res != ZE_RESULT_ERROR_INVALID_ARGUMENT || nr_subdevices > 0) {
                /* subdevices are present - error out with a message
                 * if the hwloc version is less than 2.7.2 */
                found = 1;
                break;
            }
        }
        free(dvh);
    }
    free(drh);
    return found;
}
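Since a standalone C program was requested for testing, a minimal driver might look like the following; the file name and the -lze_loader link flag are assumptions based on the usual Level Zero loader packaging.

/* compile with, e.g.: cc pv_check.c -o pv_check -lze_loader */
int main(void)
{
    int rc = check_for_subdevices();
    if (rc > 0) {
        printf("Level Zero subdevices detected: hwloc older than 2.7.2 will likely segfault\n");
    } else if (0 == rc) {
        printf("no Level Zero subdevices detected\n");
    } else {
        printf("error while querying Level Zero\n");
    }
    return 0;
}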
My reply from 5 days ago is what I got from Intel, plus ideas from me. Your code above might be good. I don't have PVC access to test it, but I have a way to simulate L0 on a couple of different Intel GPU servers if you provide a standalone C program.
I started looking at what it would take to enable this check, and I'm beginning to question if it is worth it. If someone wants to run on a machine that falls into this trap, then they are going to have to use an appropriate HWLOC version. I suppose it is a little nicer if we could warn them of the HWLOC issue instead of just segfaulting, but it would necessitate adding a LevelZero dependency to PMIx that it otherwise doesn't need (at least, so far - and I have no knowledge of any thinking to change that). So I'm wondering if we should just add this to an FAQ somewhere and call it a day? If the user didn't configure PMIx --with-level-zero, then I couldn't warn them of the problem anyway - which feels like a gaping hole in the logic.
I'd be okay with this solution - documenting somewhere. Maybe someone from Intel might have a different opinion, but my "job" here is just to get Open MPI working on Aurora, and they have a hwloc 2.8.0 that works fine.
We discussed this a bit on the PMIx biweekly telecon today. As one participant noted, there are a number of codes that use HWLOC (including the RM) that would also be breaking and quite likely failing prior to PMIx. So there really is no useful purpose served by trying to have PMIx provide a warning.
Yeah, we really don't want to add anything complicated here -- I was hoping for a simple "if
Per discussion on the 10 Jan 2023 webex, this has likely turned into a documentation issue. It should be noted in the v5.0.x docs somewhere. Per #11290, we can at least disregard the effects of Open MPI's internal hwloc being too old for v6.x.
The hwloc 2.7.1 embedded in main and v5.0.x is too old for Intel processors with Ponte Vecchio accelerators. An lstopo built against this or an older version of hwloc segfaults when run on such processors, and prterun hits a similar segfault when run using the embedded hwloc 2.7.1.
The solution is to build against hwloc 2.8 or newer, e.g., an external installation passed to configure via --with-hwloc.