embedded hwloc 2.7.1 is too old for Intel processors with Ponte Vecchio accelerators #11246
Comments
This should also be noted in the docs and/or in the configury test for the minimum hwloc version.
Isn't that going to be a significant issue for the distros? I believe their defaults are quite a bit older, aren't they?
It might be hard to do a configury check for this at the moment. Front-end nodes will likely be vanilla Intel Xe; only the back-end nodes will have the Ponte Vecchio accelerators that seem to cause the issue. Where in the docs would you recommend writing a blurb about this?
Ah, so you want to do a runtime check of the version? I guess we can do that simply enough - it would have to be in PMIx so we can cover both mpirun and direct-launch modes. Would you mind opening an issue over there so we don't forget? @jsquyres Is this going to be an issue regarding default hwloc versions on distros (I'm thinking of Amazon here, so @bwbarrett)? I don't know of any other solution, frankly, though I wonder if it wouldn't segfault if we asked it not to include those devices. We can do that if we pass the appropriate flags - @hppritcha, has that been tried?
I guess a runtime check in PMIx would be nice, although the segfault was happening inside prte, I think. Anyway, an easy workaround at ANL is to use the hwloc they installed via a Spack build.
@hppritcha Ah, you edited the description of this PR -- I think I understand better now: it's a run-time error with older hwloc on Intel with PV accelerators. In this case, is it easy to add a run-time check to see a) if we're on a machine with Intel PV accelerators, and b) the version of hwloc? If we can detect both of these things at run time, then we should probably show_help() a warning that advises the user of the issue and that they might need to re-build Open MPI with an hwloc >= v2.8.
Pretty sure there is a "get_version" function in hwloc - will fiddle with it in PMIx as that is the base layer that provides the topology.
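For reference, a minimal sketch of that kind of runtime check using hwloc's public hwloc_get_api_version(). One caveat: the returned API version encodes the major and minor numbers but not the patch level, so it can test for 2.8-or-newer but cannot distinguish 2.7.1 from 2.7.2.

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    /* API version of the hwloc library actually linked at run time;
     * 0x00020800 corresponds to the hwloc 2.8 API. */
    unsigned version = hwloc_get_api_version();

    if (version < 0x00020800) {
        fprintf(stderr,
                "warning: hwloc %u.%u is older than 2.8 and may segfault "
                "on nodes with Ponte Vecchio accelerators\n",
                version >> 16, (version >> 8) & 0xff);
        return 1;
    }
    return 0;
}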
K. @hppritcha is there an easy way to tell that we're on a platform with Intel PV accelerators? Perhaps the presence of some file in
Good point - we don't want to blanket block hwloc versions less than 2.8 for all platforms.
@rhc54 says he knows a way - see openpmix/openpmix#2893
Let's ask @bgoglin - is there a way for us to detect that Intel PV accelerators are present on a system prior to using HWLOC? Please see above discussion as to why that is an important question. Any guidance would be appreciated!
I don't have access to a PV server, I am going to ask. I'd assume we'd need to look for specific PCI vendor:device IDs.
PCI vendor:device is indeed a good way to identify PV devices, and the list at https://pci-ids.ucw.cz/read/PC/8086 looks correct when it reports device ID = 0x0db[05-9ab] for Ponte Vecchio.

However, the hwloc bug is related to devices with multiple "levelzero subdevices", not to PV specifically, and there are some non-PV devices with multiple subdevices. But those are likely rare for now (discrete GPUs), at least when used in HPC. One solution would be to query Level Zero subdevices instead of looking for PCI IDs. I am not aware of any way to query subdevices from sysfs or anything outside the L0 library.

By the way, the hwloc fix (762fdc4bfb8fc33b304511149a532a122ade395f) is basically the only change in 2.7.x after 2.7.1. I can release a 2.7.2 if needed.
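For illustration, a standalone sketch of the PCI-ID approach on Linux. It assumes the standard sysfs layout, and it reads the device-ID range from the pci-ids list quoted above (0x0db0 and 0x0db5-0x0dbb, interpreting "0x0db[05-9ab]"); that list may not be exhaustive.

#include <dirent.h>
#include <stdio.h>

/* Read a small sysfs attribute such as "0x0db5\n" as a hex value. */
static int read_hex(const char *path, unsigned *val)
{
    FILE *f = fopen(path, "r");
    int rc;
    if (NULL == f) return -1;
    rc = fscanf(f, "%x", val);
    fclose(f);
    return (1 == rc) ? 0 : -1;
}

int main(void)
{
    struct dirent *de;
    DIR *dir = opendir("/sys/bus/pci/devices");
    if (NULL == dir) return 0;
    while (NULL != (de = readdir(dir))) {
        char path[512];
        unsigned vendor, device, nibble;
        if ('.' == de->d_name[0]) continue;
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/vendor", de->d_name);
        if (read_hex(path, &vendor) != 0 || vendor != 0x8086) continue;
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/device", de->d_name);
        if (read_hex(path, &device) != 0) continue;
        /* Ponte Vecchio IDs per the list above: 0x0db0, 0x0db5-0x0db9, 0x0dba, 0x0dbb */
        nibble = device & 0xf;
        if ((device & 0xfff0) == 0x0db0 && (0 == nibble || (nibble >= 5 && nibble <= 0xb))) {
            printf("Ponte Vecchio device at %s (8086:%04x)\n", de->d_name, device);
        }
    }
    closedir(dir);
    return 0;
}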
Hmmm...that sounds like we need to simply block everything below 2.7.2? It's an ugly "fix", but I cannot think of any other way to protect against the segfault - can you? Adding an L0 dependency is possible, I suppose, but given the history we've had with that library, not something I'm wild about doing. If someone wants to contribute it (it would go in the PMIx pgpu/intel component), I'm willing to look at it. Otherwise, the HWLOC cutoff is the only solution I can see.
I released hwloc 2.7.2 yesterday.
Thanks @bgoglin! Have you received any response to the question regarding how we detect that PV is present prior to invoking HWLOC? Requiring hwloc 2.7.2 and above seems pretty onerous - is working through L0 really the only way to reliably do it? If so, what are the chances of someone providing that code?
Borrowing liberally from the hwloc code in 2.8.0, what if we did something like this:

#include <stdio.h>
#include <stdlib.h>
#include <level_zero/ze_api.h>

/* Returns 1 if any Level Zero device reports subdevices (the case that
 * trips the hwloc < 2.7.2 bug), 0 if none are present, -1 on error. */
static int check_for_subdevices(void)
{
    ze_result_t res;
    ze_driver_handle_t *drh;
    uint32_t nbdrivers, i;
    int found = 0;

    res = zeInit(0);
    if (res != ZE_RESULT_SUCCESS) {
        return 0; /* no usable Level Zero runtime - we are okay */
    }
    nbdrivers = 0;
    res = zeDriverGet(&nbdrivers, NULL);
    if (res != ZE_RESULT_SUCCESS || 0 == nbdrivers) {
        return 0; /* no drivers - we are okay */
    }
    drh = malloc(nbdrivers * sizeof(*drh));
    if (NULL == drh) {
        return -1;
    }
    res = zeDriverGet(&nbdrivers, drh);
    if (res != ZE_RESULT_SUCCESS) {
        free(drh);
        return -1;
    }
    for (i = 0; i < nbdrivers && !found; i++) {
        uint32_t nbdevices, j;
        ze_device_handle_t *dvh;

        nbdevices = 0;
        res = zeDeviceGet(drh[i], &nbdevices, NULL);
        if (res != ZE_RESULT_SUCCESS || 0 == nbdevices) {
            continue;
        }
        dvh = malloc(nbdevices * sizeof(*dvh));
        if (NULL == dvh) {
            continue;
        }
        res = zeDeviceGet(drh[i], &nbdevices, dvh);
        if (res != ZE_RESULT_SUCCESS) {
            free(dvh);
            continue;
        }
        for (j = 0; j < nbdevices; j++) {
            uint32_t nr_subdevices = 0;
            res = zeDeviceGetSubDevices(dvh[j], &nr_subdevices, NULL);
            /* returns ZE_RESULT_ERROR_INVALID_ARGUMENT if there are no subdevices */
            if (res != ZE_RESULT_ERROR_INVALID_ARGUMENT || nr_subdevices > 0) {
                /* subdevices are present - error out with a message
                 * if the hwloc version is less than 2.7.2 */
                found = 1;
                break;
            }
        }
        free(dvh);
    }
    free(drh);
    return found;
}
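Since a standalone C program was requested for testing, a minimal driver might look like the following; the file name and the -lze_loader link flag are assumptions based on the usual Level Zero loader packaging.

/* compile with, e.g.: cc pv_check.c -o pv_check -lze_loader */
int main(void)
{
    int rc = check_for_subdevices();
    if (rc > 0) {
        printf("Level Zero subdevices detected: hwloc older than 2.7.2 will likely segfault\n");
    } else if (0 == rc) {
        printf("no Level Zero subdevices detected\n");
    } else {
        printf("error while querying Level Zero\n");
    }
    return 0;
}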
My reply from 5 days ago is what I got from Intel, plus ideas from me. Your code above might be good. I don't have PVC access to test it, but I have a way to simulate L0 on a couple of different Intel GPU servers if you provide a standalone C program.
I started looking at what it would take to enable this check, and I'm beginning to question if it is worth it. If someone wants to run on a machine that falls into this trap, then they are going to have to use an appropriate HWLOC version. I suppose it is a little nicer if we could warn them of the HWLOC issue instead of just segfaulting, but it would necessitate adding a LevelZero dependency to PMIx that it otherwise doesn't need (at least, so far - and I have no knowledge of any thinking to change that). So I'm wondering if we should just add this to an FAQ somewhere and call it a day? If the user didn't configure PMIx --with-level-zero, then I couldn't warn them of the problem anyway - which feels like a gaping hole in the logic.
I'd be okay with this solution - documenting somewhere. Maybe someone from Intel might have a different opinion, but my "job" here is just to get Open MPI working on Aurora, and they have a hwloc 2.8.0 that works fine.
We discussed this a bit on the PMIx biweekly telecon today. As one participant noted, there are a number of codes that use HWLOC (including the RM) that would also be breaking and quite likely failing prior to PMIx. So there really is no useful purpose served by trying to have PMIx provide a warning.
Yeah, we really don't want to add anything complicated here -- I was hoping for a simple "if
Per discussion on the 10 Jan 2023 webex, this has likely turned into a documentation issue. It should be noted in the v5.0.x docs somewhere. Per #11290, we can at least disregard the effects of Open MPI's internal hwloc being too old for v6.x.
The hwloc 2.7.1 embedded in main and v5.0.x is too old for Intel processors with Ponte Vecchio accelerators. An lstopo built against this or an older version of hwloc segfaults when run on such processors, and prterun hits a similar segfault when run using the embedded hwloc 2.7.1.
The solution is to build against hwloc 2.8 or newer, e.g., an external installation passed to configure via --with-hwloc.