
NVIDIA MultiGPU support for SYCL #6749

Closed
ManjulaChalla opened this issue Sep 9, 2022 · 18 comments · Fixed by #8303 or oneapi-src/unified-runtime#1669
Labels
bug (Something isn't working) · cuda (CUDA back-end) · runtime (Runtime library related issue)

Comments

@ManjulaChalla

Hi,
I migrated CUDA code to SYCL. The CUDA code works fine in an NVIDIA multi-GPU environment, but the migrated SYCL code results in a segmentation fault on NVIDIA multi-GPU systems (2 or 4 GPUs). Both the CUDA and SYCL code work in a single-GPU environment.

SYCL code on NVIDIA GPU (single):
Montecarlo-single GPU

Steps to reproduce:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda MonteCarlo_kernel.cpp MonteCarloMultiGPU.cpp MonteCarlo_reduction.hpp MonteCarlo_gold.cpp multithreading.cpp

The above command was used to compile the SYCL code for NVIDIA GPUs.

SYCL code on NVIDIA Multi GPU:
montecarlo-issue

I validated the SYCL code in an Intel multi-GPU environment, and it works fine there.

Is there any limitation for SYCL on NVIDIA multi-GPU hardware?

@ManjulaChalla ManjulaChalla added the bug Something isn't working label Sep 9, 2022
@steffenlarsen steffenlarsen added the cuda CUDA back-end label Sep 13, 2022
@AerialMantis AerialMantis added the runtime Runtime library related issue label Sep 14, 2022
@zjin-lcf
Contributor

zjin-lcf commented Oct 3, 2022

Please try DPCT when the reproducer is available.

https://github.com/oneapi-src/SYCLomatic/

@JackAKirk
Contributor

JackAKirk commented Nov 8, 2022

Hi,

Without a reproducer I can't comment on your specific use case, but I'll list the ways you can currently use multiple GPUs with the DPC++ CUDA backend. Note that there is no practical limitation for SYCL on NVIDIA multi-GPU hardware, and we expect full support in the future. However, right now the DPC++ CUDA backend does not support direct GPU peer-to-peer capabilities. We are aiming to add multi-GPU peer-to-peer access and copy features as soon as possible; this is a priority for us.
Despite this, I want to emphasize that there are no limitations on which devices you can copy to/from: at the moment the copies simply go via the host rather than directly between peer devices.

Note that if you use MPI with DPC++ for CUDA, you can already access essentially all the multi-GPU features.
Here is the current status and our recommendations:

  1. (Recommended, but not yet well documented) Use MPI. The DPC++ CUDA backend has now been fairly well tested with the two main MPI implementations (MPICH and Open MPI) and mostly works out of the box as you would expect. We will be providing complete documentation for using MPI in the 2023.0 release. The advantage of using MPI is that you can take advantage of direct GPU P2P memory copies right now; there are ways of copying memory across devices within the DPC++ CUDA backend runtime, but those copies always go via the host and are slower than direct P2P. If you would like more information on MPI usage, please let me know.

  2. Via the runtime
    2a. With SYCL USM, using sycl::queue::memcpy or sycl::queue::copy to explicitly copy between devices:

int* arr1 = sycl::malloc<int>(N, Queues[0], sycl::usm::alloc::device);
int* arr2 = sycl::malloc<int>(N, Queues[1], sycl::usm::alloc::device);
// Copy device usm allocated with different devices
Queues[0].copy(arr2, arr1, N).wait();

2b. With buffers relying on the runtime to manage the copies as in this test: https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Basic/buffer/buffer_dev_to_dev.cpp
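As a rough illustration of the buffer approach, here is a minimal sketch (my own, not from the linked test; it assumes q0 and q1 are queues on two different CUDA devices, constructed as shown further below). The runtime migrates the buffer between the devices, currently via the host:

#include <sycl/sycl.hpp>
#include <vector>

void buffer_dev_to_dev(sycl::queue &q0, sycl::queue &q1, size_t N) {
  std::vector<int> host(N, 1);
  {
    sycl::buffer<int> buf(host.data(), sycl::range<1>(N));
    // First kernel runs on the device behind q0.
    q0.submit([&](sycl::handler &cgh) {
      auto acc = buf.get_access<sycl::access_mode::read_write>(cgh);
      cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { acc[i] += 1; });
    });
    // Second kernel runs on the device behind q1; the runtime moves the data.
    q1.submit([&](sycl::handler &cgh) {
      auto acc = buf.get_access<sycl::access_mode::read_write>(cgh);
      cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { acc[i] *= 2; });
    });
  } // Buffer destruction writes the final data back to `host`.
}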

In all these cases the trick is to get a list of all CUDA devices that are available in your system (although if you use MPI there are ways to select the GPUs and map them to your MPI ranks at runtime, after compilation). Currently we have a temporary situation where each CUDA device is listed within its own platform; this was done to conform to a SYCL specification constraint on the default_selector behaviour. Eventually, all CUDA devices should be listed in a single platform and you will be able to simply select from the list of devices returned by sycl::platform::get_devices(). Fixing this is another priority.
For now you can do:

std::vector<sycl::device> Devs;
// Look through all platforms...
for (const auto &plt : sycl::platform::get_platforms()) {
  // ...and if it is a CUDA "platform", add its device to the list.
  if (plt.get_backend() == sycl::backend::cuda)
    Devs.push_back(plt.get_devices()[0]);
}
// Create queues as desired.
auto q0 = sycl::queue{Devs[0]}; // corresponds to the lowest CUDA ID available (usually 0)
auto q1 = sycl::queue{Devs[1]}; // corresponds to the second lowest CUDA ID available (usually 1)
// ... etc

Let us know if you want more information at this stage.
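Putting the pieces above together, a minimal end-to-end sketch might look like this (my own illustration, not part of the original comment; it assumes at least two CUDA devices are visible, and that device-to-device copies are currently staged via the host):

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  // Collect one device from each CUDA "platform" (see the loop above).
  std::vector<sycl::device> Devs;
  for (const auto &plt : sycl::platform::get_platforms())
    if (plt.get_backend() == sycl::backend::cuda)
      Devs.push_back(plt.get_devices()[0]);
  if (Devs.size() < 2) {
    std::cout << "Need at least two CUDA devices\n";
    return 0;
  }

  sycl::queue q0{Devs[0]}, q1{Devs[1]};
  constexpr size_t N = 1024;

  int *arr1 = sycl::malloc<int>(N, q0, sycl::usm::alloc::device);
  int *arr2 = sycl::malloc<int>(N, q1, sycl::usm::alloc::device);

  // Initialise arr2 on the second device, then copy it to arr1 on the first.
  q1.fill(arr2, 42, N).wait();
  q0.copy(arr2, arr1, N).wait(); // device-to-device, currently staged via host

  // Read the result back from the first device and check it.
  std::vector<int> host(N);
  q0.copy(arr1, host.data(), N).wait();
  std::cout << "host[0] = " << host[0] << "\n";

  sycl::free(arr1, q0);
  sycl::free(arr2, q1);
  return 0;
}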

@zjin-lcf
Contributor

// Copy device usm allocated with different devices
Queues[0].copy(arr2, arr1, N).wait();

If the source (arr1) of the copy is located at Device 0, then the queue is Queues[0]. Is that right ?

@JackAKirk
Contributor

// Copy device usm allocated with different devices
Queues[0].copy(arr2, arr1, N).wait();

If the source (arr1) of the copy is located at Device 0, then the queue is Queues[0]. Is that right ?

It actually doesn't matter: the implementation allows any queue to copy any USM pointers.

It works just like cudaMemcpyAsync or cuMemcpyAsync: Queues[0].copy(arr2, arr1, N) will simply call cuMemcpyAsync: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g5f26aaf5582ade791e5688727a178d78.
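For comparison, the CUDA analogue would look roughly like this (an illustrative sketch, not from the original comment; arr1, arr2, N and stream are assumed to exist):

#include <cuda_runtime.h>

// cudaMemcpyAsync accepts any stream and copies between pointers on any pair
// of devices, just as any queue can be used for the SYCL copy above.
void cuda_analogue(int *arr1, int *arr2, size_t N, cudaStream_t stream) {
  cudaMemcpyAsync(arr1, arr2, N * sizeof(int), cudaMemcpyDeviceToDevice, stream);
  cudaStreamSynchronize(stream);
}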

@zjin-lcf
Contributor

zjin-lcf commented Nov 28, 2022

I migrated the CUDA simpleP2P example to SYCL. Running the program shows "illegal memory accesses" when the first kernel is executed on a device.
Please see the reproducer: https://github.com/zjin-lcf/HeCBench/blob/master/p2p-sycl/main.cpp
The CUDA program is https://github.com/zjin-lcf/HeCBench/blob/master/p2p-cuda/main.cu
I am not sure if the SYCL program is written correctly. Thanks.

@JackAKirk
Contributor

JackAKirk commented Nov 28, 2022

I expect it is due to the fact that you are trying to access memory that is on one device from a queue that uses a different device here:

void SimpleKernel(nd_item<1> &item, const float *src, float *dst)
{
  // Just a dummy kernel, doing enough for us to verify that everything
  // worked
  const int idx = item.get_global_id(0);
  dst[idx] = src[idx] * 2.0f;
}
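For illustration only (a hypothetical sketch with assumed names q0, q1 and N, not taken from the reproducer), the failing pattern amounts to submitting the kernel on one device's queue with a pointer that was allocated on the other device:

// Hypothetical sketch: src lives on device 1, but the kernel is submitted to
// device 0's queue. Without peer access enabled, dereferencing src inside the
// kernel is an illegal memory access.
float *src = sycl::malloc_device<float>(N, q1); // allocated on device 1
float *dst = sycl::malloc_device<float>(N, q0); // allocated on device 0
q0.parallel_for(sycl::nd_range<1>{sycl::range<1>(N), sycl::range<1>(256)},
                [=](sycl::nd_item<1> item) { SimpleKernel(item, src, dst); });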

To do this we would need the finished extension of ext_oneapi_enable_peer_access to enable peer memory access across devices: #6104.
If you examine what the CUDA driver is doing while executing the memcpy commands in this program (if you make it compile by removing the offending kernel), you will see that a peer-to-peer copy is achieved via the host. However, you still cannot do the direct peer access that the offending kernel attempts.

This is very easy to implement for a given backend: I actually implemented it for the CUDA backend here: JackAKirk@3b36bc4

The challenge, as ever, is to make a DPC++ runtime / P2P extension that is appropriate in general, or at least for the CUDA, HIP, and Level Zero backends. This requires a concerted effort to solve some questions:

  1. What is sycl::context for?
  2. Should we map CUcontext to sycl::device instead of sycl::context? (Answer: probably)

@zjin-lcf
Contributor

I can see the SYCL/CUDA bandwidth with the offending kernels commented out. I will wait for the pull request that supports the P2P example. I know little about contexts, and hope users don't need to know about them.
If CUDA combines 'context' and 'device', could SYCL combine them too?

Thanks

@keryell
Contributor

keryell commented Nov 30, 2022

If CUDA combines 'context' and 'device', could SYCL combine them too?

SYCL can be higher level and represent things slightly differently from the back-ends. It is necessary for portability anyway since not all the back-ends have the same abstractions.

@zjin-lcf
Contributor

(Quoting @JackAKirk's earlier reply above in full.)

I realize that the example may be from https://github.com/NVIDIA/cuda-samples/blob/master/Samples/5_Domain_Specific/MonteCarloMultiGPU/
@ManjulaChalla may confirm this.

dm-vodopyanov pushed a commit that referenced this issue Jul 10, 2023
This implements the current extension doc from
#6104 in the CUDA backend only.

Fixes #7543.
Fixes #6749.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com>
Co-authored-by: JackAKirk <chezjakirk@gmail.com>
Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>
(The same commit was later mirrored into veselypeta/llvm, fabiomestre/llvm, and several unified-runtime forks, including oneapi-src/unified-runtime.)
@BenBrock

@JackAKirk What's the current status of P2P on Nvidia? It looks to me like on an H100 system my devices are in the same SYCL platform and P2P works, but with V100 and A100 they are all in different SYCL platforms. (And thus no P2P.)

@JackAKirk
Contributor

@JackAKirk What's the current status of P2P on Nvidia? It looks to me like on an H100 system my devices are in the same SYCL platform and P2P works, but with V100 and A100 they are all in different SYCL platforms. (And thus no P2P.)

P2P has been tested on A100 and should work on any NVIDIA devices supporting PCIe/NVLink. It is true that the NVIDIA backend still has the hack where each device is placed in a separate platform in order to comply with the default context extension. However, this doesn't affect P2P at all. It just means that in the unusual circumstance that you have a system with GPUs from different vendors (e.g. AMD and NVIDIA), you have to be careful about selecting your devices: you can't just get the NVIDIA GPUs by requiring that your device selector is a GPU selector; you have to populate a device list as is done in these tests: https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/USM/P2P
If you only have NVIDIA GPUs in your system, you can more easily get the list of NVIDIA devices by requesting devices with the GPU property.
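For reference, a minimal sketch of using the peer-access extension (based on the linked tests; the helper name and parameters below are my own assumptions, and the exact API may evolve):

#include <sycl/sycl.hpp>

// Enable P2P between two NVIDIA devices and copy device USM directly.
void p2p_copy(sycl::device dev0, sycl::device dev1, size_t N) {
  sycl::queue q0{dev0}, q1{dev1};
  if (!dev0.ext_oneapi_can_access_peer(
          dev1, sycl::ext::oneapi::peer_access::access_supported))
    return;
  // Kernels and copies submitted for dev0 may now touch dev1's memory.
  dev0.ext_oneapi_enable_peer_access(dev1);

  int *src = sycl::malloc_device<int>(N, q1); // allocated on device 1
  int *dst = sycl::malloc_device<int>(N, q0); // allocated on device 0
  q1.fill(src, 7, N).wait();
  q0.copy(src, dst, N).wait(); // can now go directly over PCIe/NVLink

  dev0.ext_oneapi_disable_peer_access(dev1);
  sycl::free(src, q1);
  sycl::free(dst, q0);
}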

@BenBrock

Is there a timeline for when this will be updated to put all the Nvidia devices in a single context? In the multi-GPU code I'm working on, we rely on some features like sycl::get_pointer_type and sycl::get_pointer_device that presumably won't work if given the context for the wrong GPU.

I think it would suffice if the user could at least manually create a single context with all the Nvidia devices.

Also, if you don't mind, could you explain how the default context extension is involved here? I'm not familiar.

@JackAKirk
Contributor

JackAKirk commented Feb 27, 2024

Is there a timeline for when this will be updated to put all the Nvidia devices in a single context?

This work was ready but got side-tracked for merge due to higher priorities. The PR for it now needs to be updated: #10737
@hdelan do you know what the timeline for this is?

In the multi-GPU code I'm working on, we rely on some features like sycl::get_pointer_type and sycl::get_pointer_device that presumably won't work if given the context for the wrong GPU.

You should be aware that

  1. sycl::context has no analogue in CUDA; a CUDA CUcontext exists but has a completely different semantic meaning.
  2. Generally it is a good idea to avoid dealing with sycl::context directly at all, now that there is the default context extension that puts all devices in a platform into the same sycl::context by default. The default context is used automatically, provided that your queues are constructed with it (which happens by default if you don't pass the queue a sycl::context). Dealing with sycl::context is almost always not required in any backend:

sycl::context came from OpenCL's context, which is used in OpenCL for two things:

  • controlling the device visibility of OpenCL buffers
  • compiling different kernel_bundles for different devices based on which "context" they are in

We can simplify the discussion if I assume you are only using USM, but even for sycl::buffer I don't think there is a good reason to worry about touching sycl::context in code. Unless you are compiling different kernel_bundles for different devices, there is no reason for you to think about sycl::context at all. Most SYCL interfaces (like allocating USM memory) have an overload that takes a sycl::queue instead of a sycl::context, and for the ones that don't (like sycl::get_pointer_type, which you mentioned) you can just call queue::get_context(), which returns the default context containing all devices in the platform, provided you did not pass that queue a sycl::context. Basically, I think that if you ever have to type sycl::context this is a mistake, and it will lead to problems/confusion.
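As a small sketch of that advice (my own example; note the caveat below that sycl::get_pointer_type may or may not work in the CUDA backend):

#include <sycl/sycl.hpp>

int main() {
  // No sycl::context is ever spelled out: the queue's default context is used.
  sycl::queue q{sycl::gpu_selector_v};
  int *ptr = sycl::malloc_device<int>(1024, q);

  sycl::usm::alloc kind = sycl::get_pointer_type(ptr, q.get_context());
  sycl::device dev = sycl::get_pointer_device(ptr, q.get_context());

  sycl::free(ptr, q);
  return (kind == sycl::usm::alloc::device && dev == q.get_device()) ? 0 : 1;
}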

I think it would suffice if the user could at least manually create a single context with all the Nvidia devices.

It doesn't matter in the DPC++ CUDA backend if NVIDIA devices have different sycl::contexts. Once #10737 or similar is merged they will share one anyway, but even then, just don't worry about sycl::context: use queue::get_context() if you have to use sycl::get_pointer_type (although I'm not sure whether sycl::get_pointer_type works in the CUDA backend), and get a list of all the NVIDIA GPUs using a GPU device selector.

Also, if you don't mind, could you explain how the default context extension is involved here? I'm not familiar.

The reason that each CUDA device is currently in its own platform is that this was the only way to be compliant with the default (SYCL) context extension without extensive changes to the CUDA backend: the default sycl::context of a platform has to contain all devices in that platform. If a platform contains more than one device, this is troublesome for the CUDA backend, since it was originally written to map a sycl::context to a CUdevice (and CUcontext). Changing this was a lot of work, and in order not to break the runtime the decision was made at that time to put each CUDA device in a separate platform.

Again, I should stress that unless you have a system containing GPUs from multiple vendors (like AMD and Intel or NVIDIA on the same node), you can safely use a device selector to get all the devices.
And irrespective of this, just never type sycl::context (use queue constructors instead), never even think about sycl::context except where interfaces require it, and then only obtain a sycl::context via queue::get_context().

@hdelan
Contributor

hdelan commented Feb 27, 2024

Is there a timeline for when this will be updated to put all the Nvidia devices in a single context?

This work was ready but got side-tracked for merge due to higher priorities. The PR for it now needs to be updated: #10737 @hdelan do you know what the timeline for this is?

I will be back working on this in a few weeks, so I may have a patch up in UR in a month or so.

@BenBrock

Thanks for the advice, @JackAKirk, I will do a bit of refactoring and see if we can get our code working on the current Nvidia backend. I can switch to allocating and deallocating memory with queues, which should be no problem. I do think ultimately we may need all the devices in the same context for things to work.

We wrote this library before the default context extension was introduced, and at the time ran into multiple bugs if we did not create and maintain a global context. That's the reason we have a context. I'm all in favor of having the library take care of the context for me, but I do want to make sure I'm writing a valid SYCL program.

My primary concern at this point is about get_pointer_type and get_pointer_device---we don't know the device when calling them, which means that we also don't know the context unless we maintain one global context. I will check if calling them with a random Nvidia context works, even though that will technically be violating the SYCL API.

@BenBrock

It appears I almost have things working. I do have to implement a bit of a hack in order for get_pointer_device and get_pointer_type to work with the CUDA backend:

template <std::contiguous_iterator Iter>
sycl::device get_pointer_device(Iter iter) {
  // Try each device's (default) context in turn until the lookup succeeds.
  for (auto &&device : shp::devices()) {
    try {
      return sycl::get_pointer_device(std::to_address(iter),
                                      __detail::queue(device).get_context());
    } catch (...) {
      // Wrong context for this pointer; try the next device.
    }
  }
  assert(false && "pointer does not belong to any known device");
  return sycl::device{};
}

Essentially, I call get_pointer_device with the context of each device in turn until it succeeds. This is obviously suboptimal, but it works.

I am, however, running into another asynchronous error when combining events that appears to be unrelated. I see the following asynchronous errors thrown:

	Native API failed. Native API returns: -34 (PI_ERROR_INVALID_CONTEXT) -34 (PI_ERROR_INVALID_CONTEXT)
	Couldn't wait for host-task's dependencies -34 (PI_ERROR_INVALID_CONTEXT)

These asynchronous errors are thrown when combining events. To combine events, I submit a command group in which I call depends_on with a vector of events and add an empty host task. This command group is submitted on device 0's queue.

inline sycl::event combine_events(const std::vector<sycl::event> &events) {
  auto &&q = __detail::queue(0);
  auto e = q.submit([&](auto &&h) {
    h.depends_on(events);
    h.host_task([] {});
  });

  return e;
}

The events in events are not all launched in the context associated with q, but that shouldn't be an issue, right? Looking over the SYCL spec I see no requirement that the events be associated with the same context as the queue.

I can create my own asynchronous error handler that ignores SYCL exceptions with code 34, and things seem to work, at least for some simple examples, but it seems like I should be able to avoid this error.
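For what it's worth, the workaround described above might look roughly like this (a hypothetical sketch; whether -34 is the right value to compare against error_code::value() is an assumption):

#include <sycl/sycl.hpp>

// Async handler that swallows PI_ERROR_INVALID_CONTEXT (-34) and rethrows
// everything else, attached to the queue used for combining events.
inline auto lenient_handler = [](sycl::exception_list exceptions) {
  for (const std::exception_ptr &e : exceptions) {
    try {
      std::rethrow_exception(e);
    } catch (const sycl::exception &ex) {
      if (ex.code().value() != -34) // ignore only the invalid-context error
        throw;
    }
  }
};

inline sycl::queue make_lenient_queue(const sycl::device &dev) {
  return sycl::queue{dev, lenient_handler};
}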

@BenBrock

Just did a quick test, and combining events using a queue associated with the CPU throws the same error.

@JackAKirk
Contributor

JackAKirk commented Mar 1, 2024

Essentially, launch get_pointer_device with the context for each device until it succeeds. This is obviously suboptimal, but works.

I see the issue. Apologies for this; it is unfortunate that the spec imposes that the context must be passed here. Unfortunately, we will need the rather complicated multi-device context patch for CUDA to be updated for this, which we will do ASAP.

The events in events are not all launched in the context associated with q, but that shouldn't be an issue, right? Looking over the SYCL spec I see no requirement that the events be associated with the same context as the queue.

I can create my own asynchronous error handler that ignores SYCL exceptions with code 34, and things seem to work, at least for some simple examples, but it seems like I should be able to avoid this error.

Yeah, you're right, this is a bug. I've fixed it here: oneapi-src/unified-runtime#1403
If you are building DPC++ from intel/llvm directly, you can use this patch immediately by changing your UR CUDA adapter CMakeLists.txt (i.e. adapters/cuda/CMakeLists.txt) similarly to this commit:
JackAKirk@9c84333

But just replace the commit id with this one: oneapi-src/unified-runtime@2077bc6

Hopefully that should get you further, although since I can't see your complete code I can't say whether or not this will be the end of your troubles!
