
NVIDIA MultiGPU support for SYCL #6749

Closed
ManjulaChalla opened this issue Sep 9, 2022 · 18 comments · Fixed by #8303 or oneapi-src/unified-runtime#1669
Labels
bug (Something isn't working) · cuda (CUDA back-end) · runtime (Runtime library related issue)

Comments

@ManjulaChalla

Hi,
I migrated CUDA code to SYCL. The CUDA code works fine in an NVIDIA multi-GPU environment, but the migrated SYCL code results in a segmentation fault on NVIDIA multi-GPU systems (2 or 4 GPUs). Both the CUDA and SYCL code work in a single-GPU environment.

SYCL code on NVIDIA GPU (single):
Montecarlo-single GPU

Steps to reproduce:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda MonteCarlo_kernel.cpp MonteCarloMultiGPU.cpp MonteCarlo_reduction.hpp MonteCarlo_gold.cpp multithreading.cpp

The above command was used to compile the SYCL code for NVIDIA GPUs.

SYCL code on NVIDIA Multi GPU:
montecarlo-issue

I validated the SYCL code in an Intel multi-GPU environment, and it works fine there.

Is there any limitation for SYCL on NVIDIA multi-GPU hardware?

@ManjulaChalla ManjulaChalla added the bug Something isn't working label Sep 9, 2022
@steffenlarsen steffenlarsen added the cuda CUDA back-end label Sep 13, 2022
@AerialMantis AerialMantis added the runtime Runtime library related issue label Sep 14, 2022
@zjin-lcf
Contributor

zjin-lcf commented Oct 3, 2022

Please try DPCT when the reproducer is available.

https://github.com/oneapi-src/SYCLomatic/

@JackAKirk
Contributor

JackAKirk commented Nov 8, 2022

Hi,

Without a reproducer I can't comment on your specific use case, but I'll list the ways you can currently use multiple GPUs with the DPC++ CUDA backend. Note that there is no practical limitation for SYCL on NVIDIA multi-GPU hardware, and we expect full support in the future. However, right now the DPC++ CUDA backend does not support direct GPU peer-to-peer capabilities. We are aiming to add multi-GPU peer-to-peer access and copy features as soon as possible; this is a priority for us.
Despite this, I want to emphasize that there are no limitations on which devices you can copy to/from: at the moment the copies simply go via the host rather than directly between peer devices.

Note that if you use MPI with DPC++ for CUDA, you can already access essentially all the multi-GPU features.
Here is the current status and our recommendations:

  1. (Recommended, but not yet well documented) Use MPI. The DPC++ CUDA backend has now been fairly well tested with the two main MPI implementations (MPICH and Open MPI) and mostly works out of the box as you would expect. We will be providing complete documentation for using MPI in the 2023.0 release. The advantage of using MPI is that you can take advantage of direct GPU P2P memory copies right now; there are ways of copying memory across devices within the DPC++ CUDA backend runtime, but those copies always go via the host and are slower than direct P2P. If you would like more information on MPI usage, please let me know.

  2. Via the runtime
    2a. With SYCL USM, using sycl::queue::memcpy or sycl::queue::copy to explicitly copy between devices:

int* arr1 = sycl::malloc<int>(N, Queues[0], sycl::usm::alloc::device);
int* arr2 = sycl::malloc<int>(N, Queues[1], sycl::usm::alloc::device);
// Copy device usm allocated with different devices
Queues[0].copy(arr2, arr1, N).wait();

2b. With buffers relying on the runtime to manage the copies as in this test: https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Basic/buffer/buffer_dev_to_dev.cpp
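As a rough illustration of the buffer approach, here is a minimal sketch (my own, not from the linked test; it assumes q0 and q1 are queues on two different CUDA devices, constructed as shown further below). The runtime migrates the buffer between the devices, currently via the host:

#include <sycl/sycl.hpp>
#include <vector>

void buffer_dev_to_dev(sycl::queue &q0, sycl::queue &q1, size_t N) {
  std::vector<int> host(N, 1);
  {
    sycl::buffer<int> buf(host.data(), sycl::range<1>(N));
    // First kernel runs on the device behind q0.
    q0.submit([&](sycl::handler &cgh) {
      auto acc = buf.get_access<sycl::access_mode::read_write>(cgh);
      cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { acc[i] += 1; });
    });
    // Second kernel runs on the device behind q1; the runtime moves the data.
    q1.submit([&](sycl::handler &cgh) {
      auto acc = buf.get_access<sycl::access_mode::read_write>(cgh);
      cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { acc[i] *= 2; });
    });
  } // Buffer destruction writes the final data back to `host`.
}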

In all these cases the trick is to get a list of all CUDA devices that are available in your system (although if you use MPI there are ways to select the GPUs and map them to your MPI ranks at runtime, after compilation). Currently we have a temporary situation where each CUDA device is listed within its own platform; this was done to conform to a SYCL specification constraint on the default_selector behaviour. Eventually, all CUDA devices should be listed in a single platform and you will be able to simply select from the list of devices returned by sycl::platform::get_devices(). Fixing this is another priority.
For now you can do:

std::vector<sycl::device> Devs;
// Look through all platforms...
for (const auto &plt : sycl::platform::get_platforms()) {
  // ...and if it is a CUDA "platform", add its device to the list.
  if (plt.get_backend() == sycl::backend::cuda)
    Devs.push_back(plt.get_devices()[0]);
}
// Create queues as desired.
auto q0 = sycl::queue{Devs[0]}; // corresponds to the lowest CUDA ID available (usually 0)
auto q1 = sycl::queue{Devs[1]}; // corresponds to the second lowest CUDA ID available (usually 1)
// ... etc

Let us know if you want more information at this stage.
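Putting the pieces above together, a minimal end-to-end sketch might look like this (my own illustration, not part of the original comment; it assumes at least two CUDA devices are visible, and that device-to-device copies are currently staged via the host):

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  // Collect one device from each CUDA "platform" (see the loop above).
  std::vector<sycl::device> Devs;
  for (const auto &plt : sycl::platform::get_platforms())
    if (plt.get_backend() == sycl::backend::cuda)
      Devs.push_back(plt.get_devices()[0]);
  if (Devs.size() < 2) {
    std::cout << "Need at least two CUDA devices\n";
    return 0;
  }

  sycl::queue q0{Devs[0]}, q1{Devs[1]};
  constexpr size_t N = 1024;

  int *arr1 = sycl::malloc<int>(N, q0, sycl::usm::alloc::device);
  int *arr2 = sycl::malloc<int>(N, q1, sycl::usm::alloc::device);

  // Initialise arr2 on the second device, then copy it to arr1 on the first.
  q1.fill(arr2, 42, N).wait();
  q0.copy(arr2, arr1, N).wait(); // device-to-device, currently staged via host

  // Read the result back from the first device and check it.
  std::vector<int> host(N);
  q0.copy(arr1, host.data(), N).wait();
  std::cout << "host[0] = " << host[0] << "\n";

  sycl::free(arr1, q0);
  sycl::free(arr2, q1);
  return 0;
}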

@zjin-lcf
Contributor

// Copy device usm allocated with different devices
Queues[0].copy(arr2, arr1, N).wait();

If the source (arr1) of the copy is located at Device 0, then the queue is Queues[0]. Is that right ?

@JackAKirk
Contributor

// Copy device usm allocated with different devices
Queues[0].copy(arr2, arr1, N).wait();

If the source (arr1) of the copy is located at Device 0, then the queue is Queues[0]. Is that right ?

It actually doesn't matter: the implementation allows any queue to copy any USM pointers.

It works just like cudaMemcpyAsync or cuMemcpyAsync: Queues[0].copy(arr2, arr1, N) will simply call cuMemcpyAsync: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g5f26aaf5582ade791e5688727a178d78.
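For comparison, the CUDA analogue would look roughly like this (an illustrative sketch, not from the original comment; arr1, arr2, N and stream are assumed to exist):

#include <cuda_runtime.h>

// cudaMemcpyAsync accepts any stream and copies between pointers on any pair
// of devices, just as any queue can be used for the SYCL copy above.
void cuda_analogue(int *arr1, int *arr2, size_t N, cudaStream_t stream) {
  cudaMemcpyAsync(arr1, arr2, N * sizeof(int), cudaMemcpyDeviceToDevice, stream);
  cudaStreamSynchronize(stream);
}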

@zjin-lcf
Contributor

zjin-lcf commented Nov 28, 2022

I migrated the CUDA simpleP2P example to SYCL. Running the program shows "illegal memory accesses" when the first kernel is executed on a device.
Please see the reproducer: https://github.com/zjin-lcf/HeCBench/blob/master/p2p-sycl/main.cpp
The CUDA program is https://github.com/zjin-lcf/HeCBench/blob/master/p2p-cuda/main.cu
I am not sure if the SYCL program is written correctly. Thanks.

@JackAKirk
Contributor

JackAKirk commented Nov 28, 2022

I expect it is due to the fact that you are trying to access memory that is on one device from a queue that uses a different device here:

void SimpleKernel(nd_item<1> &item, const float *src, float *dst)
{
  // Just a dummy kernel, doing enough for us to verify that everything
  // worked
  const int idx = item.get_global_id(0);
  dst[idx] = src[idx] * 2.0f;
}
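For illustration only (a hypothetical sketch with assumed names q0, q1 and N, not taken from the reproducer), the failing pattern amounts to submitting the kernel on one device's queue with a pointer that was allocated on the other device:

// Hypothetical sketch: src lives on device 1, but the kernel is submitted to
// device 0's queue. Without peer access enabled, dereferencing src inside the
// kernel is an illegal memory access.
float *src = sycl::malloc_device<float>(N, q1); // allocated on device 1
float *dst = sycl::malloc_device<float>(N, q0); // allocated on device 0
q0.parallel_for(sycl::nd_range<1>{sycl::range<1>(N), sycl::range<1>(256)},
                [=](sycl::nd_item<1> item) { SimpleKernel(item, src, dst); });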

To do this we would need the finished extension of ext_oneapi_enable_peer_access to enable peer memory access across devices: #6104.
If you examine what the CUDA driver is doing while executing the memcpy commands in this program (if you make it compile by removing the offending kernel), you will see that a peer-to-peer copy is achieved via the host. However, you still cannot do the direct peer access that the offending kernel attempts.

This is very easy to implement for a given backend: I actually implemented it for the CUDA backend here: JackAKirk@3b36bc4

The challenge, as ever, is to make a DPC++ runtime / P2P extension that is appropriate in general, or at least for the CUDA, HIP, and Level Zero backends. This requires a concerted effort to solve some questions:

  1. What is sycl::context for?
  2. Should we map CUcontext to sycl::device instead of sycl::context? (Answer: probably)

@zjin-lcf
Contributor

I can see the SYCL/CUDA bandwidth with the offending kernels commented out. I will wait for the pull request that supports the P2P example. I know little about contexts, and hope users don't need to know about them.
If CUDA combines 'context' and 'device', could SYCL combine them too?

Thanks

@keryell
Contributor

keryell commented Nov 30, 2022

If CUDA combines 'context' and 'device', could SYCL combine them too?

SYCL can be higher level and represent things slightly differently from the back-ends. It is necessary for portability anyway since not all the back-ends have the same abstractions.

@zjin-lcf
Contributor

(Quoting @JackAKirk's earlier reply above in full.)

I realize that the example may be from https://github.com/NVIDIA/cuda-samples/blob/master/Samples/5_Domain_Specific/MonteCarloMultiGPU/
@ManjulaChalla may confirm this.

dm-vodopyanov pushed a commit that referenced this issue Jul 10, 2023
This implements the current extension doc from
#6104 in the CUDA backend only.

Fixes #7543.
Fixes #6749.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com>
Co-authored-by: JackAKirk <chezjakirk@gmail.com>
Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>
(The same commit was later mirrored into veselypeta/llvm, fabiomestre/llvm, and several unified-runtime forks, including oneapi-src/unified-runtime.)
@BenBrock

@JackAKirk What's the current status of P2P on Nvidia? It looks to me like on an H100 system my devices are in the same SYCL platform and P2P works, but with V100 and A100 they are all in different SYCL platforms. (And thus no P2P.)

@JackAKirk
Contributor

@JackAKirk What's the current status of P2P on Nvidia? It looks to me like on an H100 system my devices are in the same SYCL platform and P2P works, but with V100 and A100 they are all in different SYCL platforms. (And thus no P2P.)

P2P has been tested on A100 and should work on any NVIDIA devices supporting PCIe/NVLink. It is true that the NVIDIA backend still has the hack where each device is placed in a separate platform in order to comply with the default context extension. However, this doesn't affect P2P at all. It just means that in the unusual circumstance that you have a system with GPUs from different vendors (e.g. AMD and NVIDIA), you have to be careful about selecting your devices: you can't just get the NVIDIA GPUs by requiring that your device selector is a GPU selector; you have to populate a device list as is done in these tests: https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/USM/P2P
If you only have NVIDIA GPUs in your system, you can more easily get the list of NVIDIA devices by requesting devices with the GPU property.
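For reference, a minimal sketch of using the peer-access extension (based on the linked tests; the helper name and parameters below are my own assumptions, and the exact API may evolve):

#include <sycl/sycl.hpp>

// Enable P2P between two NVIDIA devices and copy device USM directly.
void p2p_copy(sycl::device dev0, sycl::device dev1, size_t N) {
  sycl::queue q0{dev0}, q1{dev1};
  if (!dev0.ext_oneapi_can_access_peer(
          dev1, sycl::ext::oneapi::peer_access::access_supported))
    return;
  // Kernels and copies submitted for dev0 may now touch dev1's memory.
  dev0.ext_oneapi_enable_peer_access(dev1);

  int *src = sycl::malloc_device<int>(N, q1); // allocated on device 1
  int *dst = sycl::malloc_device<int>(N, q0); // allocated on device 0
  q1.fill(src, 7, N).wait();
  q0.copy(src, dst, N).wait(); // can now go directly over PCIe/NVLink

  dev0.ext_oneapi_disable_peer_access(dev1);
  sycl::free(src, q1);
  sycl::free(dst, q0);
}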

@BenBrock

Is there a timeline for when this will be updated to put all the Nvidia devices in a single context? In the multi-GPU code I'm working on, we rely on some features like sycl::get_pointer_type and sycl::get_pointer_device that presumably won't work if given the context for the wrong GPU.

I think it would suffice if the user could at least manually create a single context with all the Nvidia devices.

Also, if you don't mind, could you explain how the default context extension is involved here? I'm not familiar.

@JackAKirk
Contributor

JackAKirk commented Feb 27, 2024

Is there a timeline for when this will be updated to put all the Nvidia devices in a single context?

This work was ready but got side-tracked for merge due to higher priorities. The PR for it now needs to be updated: #10737
@hdelan do you know what the timeline for this is?

In the multi-GPU code I'm working on, we rely on some features like sycl::get_pointer_type and sycl::get_pointer_device that presumably won't work if given the context for the wrong GPU.

You should be aware that

  1. sycl::context has no analogue in CUDA; a CUDA CUcontext exists but has a completely different semantic meaning.
  2. Generally it is a good idea to avoid dealing with sycl::context directly at all, now that there is the default context extension that puts all devices in a platform into the same sycl::context by default. The default context is used automatically, provided that your queues are constructed with it (which happens by default if you don't pass the queue a sycl::context). Dealing with sycl::context is almost always not required in any backend:

sycl::context came from OpenCL's context, which is used in OpenCL for two things:

  • controlling the device visibility of OpenCL buffers
  • compiling different kernel_bundles for different devices based on which "context" they are in

We can simplify the discussion if I assume you are only using USM, but even for sycl::buffer I don't think there is a good reason to worry about touching sycl::context in code. Unless you are compiling different kernel_bundles for different devices, there is no reason for you to think about sycl::context at all. Most SYCL interfaces (like allocating USM memory) have an overload that takes a sycl::queue instead of a sycl::context, and for the ones that don't (like sycl::get_pointer_type, which you mentioned) you can just call queue::get_context(), which returns the default context containing all devices in the platform, provided you did not pass that queue a sycl::context. Basically, I think that if you ever have to type sycl::context this is a mistake, and it will lead to problems/confusion.
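As a small sketch of that advice (my own example; note the caveat below that sycl::get_pointer_type may or may not work in the CUDA backend):

#include <sycl/sycl.hpp>

int main() {
  // No sycl::context is ever spelled out: the queue's default context is used.
  sycl::queue q{sycl::gpu_selector_v};
  int *ptr = sycl::malloc_device<int>(1024, q);

  sycl::usm::alloc kind = sycl::get_pointer_type(ptr, q.get_context());
  sycl::device dev = sycl::get_pointer_device(ptr, q.get_context());

  sycl::free(ptr, q);
  return (kind == sycl::usm::alloc::device && dev == q.get_device()) ? 0 : 1;
}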

I think it would suffice if the user could at least manually create a single context with all the Nvidia devices.

It doesn't matter in the DPC++ CUDA backend if NVIDIA devices have different sycl::contexts. Once #10737 or similar is merged they will share one anyway, but even then, just don't worry about sycl::context: use queue::get_context() if you have to use sycl::get_pointer_type (although I'm not sure whether sycl::get_pointer_type works in the CUDA backend), and get a list of all the NVIDIA GPUs using a GPU device selector.

Also, if you don't mind, could you explain how the default context extension is involved here? I'm not familiar.

The reason that each CUDA device is currently in its own platform is that this was the only way to be compliant with the default (SYCL) context extension without extensive changes to the CUDA backend: the default sycl::context of a platform has to contain all devices in that platform. If a platform contains more than one device, this is troublesome for the CUDA backend, since it was originally written to map a sycl::context to a CUdevice (and CUcontext). Changing this was a lot of work, and in order not to break the runtime the decision was made at that time to put each CUDA device in a separate platform.

Again, I should stress that unless you have a system containing GPUs from multiple vendors (like AMD and Intel or NVIDIA on the same node), you can safely use a device selector to get all the devices.
And irrespective of this, just never type sycl::context (use queue constructors instead), never even think about sycl::context except where interfaces require it, and then only obtain a sycl::context via queue::get_context().

@hdelan
Contributor

hdelan commented Feb 27, 2024

Is there a timeline for when this will be updated to put all the Nvidia devices in a single context?

This work was ready but got side-tracked for merge due to higher priorities. The PR for it now needs to be updated: #10737 @hdelan do you know what the timeline for this is?

I will be back working on this in a few weeks, so I may have a patch up in UR in a month or so.

@BenBrock

Thanks for the advice, @JackAKirk, I will do a bit of refactoring and see if we can get our code working on the current Nvidia backend. I can switch to allocating and deallocating memory with queues, which should be no problem. I do think ultimately we may need all the devices in the same context for things to work.

We wrote this library before the default context extension was introduced, and at the time ran into multiple bugs if we did not create and maintain a global context. That's the reason we have a context. I'm all in favor of having the library take care of the context for me, but I do want to make sure I'm writing a valid SYCL program.

My primary concern at this point is about get_pointer_type and get_pointer_device---we don't know the device when calling them, which means that we also don't know the context unless we maintain one global context. I will check if calling them with a random Nvidia context works, even though that will technically be violating the SYCL API.

@BenBrock

It appears I almost have things working. I do have to implement a bit of a hack in order for get_pointer_device and get_pointer_type to work with the CUDA backend:

template <std::contiguous_iterator Iter>
sycl::device get_pointer_device(Iter iter) {
  // Try each device's (default) context in turn until the lookup succeeds.
  for (auto &&device : shp::devices()) {
    try {
      return sycl::get_pointer_device(std::to_address(iter),
                                      __detail::queue(device).get_context());
    } catch (...) {
      // Wrong context for this pointer; try the next device.
    }
  }
  assert(false && "pointer does not belong to any known device");
  return sycl::device{};
}

Essentially, I call get_pointer_device with the context of each device in turn until it succeeds. This is obviously suboptimal, but it works.

I am, however, running into another asynchronous error when combining events that appears to be unrelated. I see the following asynchronous errors thrown:

	Native API failed. Native API returns: -34 (PI_ERROR_INVALID_CONTEXT) -34 (PI_ERROR_INVALID_CONTEXT)
	Couldn't wait for host-task's dependencies -34 (PI_ERROR_INVALID_CONTEXT)

These asynchronous errors are thrown when combining events. To combine events, I submit a command group in which I call depends_on with a vector of events and add an empty host task. This command group is submitted on device 0's queue.

inline sycl::event combine_events(const std::vector<sycl::event> &events) {
  auto &&q = __detail::queue(0);
  auto e = q.submit([&](auto &&h) {
    h.depends_on(events);
    h.host_task([] {});
  });

  return e;
}

The events in events are not all launched in the context associated with q, but that shouldn't be an issue, right? Looking over the SYCL spec I see no requirement that the events be associated with the same context as the queue.

I can create my own asynchronous error handler that ignores SYCL exceptions with code 34, and things seem to work, at least for some simple examples, but it seems like I should be able to avoid this error.
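For what it's worth, the workaround described above might look roughly like this (a hypothetical sketch; whether -34 is the right value to compare against error_code::value() is an assumption):

#include <sycl/sycl.hpp>

// Async handler that swallows PI_ERROR_INVALID_CONTEXT (-34) and rethrows
// everything else, attached to the queue used for combining events.
inline auto lenient_handler = [](sycl::exception_list exceptions) {
  for (const std::exception_ptr &e : exceptions) {
    try {
      std::rethrow_exception(e);
    } catch (const sycl::exception &ex) {
      if (ex.code().value() != -34) // ignore only the invalid-context error
        throw;
    }
  }
};

inline sycl::queue make_lenient_queue(const sycl::device &dev) {
  return sycl::queue{dev, lenient_handler};
}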

@BenBrock

Just did a quick test, and combining events using a queue associated with the CPU throws the same error.

@JackAKirk
Contributor

JackAKirk commented Mar 1, 2024

Essentially, launch get_pointer_device with the context for each device until it succeeds. This is obviously suboptimal, but works.

I see the issue. Apologies for this; it is unfortunate that the spec imposes that the context must be passed here. Unfortunately, we will need the rather complicated multi-device context patch for CUDA to be updated for this, which we will do ASAP.

The events in events are not all launched in the context associated with q, but that shouldn't be an issue, right? Looking over the SYCL spec I see no requirement that the events be associated with the same context as the queue.

I can create my own asynchronous error handler that ignores SYCL exceptions with code 34, and things seem to work, at least for some simple examples, but it seems like I should be able to avoid this error.

Yeah, you're right, this is a bug. I've fixed it here: oneapi-src/unified-runtime#1403
If you are building DPC++ from intel/llvm directly, you can use this patch immediately by changing your UR CUDA adapter CMakeLists.txt (i.e. adapters/cuda/CMakeLists.txt) similarly to this commit:
JackAKirk@9c84333

But just replace the commit id with this one: oneapi-src/unified-runtime@2077bc6

Hopefully that should get you further, although since I can't see your complete code I can't say whether or not this will be the end of your troubles!
