NVIDIA MultiGPU support for SYCL #6749
Please try DPCT when the reproducer is available.
Hi. Without a reproducer I can't comment on your specific use case, but I'll list the ways you can currently use multiple GPUs with the DPC++ CUDA backend. Note that there is no fundamental limitation for SYCL on Nvidia multi-GPU hardware, and we expect full support in the future. However, right now the DPC++ CUDA backend does not support direct GPU peer-to-peer capabilities. We are aiming to add multi-GPU peer-to-peer access and copy features ASAP; this is a priority for us. Note that if you use MPI with DPC++ for CUDA you can already access essentially all the multi-GPU features.
2b. With buffers, relying on the runtime to manage the copies, as in this test: https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Basic/buffer/buffer_dev_to_dev.cpp

In all these cases the trick is to get a list of all the CUDA devices available in your system (although if using MPI there are ways to select the GPUs and map them to your MPI ranks at runtime, after compilation). Currently we have a temporary situation where each CUDA device is listed within its own platform: this was done to conform to a SYCL specification constraint on the "default_selector" behaviour. Eventually all CUDA devices should be listed in a single platform, and you will be able to simply select from the list of devices in that platform; a sketch of enumerating the devices today is shown below.
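For illustration, a minimal sketch of gathering every GPU device even while each CUDA device sits in its own platform (standard SYCL APIs only; the in-order queue property is one reasonable choice, not required):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  // Collect every GPU device across all platforms; with the current CUDA
  // backend, each Nvidia device appears in its own platform.
  std::vector<sycl::device> gpus;
  for (const auto &platform : sycl::platform::get_platforms())
    for (const auto &dev : platform.get_devices(sycl::info::device_type::gpu))
      gpus.push_back(dev);

  // One queue per device; queues[i] targets gpus[i].
  std::vector<sycl::queue> queues;
  for (const auto &dev : gpus)
    queues.emplace_back(dev, sycl::property::queue::in_order{});
}
```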
Let us know if you want more information at this stage.
If the source (arr1) of the copy is located on Device 0, then the queue is Queues[0]. Is that right?
It actually doesn't matter: the implementation will allow any queue to copy any USM pointers.
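For example, a minimal sketch of a device-to-device USM copy issued from either queue (a hypothetical two-GPU setup; standard SYCL USM APIs):

```cpp
#include <sycl/sycl.hpp>

int main() {
  // Assumes at least two GPU devices are present.
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::queue q0{gpus[0]}, q1{gpus[1]};

  int *src = sycl::malloc_device<int>(1024, q0); // allocated on gpus[0]
  int *dst = sycl::malloc_device<int>(1024, q1); // allocated on gpus[1]

  // Per the comment above, either queue may issue the copy between the
  // two allocations; here the destination's queue performs it.
  q1.memcpy(dst, src, 1024 * sizeof(int)).wait();

  sycl::free(src, q0);
  sycl::free(dst, q1);
}
```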
I migrated the CUDA simpleP2P example to SYCL. Running the program shows "illegal memory accesses" when the first kernel is executed on a device.
I expect it is because you are trying to access memory that is on one device from a queue that uses a different device.
To do this we would need a finished P2P extension. This is very easy to implement for a given backend: I actually implemented it for the cuda backend here: JackAKirk@3b36bc4. The challenge, as ever, is to make a DPC++ runtime / P2P extension that is appropriate in general, or at least for the CUDA, HIP, and level_zero backends. This requires a concerted effort to resolve some open questions.
I can see the SYCL/CUDA bandwidth numbers with the offending kernels commented out. I will wait for the pull request that supports the P2P example. I know little about SYCL contexts, and hope users won't need to know about them. Thanks.
SYCL can be higher level and represent things slightly differently from the back-ends. This is necessary for portability anyway, since not all the back-ends have the same abstractions.
I realize that the example may be from https://github.com/NVIDIA/cuda-samples/blob/master/Samples/5_Domain_Specific/MonteCarloMultiGPU/
This implements the current extension doc from #6104 in the CUDA backend only. Fixes #7543. Fixes #6749.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com>
Co-authored-by: JackAKirk <chezjakirk@gmail.com>
Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>
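For reference, a hedged sketch of how the extension from #6104 is used, with API names taken from that extension document (proposals evolve, so consult the current doc):

```cpp
#include <sycl/sycl.hpp>

int main() {
  // Assumes at least two GPU devices are present.
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::device d0 = gpus[0], d1 = gpus[1];

  // Query, enable, and later disable direct peer access between devices.
  if (d0.ext_oneapi_can_access_peer(
          d1, sycl::ext::oneapi::peer_access::access_supported)) {
    d0.ext_oneapi_enable_peer_access(d1);
    // Kernels submitted to d0 may now dereference USM allocated on d1.
    d0.ext_oneapi_disable_peer_access(d1);
  }
}
```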
@JackAKirk What's the current status of P2P on Nvidia? It looks to me like on an H100 system my devices are in the same SYCL platform and P2P works, but with V100 and A100 they are all in different SYCL platforms (and thus no P2P).
P2P has been tested on A100 and should work on any Nvidia devices supporting PCIe/NVLink. It is true that the Nvidia backend still has the hack where each device is placed in a separate platform in order to comply with the default context extension. However, this doesn't affect P2P at all. It just means that in the unusual circumstance that you have a system with GPUs from different vendors (e.g. AMD and Nvidia), you have to be careful about selecting your devices: you can't get the Nvidia GPUs just by requiring that your device selector is a GPU selector; you have to populate a device list as is done in these tests: https://github.com/intel/llvm/tree/sycl/sycl/test-e2e/USM/P2P (see the sketch below).
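In that spirit, a sketch of populating a device list restricted to the CUDA backend rather than relying on a plain GPU selector (one reasonable approach, not the only one):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

std::vector<sycl::device> cuda_gpus() {
  std::vector<sycl::device> devs;
  for (const auto &plt : sycl::platform::get_platforms())
    if (plt.get_backend() == sycl::backend::ext_oneapi_cuda)
      for (const auto &d : plt.get_devices())
        devs.push_back(d); // keep only devices from CUDA platforms
  return devs;
}
```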
Is there a timeline for when this will be updated to put all the Nvidia devices in a single context? In the multi-GPU code I'm working on, we rely on some features that require this. I think it would suffice if the user could at least manually create a single context with all the Nvidia devices. Also, if you don't mind, could you explain how the default context extension is involved here? I'm not familiar with it.
This work was ready but got side-tracked due to higher priorities before it could be merged. The PR for it now needs to be updated: #10737
There are a few things you should be aware of. We can simplify the discussion if I assume you are only using USM, but the same points also apply to buffers. It doesn't matter in the dpc++ cuda backend if the Nvidia devices each have a different context: P2P access and copies still work across them.
The reason that each cuda device is in its own platform currently is that this was the only way to be compliant with the default (sycl) context extension without extensive changes to the cuda backend: the default sycl::context of a platform has to contain all devices in that platform. If the platform contains more than one device, this was troublesome for the cuda backend, since it was originally written to map a sycl::context to a CUdevice (and CUcontext). Changing this was a lot of work, and in order to not break the runtime the decision was made at that time to put each cuda device in a separate platform. Again, I should stress that unless you have a system containing GPUs from multiple vendors (like amd and intel or nvidia on the same node), you can safely use a device selector to get all the devices.
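To illustrate the constraint, a sketch using the DPC++-specific default-context extension (assuming `ext_oneapi_get_default_context` is available in your build):

```cpp
#include <sycl/sycl.hpp>
#include <cassert>

int main() {
  for (const auto &plt : sycl::platform::get_platforms()) {
    sycl::context def_ctx = plt.ext_oneapi_get_default_context();
    // The default context must contain every device of its platform; in the
    // current cuda backend each platform therefore holds a single device.
    assert(def_ctx.get_devices().size() == plt.get_devices().size());
  }
}
```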
I will be back working on this in a few weeks, so I may have a patch up in UR in a month or so.
Thanks for the advice, @JackAKirk, I will do a bit of refactoring and see if we can get our code working on the current Nvidia backend. I can switch to allocating and deallocating memory with queues, which should be no problem. I do think we may ultimately need all the devices in the same context for things to work. We wrote this library before the default context extension was introduced, and at the time we ran into multiple bugs if we did not create and maintain a global context; that's the reason we have one. I'm all in favor of having the library take care of the context for me, but I do want to make sure I'm writing a valid SYCL program. My primary concern at this point is about `get_pointer_device`.
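As a sketch of that refactor, allocating against a queue instead of a manually managed global context (`n`, `dev`, and `global_ctx` are hypothetical placeholders):

```cpp
#include <sycl/sycl.hpp>

void refactor_sketch(const sycl::device &dev, const sycl::context &global_ctx) {
  std::size_t n = 1024; // hypothetical allocation size
  sycl::queue q{dev};

  // Before: allocation tied to the manually managed global context.
  double *a = sycl::malloc_device<double>(n, dev, global_ctx);
  // After: allocation tied to the queue (and therefore the queue's context).
  double *b = sycl::malloc_device<double>(n, q);

  sycl::free(a, global_ctx);
  sycl::free(b, q);
}
```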
It appears I almost have things working. I do have to implement a bit of a hack in order for `get_pointer_device` to work:

```cpp
template <std::contiguous_iterator Iter>
sycl::device get_pointer_device(Iter iter) {
  // Try each device's context until one recognizes the pointer.
  for (auto &&device : shp::devices()) {
    try {
      return sycl::get_pointer_device(std::to_address(iter),
                                      __detail::queue(device).get_context());
    } catch (...) {
    }
  }
  assert(false);
}
```

Essentially, I call `sycl::get_pointer_device` with each device's context in turn until one succeeds.

I am, however, running into another asynchronous error, which appears to be unrelated, when combining events.
These asynchronous errors are thrown when combining events. To combine events, I create a command group handler on which I issue a `host_task` that depends on all of them:

```cpp
inline sycl::event combine_events(const std::vector<sycl::event> &events) {
  auto &&q = __detail::queue(0);
  auto e = q.submit([&](auto &&h) {
    h.depends_on(events); // wait on every input event
    h.host_task([] {});   // empty host task; its event stands for the group
  });
  return e;
}
```

The events being combined may come from queues on different devices. I can create my own asynchronous error handler that ignores SYCL exceptions with code 34, and things seem to work, at least for some simple examples, but it seems like I should be able to avoid this error.
Just did a quick test, and combining events using a queue associated with the CPU throws the same error.
I see the issue. Apologies for this; it is unfortunate that the spec requires the context to be passed here and that this constraint is imposed. Unfortunately, we will need the rather complicated multi-device context patch for cuda to be updated for this, which we will do ASAP.
Yeah, you're right, this is a bug. I've fixed it here: oneapi-src/unified-runtime#1403. You can try it out by just replacing the commit id with this one: oneapi-src/unified-runtime@2077bc6. Hopefully that should get you further, although since I can't see your complete code I can't say whether or not this will be the end of your troubles!
Hi,
I migrated CUDA code to SYCL. The CUDA code works fine in an NVIDIA multi-GPU environment, but the migrated SYCL code results in a segmentation fault in the NVIDIA multi-GPU (2 or 4 GPU) environment. Both the CUDA and SYCL code work in a single-GPU environment.
SYCL code on NVIDIA GPU (single):
Steps to reproduce:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda MonteCarlo_kernel.cpp MonteCarloMultiGPU.cpp MonteCarlo_reduction.hpp MonteCarlo_gold.cpp multithreading.cpp
The above command was used to compile the SYCL code for NVIDIA.
SYCL code on NVIDIA Multi GPU:
I validated the SYCL code in an Intel multi-GPU environment. It works fine there.
Is there any limitation for SYCL on NVIDIA multi-GPU hardware?