
[SYCL][CUDA] Cuda adapter multi device context #10737

Closed (wanted to merge 32 commits)

Conversation

@hdelan (Contributor) commented Aug 8, 2023

This patch enables a multi device context in CUDA adapter. This also enables a multi-device platform as a side effect. Management of memory across buffers within the same context is now handled in the CUDA adapter.
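As an illustration only (this is not code from the PR; all names below are hypothetical), a multi-device context on the CUDA backend could then be constructed and shared between queues like this:

```cpp
#include <sycl/sycl.hpp>

int main() {
  // With a multi-device platform, all CUDA devices can share one context.
  sycl::platform Platform{sycl::gpu_selector_v};
  sycl::context Ctx{Platform.get_devices()};

  // Queues on different devices, both bound to the same context...
  sycl::queue Q0{Ctx, Platform.get_devices().front()};
  sycl::queue Q1{Ctx, Platform.get_devices().back()};

  // ...so a single buffer can be used from both queues, with the adapter
  // managing per-device allocations and cross-device dependencies.
  sycl::buffer<int, 1> Buf{sycl::range<1>{1024}};
  return 0;
}
```

This sketch assumes a SYCL 2020 implementation with a CUDA backend available; it is meant only to show the user-visible shape of a multi-device context.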

@hdelan hdelan requested review from a team as code owners August 8, 2023 15:32
@hdelan (Contributor, Author) commented Aug 8, 2023

Some basic perf results on Geforce 1050Ti:

| Benchmark | Single-device context | Multi-device context | Speedup |
| --- | --- | --- | --- |
| backprop-sycl | 0.048693 | 0.048581 | 1.002305428048002 |
| gaussian-sycl | 99358 | 99268 | 1.000906636579764 |
| histogram-sycl | 9767.172 | 9759.001 | 1.000837278323878 |
| nbody-sycl | 1.70932 | 1.71094 | 0.9990531520684536 |
| su3-sycl | 1.966407 | 1.979619 | 0.9933259884856631 |
| adv-sycl | 928737 | 932120 | 0.9963706389735227 |
| aidw-sycl | 0.824413 | 0.818178 | 1.007620591118314 |
| all-pairs-distance-sycl | 5026.992188 | 5014.033936 | 1.002584396548847 |
| aobench-sycl | 0.001803 | 0.001803 | 1 |
| asmooth-sycl | 0.170765 | 0.17031 | 1.002671598849157 |
| asta-sycl | 0.017901 | 0.017908 | 0.9996091132454769 |
| bezier-surface-sycl | 111 | 111 | 1 |
| bilateral-sycl | 28.169319 | 28.181039 | 0.9995841175337787 |
| binomial-sycl | 8989.490234000001 | 8973.286133 | 1.001805815702278 |
| bitonic-sort-sycl | 95.02151499999999 | 95.005402 | 1.000169600882274 |
| bonds-sycl | 986.8431 | 986.6414 | 1.000204430910765 |
| bsearch-sycl | 0.000153812 | 0.000152007 | 1.011874453150184 |
| bspline-vgh-sycl | 0.206516 | 0.207474 | 0.9953825539585683 |
| burger-sycl | 0.734372 | 0.733031 | 1.001829390571477 |
| cbsfil-sycl | 0.099216 | 0.099151 | 1.000655565753245 |
| ccsd-trpdrv-sycl | 0.000791 | 0.000701 | 1.128388017118402 |
| clenergy-sycl | 4.17859 | 4.187436 | 0.9978874901013413 |
| cobahh-sycl | 555.821716 | 556.473145 | 0.9988293612982887 |
| complex-sycl | 0.077446 | 0.077444 | 1.000025825112339 |
| convolutionSeparable-sycl | 0.014627 | 0.014627 | 1 |
| crs-sycl | 0.036454 | 0.036363 | 1.002502543794517 |
| dp-sycl | 0.003665 | 0.003665 | 1 |
| eigenvalue-sycl | 1678.97 | 1679.49 | 0.9996903821993581 |
| entropy-sycl | 0.017896 | 0.017893 | 1.000167663332029 |
| epistasis-sycl | 0.00409425 | 0.00410456 | 0.9974881595103982 |
| ert-sycl | 145.247675 | 145.222663 | 1.000172232070968 |
| expdist-sycl | 0.274393 | 0.274343 | 1.000182253602242 |
| extend2-sycl | 2712.250732 | 2713.319824 | 0.9996059837876302 |
| fft-sycl | 0.080722 | 0.0807631 | 0.9994911042295306 |
| fsm-sycl | 0.719418 | 0.719235 | 1.000254437005986 |
| gpp-sycl | 0.11341 | 0.113404 | 1.00005290818666 |
| ising-sycl | 0.430908 | 0.431015 | 0.9997517487790449 |
| iso2dfd-sycl | 53.7751 | 53.8724 | 0.9981938803543188 |
| jacobi-sycl | 3.72 | 3.716 | 1.001076426264801 |
| laplace-sycl | 4.917153 | 4.864695 | 1.010783409854061 |
| lavaMD-sycl | 0.086104996502 | 0.08632799983 | 0.9974167902831161 |
| merkle-sycl | 40380.22 | 40349.03 | 1.00077300495204 |
| minimod-sycl | 0.863004 | 0.863306 | 0.9996501819748733 |
| minisweep-sycl | 0.171 | 0.171 | 1 |
| mnist-sycl | 0.003968 | 0.003961 | 1.001767230497349 |
| mr-sycl | 29.567129 | 29.555849 | 1.000381650346096 |
| murmurhash3-sycl | 0.071034 | 0.070393 | 1.009106019064396 |
| nbnxm-sycl | 385.053436 | 384.836548 | 1.000563584724806 |
| page-rank-sycl | 0.005491 | 0.005493 | 0.9996359002366649 |
| particle-diffusion-sycl | 0.000456054 | 0.00045785 | 0.9960773178988752 |
| particlefilter-sycl | 0.790151 | 0.7894679999999999 | 1.000865139562338 |
| pathfinder-sycl | 0.07778500000000001 | 0.07881100000000001 | 0.9869815127329942 |
| pns-sycl | 0.49 | 0.49 | 1 |
| s3d-sycl | 2.256283 | 2.256273 | 1.000004432087783 |
| tridiagonal-sycl | 0.02788 | 0.02791 | 0.9989251164457184 |
| vanGenuchten-sycl | 0.015396 | 0.015396 | 1 |
| attention-sycl | 28.927914 | 28.930927 | 0.9998958553937798 |
| bh-sycl | 0.0113 | 0.0113 | 1 |
| bn-sycl | 11.096324 | 11.108195 | 0.9989313295274344 |
| sheath-sycl | 5.34 | 5.35 | 0.9981308411214954 |

Mean speedup: 1.0023
Geomean speedup: 1.0022

I don't think we can reasonably claim that this patch actually gives a speedup, but we can at least be assured that it does not cause a serious performance regression.

@hdelan (Contributor, Author) commented Aug 8, 2023

This patch also changes the ScopedContext to ScopedDevice, in line with #10672

@hdelan hdelan closed this Aug 9, 2023
@hdelan hdelan reopened this Aug 9, 2023
Outdated review threads (all resolved) on:

- sycl/plugins/unified_runtime/ur/adapters/cuda/context.cpp
- sycl/plugins/unified_runtime/ur/adapters/cuda/context.hpp
- sycl/plugins/unified_runtime/ur/adapters/cuda/enqueue.cpp (four threads)
- sycl/plugins/unified_runtime/ur/adapters/cuda/event.hpp
- sycl/plugins/unified_runtime/ur/adapters/cuda/memory.cpp (two threads)
@jchlanda (Contributor) commented:

I think folding in the rename of ScopedContext introduced a lot more noise, and should have been avoided, especially since it has a dedicated PR (#10672). Just a side note, nothing major.

@hdelan (Contributor, Author) commented Aug 10, 2023

> I think folding in the rename of ScopedContext introduced a lot more noise, and should have been avoided, especially since it has a dedicated PR (#10672). Just a side note, nothing major.

Yeah, I agree, but semantic changes were happening in this PR: the 1-to-1 mapping of `ur_context_handle_t_` to a native context was being removed for the first time, so it seemed sensible to change the naming to reflect this.

@hdelan (Contributor, Author) commented Aug 14, 2023

> We'll also need this for HIP. Do you think there's anything we can reuse there, or will we just have to mirror the changes?

We should be able to reuse a lot of the logic. I will start working on the HIP version as soon as this merges.

From the diff under review:

```cpp
UR_ASSERT(phNativeContext, UR_RESULT_ERROR_INVALID_NULL_POINTER);

*phNativeContext =
    reinterpret_cast<ur_native_handle_t>(hContext->getDevices()[0]);
```
Reviewer (Contributor):
Could we make the `ur_native_handle_t` a list of contexts here and do the `[0]` in the SYCL runtime? (I think we still have two interop interfaces for CUDA, and one of them returns a list of contexts.)

Contributor Author:

OK sure

Outdated review thread (resolved) on sycl/plugins/unified_runtime/ur/adapters/cuda/enqueue.cpp
From the diff under review (the leading `if` is reconstructed from context):

```cpp
if (ur_event_handle_t MemDepEvent =
        ur_cast<ur_buffer_ *>(MemArg.Mem)->LastEventWritingToMemObj;
    MemDepEvent && std::find(DepEvents.begin(), DepEvents.end(),
                             MemDepEvent) == DepEvents.end()) {
  DepEvents.push_back(MemDepEvent);
}
```
Reviewer (Contributor):

Since CUDA streams are in-order, would it be simpler to just enqueue the copies on the same stream as the commands? That way we might avoid dealing with extra events and synchronization, and just piggyback on the regular command's events and synchronization.

Contributor Author:

We can enqueue the copies on the same stream, but when the dependent events come from kernels on a different device we can't rely on stream ordering. We need to use CUevents instead.
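For context, a minimal sketch of the CUevent-based ordering being described, using CUDA driver API calls; the function and parameter names here are hypothetical, not the adapter's:

```cpp
#include <cuda.h>

// Make LaterStream (possibly belonging to a different device's context)
// wait for the work already submitted to EarlierStream, without
// serializing both workloads onto a single in-order stream.
void waitAcrossStreams(CUstream EarlierStream, CUstream LaterStream,
                       CUevent Event) {
  cuEventRecord(Event, EarlierStream);      // mark the point after prior work
  cuStreamWaitEvent(LaterStream, Event, 0); // later work waits on that point
}
```

This is only illustrative; it requires a CUDA driver and initialized contexts, and error checking is omitted.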

Reviewer (Contributor):

Wouldn't that be covered by the regular command's events?

@hdelan (Contributor, Author) commented Aug 23, 2023:

No, the regular command's events are only the ones passed to the adapter by the runtime, which may come from a queue in a different SYCL context. Since we are unifying the context, the SYCL runtime knows it doesn't need to pass dependency events within the same context, because the PI will handle those dependencies, as OpenCL does.
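The dependency tracking being discussed can be modeled in plain C++ (a standalone sketch with hypothetical types, not the adapter's actual `ur_event_handle_t`): each buffer remembers its last-writing event, and an enqueue adds that event to its dependency list only if it is not already present.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for ur_event_handle_t.
struct Event { int Id; };

// Add a buffer's last-writing event to the dependency list, unless it is
// null or already listed (mirrors the dedup in the diff excerpt above).
void addMemDepEvent(std::vector<Event *> &DepEvents, Event *MemDepEvent) {
  if (MemDepEvent && std::find(DepEvents.begin(), DepEvents.end(),
                               MemDepEvent) == DepEvents.end())
    DepEvents.push_back(MemDepEvent);
}
```

The adapter can then wait on these events regardless of which device's queue produced them, which is what makes the unified context work.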

Reviewer (Contributor):

Ooh, of course, that makes sense. Let me go through this again in a bit.

Three more outdated review threads (resolved) on sycl/plugins/unified_runtime/ur/adapters/cuda/enqueue.cpp
@jandres742 (Contributor) commented:

> Some basic perf results on Geforce 1050Ti: [full benchmark table quoted above] Mean speedup: 1.0023; geomean speedup: 1.0022. I don't think we can reasonably claim that this patch actually gives a speedup, but we can at least be assured that it does not cause a serious performance regression.

@hdelan: were we expecting improvements or regressions in performance, or neither?

@hdelan (Contributor, Author) commented Aug 30, 2023

> @hdelan: were we expecting improvements or regressions in performance, or neither?

We were hoping for neither. However, some enqueue operations now need to take mutexes, and `urEnqueueKernelLaunch` now allocates a new vector of events on every enqueue, so there was some fear of performance degradation. (Another experiment, which used a cached vector guarded by mutexes, was a lot slower.)
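A standalone sketch of the trade-off being described (hypothetical names, not the adapter's code): each enqueue builds its own local vector of dependency events, taking a per-buffer mutex only while reading the shared last-writer state, rather than sharing one cached vector under a long-held lock.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct Event { int Id; };

struct Buffer {
  std::mutex Mutex; // guards LastEventWritingToMemObj across devices
  Event *LastEventWritingToMemObj = nullptr;
};

// A fresh vector is allocated per enqueue; only the reads of shared state
// are serialized, so concurrent enqueues contend only per buffer.
std::vector<Event *> collectDepEvents(const std::vector<Buffer *> &Args) {
  std::vector<Event *> DepEvents;
  for (Buffer *B : Args) {
    std::lock_guard<std::mutex> Guard(B->Mutex);
    if (B->LastEventWritingToMemObj)
      DepEvents.push_back(B->LastEventWritingToMemObj);
  }
  return DepEvents;
}
```

The per-enqueue allocation is the cost mentioned above; the alternative (a shared cached vector) would have to stay locked for the whole enqueue, which the comment reports was measurably slower.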

@JackAKirk (Contributor) commented Oct 2, 2023

Fixes #7808

againull pushed a commit that referenced this pull request on Oct 23, 2023:

> The `sycl::context` does not map clearly to a native context for the CUDA and HIP backends. This is especially true now that we are adding support for multi-device contexts (#10737). It would be good to start this deprecation process. PRs to oneMKL and oneDNN to follow.
@jinz2014 (Contributor) commented Jan 8, 2024

@sergey-semenov Could you please review the PR? Thanks.

@hdelan (Contributor, Author) commented Jan 9, 2024

Hi @jinz2014, this PR is outdated. I will make a new PR for the CUDA adapter in UR in about a month or so.
