
[SYCL][CUDA] Cuda adapter multi device context #10737

Closed (wanted to merge 32 commits)

Conversation

@hdelan (Contributor) commented Aug 8, 2023

This patch enables a multi device context in CUDA adapter. This also enables a multi-device platform as a side effect. Management of memory across buffers within the same context is now handled in the CUDA adapter.
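As an illustration only (this is not code from the PR; all names below are hypothetical), a multi-device context on the CUDA backend could then be constructed and shared between queues like this:

```cpp
#include <sycl/sycl.hpp>

int main() {
  // With a multi-device platform, all CUDA devices can share one context.
  sycl::platform Platform{sycl::gpu_selector_v};
  sycl::context Ctx{Platform.get_devices()};

  // Queues on different devices, both bound to the same context...
  sycl::queue Q0{Ctx, Platform.get_devices().front()};
  sycl::queue Q1{Ctx, Platform.get_devices().back()};

  // ...so a single buffer can be used from both queues, with the adapter
  // managing per-device allocations and cross-device dependencies.
  sycl::buffer<int, 1> Buf{sycl::range<1>{1024}};
  return 0;
}
```

This sketch assumes a SYCL 2020 implementation with a CUDA backend available; it is meant only to show the user-visible shape of a multi-device context.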

@hdelan hdelan requested review from a team as code owners August 8, 2023 15:32
@hdelan (Contributor, Author) commented Aug 8, 2023

Some basic perf results on Geforce 1050Ti:

| Benchmark | Single-device context | Multi-device context | Speedup |
| --- | --- | --- | --- |
| backprop-sycl | 0.048693 | 0.048581 | 1.002305428048002 |
| gaussian-sycl | 99358 | 99268 | 1.000906636579764 |
| histogram-sycl | 9767.172 | 9759.001 | 1.000837278323878 |
| nbody-sycl | 1.70932 | 1.71094 | 0.9990531520684536 |
| su3-sycl | 1.966407 | 1.979619 | 0.9933259884856631 |
| adv-sycl | 928737 | 932120 | 0.9963706389735227 |
| aidw-sycl | 0.824413 | 0.818178 | 1.007620591118314 |
| all-pairs-distance-sycl | 5026.992188 | 5014.033936 | 1.002584396548847 |
| aobench-sycl | 0.001803 | 0.001803 | 1 |
| asmooth-sycl | 0.170765 | 0.17031 | 1.002671598849157 |
| asta-sycl | 0.017901 | 0.017908 | 0.9996091132454769 |
| bezier-surface-sycl | 111 | 111 | 1 |
| bilateral-sycl | 28.169319 | 28.181039 | 0.9995841175337787 |
| binomial-sycl | 8989.490234000001 | 8973.286133 | 1.001805815702278 |
| bitonic-sort-sycl | 95.02151499999999 | 95.005402 | 1.000169600882274 |
| bonds-sycl | 986.8431 | 986.6414 | 1.000204430910765 |
| bsearch-sycl | 0.000153812 | 0.000152007 | 1.011874453150184 |
| bspline-vgh-sycl | 0.206516 | 0.207474 | 0.9953825539585683 |
| burger-sycl | 0.734372 | 0.733031 | 1.001829390571477 |
| cbsfil-sycl | 0.099216 | 0.099151 | 1.000655565753245 |
| ccsd-trpdrv-sycl | 0.000791 | 0.000701 | 1.128388017118402 |
| clenergy-sycl | 4.17859 | 4.187436 | 0.9978874901013413 |
| cobahh-sycl | 555.821716 | 556.473145 | 0.9988293612982887 |
| complex-sycl | 0.077446 | 0.077444 | 1.000025825112339 |
| convolutionSeparable-sycl | 0.014627 | 0.014627 | 1 |
| crs-sycl | 0.036454 | 0.036363 | 1.002502543794517 |
| dp-sycl | 0.003665 | 0.003665 | 1 |
| eigenvalue-sycl | 1678.97 | 1679.49 | 0.9996903821993581 |
| entropy-sycl | 0.017896 | 0.017893 | 1.000167663332029 |
| epistasis-sycl | 0.00409425 | 0.00410456 | 0.9974881595103982 |
| ert-sycl | 145.247675 | 145.222663 | 1.000172232070968 |
| expdist-sycl | 0.274393 | 0.274343 | 1.000182253602242 |
| extend2-sycl | 2712.250732 | 2713.319824 | 0.9996059837876302 |
| fft-sycl | 0.080722 | 0.0807631 | 0.9994911042295306 |
| fsm-sycl | 0.719418 | 0.719235 | 1.000254437005986 |
| gpp-sycl | 0.11341 | 0.113404 | 1.00005290818666 |
| ising-sycl | 0.430908 | 0.431015 | 0.9997517487790449 |
| iso2dfd-sycl | 53.7751 | 53.8724 | 0.9981938803543188 |
| jacobi-sycl | 3.72 | 3.716 | 1.001076426264801 |
| laplace-sycl | 4.917153 | 4.864695 | 1.010783409854061 |
| lavaMD-sycl | 0.086104996502 | 0.08632799983 | 0.9974167902831161 |
| merkle-sycl | 40380.22 | 40349.03 | 1.00077300495204 |
| minimod-sycl | 0.863004 | 0.863306 | 0.9996501819748733 |
| minisweep-sycl | 0.171 | 0.171 | 1 |
| mnist-sycl | 0.003968 | 0.003961 | 1.001767230497349 |
| mr-sycl | 29.567129 | 29.555849 | 1.000381650346096 |
| murmurhash3-sycl | 0.071034 | 0.070393 | 1.009106019064396 |
| nbnxm-sycl | 385.053436 | 384.836548 | 1.000563584724806 |
| page-rank-sycl | 0.005491 | 0.005493 | 0.9996359002366649 |
| particle-diffusion-sycl | 0.000456054 | 0.00045785 | 0.9960773178988752 |
| particlefilter-sycl | 0.790151 | 0.7894679999999999 | 1.000865139562338 |
| pathfinder-sycl | 0.07778500000000001 | 0.07881100000000001 | 0.9869815127329942 |
| pns-sycl | 0.49 | 0.49 | 1 |
| s3d-sycl | 2.256283 | 2.256273 | 1.000004432087783 |
| tridiagonal-sycl | 0.02788 | 0.02791 | 0.9989251164457184 |
| vanGenuchten-sycl | 0.015396 | 0.015396 | 1 |
| attention-sycl | 28.927914 | 28.930927 | 0.9998958553937798 |
| bh-sycl | 0.0113 | 0.0113 | 1 |
| bn-sycl | 11.096324 | 11.108195 | 0.9989313295274344 |
| sheath-sycl | 5.34 | 5.35 | 0.9981308411214954 |

Mean speedup: 1.0023
Geomean speedup: 1.0022

I don't think we can reasonably claim that this patch actually gives a speedup, but we can at least be assured that it does not cause a serious performance regression.

@hdelan (Contributor, Author) commented Aug 8, 2023

This patch also changes the ScopedContext to ScopedDevice, in line with #10672

@hdelan hdelan closed this Aug 9, 2023
@hdelan hdelan reopened this Aug 9, 2023
Outdated review threads (all resolved) on:

- sycl/plugins/unified_runtime/ur/adapters/cuda/context.cpp
- sycl/plugins/unified_runtime/ur/adapters/cuda/context.hpp
- sycl/plugins/unified_runtime/ur/adapters/cuda/enqueue.cpp (four threads)
- sycl/plugins/unified_runtime/ur/adapters/cuda/event.hpp
- sycl/plugins/unified_runtime/ur/adapters/cuda/memory.cpp (two threads)
@jchlanda (Contributor) commented:

I think folding in the rename of ScopedContext introduced a lot more noise, and should have been avoided, especially since it has a dedicated PR (#10672). Just a side note, nothing major.

@hdelan (Contributor, Author) commented Aug 10, 2023

> I think folding in the rename of ScopedContext introduced a lot more noise, and should have been avoided, especially since it has a dedicated PR (#10672). Just a side note, nothing major.

Yeah, I agree, but semantic changes were happening in this PR: the 1-to-1 mapping of `ur_context_handle_t_` to a native context was being removed for the first time, so it seemed sensible to change the naming to reflect this.

@hdelan (Contributor, Author) commented Aug 14, 2023

> We'll also need this for HIP. Do you think there's anything we can reuse there, or will we just have to mirror the changes?

We should be able to reuse a lot of the logic. I will start working on the HIP version as soon as this merges.

From the diff under review:

```cpp
UR_ASSERT(phNativeContext, UR_RESULT_ERROR_INVALID_NULL_POINTER);

*phNativeContext =
    reinterpret_cast<ur_native_handle_t>(hContext->getDevices()[0]);
```
Reviewer (Contributor):
Could we make the `ur_native_handle_t` a list of contexts here and do the `[0]` in the SYCL runtime? (I think we still have two interop interfaces for CUDA, and one of them returns a list of contexts.)

Contributor Author:

OK sure

Outdated review thread (resolved) on sycl/plugins/unified_runtime/ur/adapters/cuda/enqueue.cpp
From the diff under review (the leading `if` is reconstructed from context):

```cpp
if (ur_event_handle_t MemDepEvent =
        ur_cast<ur_buffer_ *>(MemArg.Mem)->LastEventWritingToMemObj;
    MemDepEvent && std::find(DepEvents.begin(), DepEvents.end(),
                             MemDepEvent) == DepEvents.end()) {
  DepEvents.push_back(MemDepEvent);
}
```
Reviewer (Contributor):

Since CUDA streams are in-order, would it be simpler to just enqueue the copies on the same stream as the commands? That way we might avoid dealing with extra events and synchronization, and just piggyback on the regular command's events and synchronization.

Contributor Author:

We can enqueue the copies on the same stream, but when the dependent events come from kernels on a different device we can't rely on stream ordering. We need to use CUevents instead.
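For context, a minimal sketch of the CUevent-based ordering being described, using CUDA driver API calls; the function and parameter names here are hypothetical, not the adapter's:

```cpp
#include <cuda.h>

// Make LaterStream (possibly belonging to a different device's context)
// wait for the work already submitted to EarlierStream, without
// serializing both workloads onto a single in-order stream.
void waitAcrossStreams(CUstream EarlierStream, CUstream LaterStream,
                       CUevent Event) {
  cuEventRecord(Event, EarlierStream);      // mark the point after prior work
  cuStreamWaitEvent(LaterStream, Event, 0); // later work waits on that point
}
```

This is only illustrative; it requires a CUDA driver and initialized contexts, and error checking is omitted.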

Reviewer (Contributor):

Wouldn't that be covered by the regular command's events?

@hdelan (Contributor, Author) commented Aug 23, 2023:

No, the regular command's events are only the ones passed to the adapter by the runtime, which may come from a queue in a different SYCL context. Since we are unifying the context, the SYCL runtime knows it doesn't need to pass dependency events within the same context, because the PI will handle those dependencies, as OpenCL does.
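The dependency tracking being discussed can be modeled in plain C++ (a standalone sketch with hypothetical types, not the adapter's actual `ur_event_handle_t`): each buffer remembers its last-writing event, and an enqueue adds that event to its dependency list only if it is not already present.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for ur_event_handle_t.
struct Event { int Id; };

// Add a buffer's last-writing event to the dependency list, unless it is
// null or already listed (mirrors the dedup in the diff excerpt above).
void addMemDepEvent(std::vector<Event *> &DepEvents, Event *MemDepEvent) {
  if (MemDepEvent && std::find(DepEvents.begin(), DepEvents.end(),
                               MemDepEvent) == DepEvents.end())
    DepEvents.push_back(MemDepEvent);
}
```

The adapter can then wait on these events regardless of which device's queue produced them, which is what makes the unified context work.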

Reviewer (Contributor):

Ooh, of course, that makes sense. Let me go through this again in a bit.

Three more outdated review threads (resolved) on sycl/plugins/unified_runtime/ur/adapters/cuda/enqueue.cpp
@jandres742 (Contributor) commented:

> Some basic perf results on Geforce 1050Ti: [full benchmark table quoted above] Mean speedup: 1.0023; geomean speedup: 1.0022. I don't think we can reasonably claim that this patch actually gives a speedup, but we can at least be assured that it does not cause a serious performance regression.

@hdelan: were we expecting improvements or regressions in performance, or neither?

@hdelan (Contributor, Author) commented Aug 30, 2023

> @hdelan: were we expecting improvements or regressions in performance, or neither?

We were hoping for neither. However, some enqueue operations now need to take mutexes, and `urEnqueueKernelLaunch` now allocates a new vector of events on every enqueue, so there was some fear of performance degradation. (Another experiment, which used a cached vector guarded by mutexes, was a lot slower.)
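A standalone sketch of the trade-off being described (hypothetical names, not the adapter's code): each enqueue builds its own local vector of dependency events, taking a per-buffer mutex only while reading the shared last-writer state, rather than sharing one cached vector under a long-held lock.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct Event { int Id; };

struct Buffer {
  std::mutex Mutex; // guards LastEventWritingToMemObj across devices
  Event *LastEventWritingToMemObj = nullptr;
};

// A fresh vector is allocated per enqueue; only the reads of shared state
// are serialized, so concurrent enqueues contend only per buffer.
std::vector<Event *> collectDepEvents(const std::vector<Buffer *> &Args) {
  std::vector<Event *> DepEvents;
  for (Buffer *B : Args) {
    std::lock_guard<std::mutex> Guard(B->Mutex);
    if (B->LastEventWritingToMemObj)
      DepEvents.push_back(B->LastEventWritingToMemObj);
  }
  return DepEvents;
}
```

The per-enqueue allocation is the cost mentioned above; the alternative (a shared cached vector) would have to stay locked for the whole enqueue, which the comment reports was measurably slower.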

@JackAKirk (Contributor) commented Oct 2, 2023

Fixes #7808

againull pushed a commit that referenced this pull request on Oct 23, 2023:

> The `sycl::context` does not map clearly to a native context for the CUDA and HIP backends. This is especially true now that we are adding support for multi-device contexts (#10737). It would be good to start this deprecation process. PRs to oneMKL and oneDNN to follow.
@jinz2014 (Contributor) commented Jan 8, 2024

@sergey-semenov Could you please review the PR? Thanks.

@hdelan (Contributor, Author) commented Jan 9, 2024

Hi @jinz2014, this PR is outdated. I will make a new PR for the CUDA adapter in UR in about a month or so.
