
Wrap and test peer to peer memory copies #1284

Merged (2 commits) on Jan 26, 2022

Conversation

@kshyatt (Contributor) commented Dec 19, 2021

Fixes #1136. We might need a check that the GPUs in question support peer copies?
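A minimal sketch of such a check, assuming the generated driver wrapper CUDA.cuDeviceCanAccessPeer is available (the helper name is illustrative, not the final API):

using CUDA

# Sketch only: ask the driver whether `dst` can directly access memory on `src`.
# cuDeviceCanAccessPeer writes 1 into the flag if direct peer access is possible.
function can_access_peer(dst::CuDevice, src::CuDevice)
    flag = Ref{Cint}(0)
    CUDA.cuDeviceCanAccessPeer(flag, dst, src)
    return flag[] == 1
end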

kshyatt requested a review from maleadt on December 19, 2021 at 16:50
@codecov (bot) commented Dec 19, 2021

Codecov Report

Merging #1284 (f1beb21) into master (64c5ca8) will increase coverage by 0.34%.
The diff coverage is 20.00%.

❗ Current head f1beb21 differs from the pull request's most recent head b143aff. Consider uploading reports for commit b143aff to get more accurate results.

@@            Coverage Diff             @@
##           master    #1284      +/-   ##
==========================================
+ Coverage   78.29%   78.63%   +0.34%     
==========================================
  Files         119      119              
  Lines        8648     8642       -6     
==========================================
+ Hits         6771     6796      +25     
+ Misses       1877     1846      -31     
Impacted Files Coverage Δ
lib/cudadrv/memory.jl 79.52% <20.00%> (-4.50%) ⬇️
lib/cudadrv/CUDAdrv.jl 47.05% <0.00%> (-31.28%) ⬇️
src/initialization.jl 65.11% <0.00%> (-4.66%) ⬇️
lib/cudadrv/module.jl 71.87% <0.00%> (-0.86%) ⬇️
src/linalg.jl 80.00% <0.00%> (-0.71%) ⬇️
src/compiler/reflection.jl 93.33% <0.00%> (-0.70%) ⬇️
src/sorting.jl 24.14% <0.00%> (-0.35%) ⬇️
src/texture.jl 88.63% <0.00%> (-0.26%) ⬇️
lib/cudadrv/stream.jl 93.44% <0.00%> (-0.11%) ⬇️
lib/cudadrv/pool.jl 96.96% <0.00%> (-0.09%) ⬇️
... and 14 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 64c5ca8...b143aff.

@maleadt (Member) commented Dec 20, 2021

cuDeviceCanAccessPeer? Maybe that then requires cuCtxEnablePeerAccess?
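Roughly, and only as a sketch (assuming the raw wrapper CUDA.cuCtxEnablePeerAccess and a primary-context lookup are available; the helper name is illustrative), enabling access towards a peer could look like:

# Sketch only: enable peer access from the *current* context to `peer`'s primary
# context. cuCtxEnablePeerAccess takes the peer context plus a flags value of 0.
function enable_peer_access!(peer::CuDevice)
    peer_ctx = CuContext(CuPrimaryContext(peer))  # retain the peer's primary context
    CUDA.cuCtxEnablePeerAccess(peer_ctx, 0)
    return nothing
end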

@kshyatt (Contributor, Author) commented Jan 17, 2022

OK, so some offline discussion showed that if both devices have UVA enabled, we can just use cudaMemcpy after enabling peer access. Otherwise, we have to use cuMemcpyPeerAsync.
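A rough sketch of that dispatch, not the exact code in this PR (the unified_addressing query and the context(dev) device-to-context lookup are assumed helpers):

# Sketch only: copy `bytes` bytes from `src` on `src_dev` to `dst` on `dst_dev`.
function copy_between_devices!(dst::CuPtr, dst_dev::CuDevice,
                               src::CuPtr, src_dev::CuDevice, bytes::Integer)
    if unified_addressing(dst_dev) && unified_addressing(src_dev)
        # UVA: both pointers live in one address space, a plain async memcpy works
        CUDA.cuMemcpyAsync(dst, src, bytes, stream())
    else
        # no UVA: spell out the owning contexts with the Peer variant
        CUDA.cuMemcpyPeerAsync(dst, context(dst_dev), src, context(src_dev),
                               bytes, stream())
    end
    return dst
end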

@maleadt (Member) commented Jan 17, 2022

Nice! I'll try this out later this week. We should add a multi-GPU CI bot to test this properly. I take it this doesn't depend on P2P?

I also wonder if we should just require UVA -- are there systems without it? -- and then just cuMemcpy, letting CUDA figure out all the pesky details.

@kshyatt (Contributor, Author) commented Jan 17, 2022

It looks like UVA was added back in CUDA 4.0 (2011), so I'd be very surprised if anyone were still using GPUs that don't support it, but I don't know.

@maleadt (Member) commented Jan 24, 2022

Even a Jetson Nano supports it:

julia> dev = device()
CuDevice(0): NVIDIA Tegra X1

julia> attribute(dev, CUDA.DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING)
1

I also want to verify on Windows whether it's only supported on TCC or not.
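If it helps, the driver-model query would presumably be just another attribute lookup (the constant name below is an assumption, mirroring the UVA one):

julia> attribute(device(), CUDA.DEVICE_ATTRIBUTE_TCC_DRIVER)  # assumed name; 1 under TCC, 0 under WDDM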

@maleadt (Member) commented Jan 24, 2022

On Windows too:

julia> dev = device()
CuDevice(0): NVIDIA GeForce GTX 970

julia> attribute(dev, CUDA.DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING)
1

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

So I propose we switch to mandatory UVA and simplify our copy methods as part of this PR. I can have a look at this.

@maleadt (Member) commented Jan 25, 2022

Let's hold off on requiring UVA.

I've had a look at the PR, but am running into a couple of issues. For one, copies between devices that have UVA but not P2P don't work:

julia> device(a)
CuDevice(0): NVIDIA A100-PCIE-40GB

julia> device(b)
CuDevice(1): Tesla V100-PCIE-32GB

julia> unified_addressing(device(a))
true

julia> unified_addressing(device(b))
true

julia> copyto!(a, b)
ERROR: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/error.jl:91
  [2] macro expansion
    @ ~/Julia/pkg/CUDA/lib/cudadrv/error.jl:101 [inlined]
  [3] cuMemcpyAsync(dst::CuPtr{Float32}, src::CuPtr{Float32}, ByteCount::Int64, hStream::CuStream)
    @ CUDA ~/Julia/pkg/CUDA/lib/utils/call.jl:26

@kshyatt did you test this?

To use the Peer version of the call, this PR currently calls context(::CuPtr), but that doesn't work with allocations from the stream-ordered allocator (where the returned context is NULL). We could query the owning device instead, but device-to-context translation is currently handled by CUDA.jl, not by its CUDAdrv.jl submodule (because I wanted to retain raw context management). Maybe I should revisit that...
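One way to sketch that owning-device query, assuming the generated wrapper CUDA.cuPointerGetAttribute and the POINTER_ATTRIBUTE_DEVICE_ORDINAL constant are exposed under these names (the helper itself is illustrative):

# Sketch only: recover the device that owns an allocation, including
# stream-ordered ones, instead of going through context(::CuPtr).
function owning_device(ptr::CuPtr)
    ordinal = Ref{Cint}(0)
    CUDA.cuPointerGetAttribute(ordinal, CUDA.POINTER_ATTRIBUTE_DEVICE_ORDINAL, ptr)
    return CuDevice(ordinal[])
end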

@maleadt (Member) commented Jan 26, 2022

I'm thoroughly confused now; it's poorly documented whether cu(da)Memcpy can work across multiple devices, and some testing here on UVA- as well as P2P-capable GPUs suggests it can't. I've just made it use the Peer version now, which is also what the bandwidth sample does: https://github.com/zchee/cuda-sample/blob/05555eef0d49ebdde999f5430f185a225ef00dcd/1_Utilities/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest.cu#L120

Co-authored-by: Katharine Hyatt <khyatt@flatironinstitute.org>
Merging this pull request may close: Moving data between devices.