
Wrap and test peer to peer memory copies #1284

Merged (2 commits) on Jan 26, 2022

Conversation

@kshyatt (Contributor) commented Dec 19, 2021

Fixes #1136. We might need a check that the GPUs in question support peer copies?
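A minimal sketch of such a check, assuming the generated driver wrapper CUDA.cuDeviceCanAccessPeer is available (the helper name is illustrative, not the final API):

using CUDA

# Sketch only: ask the driver whether `dst` can directly access memory on `src`.
# cuDeviceCanAccessPeer writes 1 into the flag if direct peer access is possible.
function can_access_peer(dst::CuDevice, src::CuDevice)
    flag = Ref{Cint}(0)
    CUDA.cuDeviceCanAccessPeer(flag, dst, src)
    return flag[] == 1
end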

kshyatt requested a review from maleadt on December 19, 2021 at 16:50
@codecov (bot) commented Dec 19, 2021

Codecov Report

Merging #1284 (f1beb21) into master (64c5ca8) will increase coverage by 0.34%.
The diff coverage is 20.00%.

❗ Current head f1beb21 differs from the pull request's most recent head b143aff. Consider uploading reports for commit b143aff to get more accurate results.

@@            Coverage Diff             @@
##           master    #1284      +/-   ##
==========================================
+ Coverage   78.29%   78.63%   +0.34%     
==========================================
  Files         119      119              
  Lines        8648     8642       -6     
==========================================
+ Hits         6771     6796      +25     
+ Misses       1877     1846      -31     
Impacted Files Coverage Δ
lib/cudadrv/memory.jl 79.52% <20.00%> (-4.50%) ⬇️
lib/cudadrv/CUDAdrv.jl 47.05% <0.00%> (-31.28%) ⬇️
src/initialization.jl 65.11% <0.00%> (-4.66%) ⬇️
lib/cudadrv/module.jl 71.87% <0.00%> (-0.86%) ⬇️
src/linalg.jl 80.00% <0.00%> (-0.71%) ⬇️
src/compiler/reflection.jl 93.33% <0.00%> (-0.70%) ⬇️
src/sorting.jl 24.14% <0.00%> (-0.35%) ⬇️
src/texture.jl 88.63% <0.00%> (-0.26%) ⬇️
lib/cudadrv/stream.jl 93.44% <0.00%> (-0.11%) ⬇️
lib/cudadrv/pool.jl 96.96% <0.00%> (-0.09%) ⬇️
... and 14 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 64c5ca8...b143aff.

@maleadt (Member) commented Dec 20, 2021

cuDeviceCanAccessPeer? Maybe that then requires cuCtxEnablePeerAccess?
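Roughly, and only as a sketch (assuming the raw wrapper CUDA.cuCtxEnablePeerAccess and a primary-context lookup are available; the helper name is illustrative), enabling access towards a peer could look like:

# Sketch only: enable peer access from the *current* context to `peer`'s primary
# context. cuCtxEnablePeerAccess takes the peer context plus a flags value of 0.
function enable_peer_access!(peer::CuDevice)
    peer_ctx = CuContext(CuPrimaryContext(peer))  # retain the peer's primary context
    CUDA.cuCtxEnablePeerAccess(peer_ctx, 0)
    return nothing
end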

@kshyatt (Contributor, Author) commented Jan 17, 2022

OK, so some offline discussion showed that if both devices have UVA enabled, we can just use cudaMemcpy after enabling peer access. Otherwise, we have to use cuMemcpyPeerAsync.
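A rough sketch of that dispatch, not the exact code in this PR (the unified_addressing query and the context(dev) device-to-context lookup are assumed helpers):

# Sketch only: copy `bytes` bytes from `src` on `src_dev` to `dst` on `dst_dev`.
function copy_between_devices!(dst::CuPtr, dst_dev::CuDevice,
                               src::CuPtr, src_dev::CuDevice, bytes::Integer)
    if unified_addressing(dst_dev) && unified_addressing(src_dev)
        # UVA: both pointers live in one address space, a plain async memcpy works
        CUDA.cuMemcpyAsync(dst, src, bytes, stream())
    else
        # no UVA: spell out the owning contexts with the Peer variant
        CUDA.cuMemcpyPeerAsync(dst, context(dst_dev), src, context(src_dev),
                               bytes, stream())
    end
    return dst
end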

@maleadt (Member) commented Jan 17, 2022

Nice! I'll try this out later this week. We should add a multi-GPU CI bot to test this properly. I take it this doesn't depend on P2P?

I also wonder if we should just require UVA -- are there systems without it? -- and then just cuMemcpy, letting CUDA figure out all the pesky details.

@kshyatt (Contributor, Author) commented Jan 17, 2022

It looks like UVA was added back in CUDA 4.0 (2011), so I'd be very surprised if anyone were still using GPUs that don't support it, but I don't know.

@maleadt (Member) commented Jan 24, 2022

Even a Jetson Nano supports it:

julia> dev = device()
CuDevice(0): NVIDIA Tegra X1

julia> attribute(dev, CUDA.DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING)
1

I also want to verify on Windows whether it's only supported on TCC or not.
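If it helps, the driver-model query would presumably be just another attribute lookup (the constant name below is an assumption, mirroring the UVA one):

julia> attribute(device(), CUDA.DEVICE_ATTRIBUTE_TCC_DRIVER)  # assumed name; 1 under TCC, 0 under WDDM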

@maleadt (Member) commented Jan 24, 2022

On Windows too:

julia> dev = device()
CuDevice(0): NVIDIA GeForce GTX 970

julia> attribute(dev, CUDA.DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING)
1

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

So I propose we switch to mandatory UVA and simplify our copy methods as part of this PR. I can have a look at this.

@maleadt (Member) commented Jan 25, 2022

Let's hold off on requiring UVA.

I've had a look at the PR, but am running into a couple of issues. For one, copies between devices that have UVA but not P2P don't work:

julia> device(a)
CuDevice(0): NVIDIA A100-PCIE-40GB

julia> device(b)
CuDevice(1): Tesla V100-PCIE-32GB

julia> unified_addressing(device(a))
true

julia> unified_addressing(device(b))
true

julia> copyto!(a, b)
ERROR: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/error.jl:91
  [2] macro expansion
    @ ~/Julia/pkg/CUDA/lib/cudadrv/error.jl:101 [inlined]
  [3] cuMemcpyAsync(dst::CuPtr{Float32}, src::CuPtr{Float32}, ByteCount::Int64, hStream::CuStream)
    @ CUDA ~/Julia/pkg/CUDA/lib/utils/call.jl:26

@kshyatt did you test this?

To use the Peer version of the call, this PR currently calls context(::CuPtr), but that doesn't work with allocations from the stream-ordered allocator (where the returned context is NULL). We could query the owning device instead, but device-to-context translation is currently handled by CUDA.jl, not by its CUDAdrv.jl submodule (because I wanted to retain raw context management). Maybe I should revisit that...
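One way to sketch that owning-device query, assuming the generated wrapper CUDA.cuPointerGetAttribute and the POINTER_ATTRIBUTE_DEVICE_ORDINAL constant are exposed under these names (the helper itself is illustrative):

# Sketch only: recover the device that owns an allocation, including
# stream-ordered ones, instead of going through context(::CuPtr).
function owning_device(ptr::CuPtr)
    ordinal = Ref{Cint}(0)
    CUDA.cuPointerGetAttribute(ordinal, CUDA.POINTER_ATTRIBUTE_DEVICE_ORDINAL, ptr)
    return CuDevice(ordinal[])
end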

@maleadt (Member) commented Jan 26, 2022

I'm thoroughly confused now; it's poorly documented whether cu(da)Memcpy can work across multiple devices, and some testing here on UVA- as well as P2P-capable GPUs suggests it can't. I've just made it use the Peer version now, which is also what the bandwidth sample does: https://github.com/zchee/cuda-sample/blob/05555eef0d49ebdde999f5430f185a225ef00dcd/1_Utilities/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest.cu#L120

Co-authored-by: Katharine Hyatt <khyatt@flatironinstitute.org>
Merging this pull request may close: Moving data between devices.