Remove eager synchronization with HtoD copies. #2625

maleadt · 2025-01-17T13:37:07Z

We assumed that copies from unpinned memory would always synchronize, but that does not seem to be the case. For some copy sizes (and potentially on some memory architectures, e.g. coherent ones) the copy is fully asynchronous.

This optimization was made to make `CuRef` of a scalar fully async. I considered making the `CuRef` constructor call `memset` instead, which is always asynchronous by virtue of passing the memory by value. However, that does not support 64-bit floats, whereas a `memcpy` of 64 bits is still executed fully asynchronously.

Demo script:

using CUDA, NVTX

function main()
    function doit()
        A = NVTX.@range "rand" CUDA.rand(4096, 4096)
        B = NVTX.@range "mul" A*A
        c = NVTX.@range "ref" CuRef{Float64}(1)
        synchronize()
    end

    NVTX.@range "run 1" doit()
    NVTX.@range "run 2" doit()
end

main()

Before: [image]

After: [image]

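One way to check this behaviour on a given system is to queue some GPU work and then time how long the host spends inside an unpinned host-to-device copy. The following is a minimal sketch and not part of this PR: `time_htod_copy` is a hypothetical helper, and the matrix and buffer sizes are arbitrary. If the copy is asynchronous, `t_copy` should be small compared with `t_sync`; if the copy blocks, most of the wait shows up in `t_copy` instead.

```julia
using CUDA

# Hypothetical helper (not part of CUDA.jl): measure how long the host is held up
# when an unpinned host-to-device copy is issued behind already-queued GPU work.
function time_htod_copy(n::Integer)
    A = CUDA.rand(Float32, 4096, 4096)
    host = rand(Float32, n)              # ordinary, unpinned host memory
    dev = CuArray{Float32}(undef, n)
    synchronize()                        # start from an idle stream

    A * A                                # queue work; returns before the GPU is done
    t_copy = Base.@elapsed copyto!(dev, host)   # host time spent in the copy call
    t_sync = Base.@elapsed synchronize()        # time until the stream has drained
    return (; t_copy, t_sync)
end

if CUDA.functional()
    time_htod_copy(1)                    # warm-up run, includes compilation
    @show time_htod_copy(1)              # a single element, similar to the `CuRef` case
    @show time_htod_copy(2^24)           # a larger transfer, for comparison
end
```

The timings are host-side wall-clock measurements, so they are only a rough indication; a profiler trace with NVTX ranges, as in the demo script, gives a clearer picture.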

github-actions · 2025-01-17T13:37:40Z

src/array.jl

+    # the copy below may block in `libcuda`, so it'd be good to perform a nonblocking
+    # synchronization here, but the exact cases are hard to know and detect (e.g., unpinned
+    # memory normally blocks, but not for all sizes, and not on all memory architectures).


Suggested change

# the copy below may block in `libcuda`, so it'd be good to perform a nonblocking
# synchronization here, but the exact cases are hard to know and detect (e.g., unpinned
# memory normally blocks, but not for all sizes, and not on all memory architectures).

github-actions · 2025-01-17T13:37:40Z

src/array.jl

-      is_pinned(pointer(dest)) || synchronize()
-    end
-
+    # the copy below may block in `libcuda`; see the note above.


Suggested change

# the copy below may block in `libcuda`; see the note above.

codecov · 2025-01-17T20:08:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.59%. Comparing base (d07a245) to head (9346e4f).
Report is 1 commit behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2625      +/-   ##
==========================================
- Coverage   73.60%   73.59%   -0.01%     
==========================================
  Files         157      157              
  Lines       15230    15226       -4     
==========================================
- Hits        11210    11206       -4     
  Misses       4020     4020


github-actions bot reviewed Jan 17, 2025

maleadt mentioned this pull request Jan 17, 2025

RFC: Use non-blocking device side pointer mode in CUBLAS, with fallbacks #2616

Open

maleadt merged commit 3d45d85 into master Jan 17, 2025
3 checks passed

maleadt deleted the tb/async_mempcy branch January 17, 2025 20:09
