Add execution policy thrust::cuda::par_nosync
#1568
Can one of the admins verify this patch?
Thanks for the PR! I'll review it soon.
I believe it should behave the same as on host. Is there a reason that it can't?
291de6e to fd96c92
I guess it's just that I am not very familiar with dynamic parallelism. For example, I have never seen However, I got it working and also found a bug. Synchronization can't be skipped when followed by a call to This of course means that there is an "unnecessary" synchronization call in the host path. Maybe one could instead add synchronization in the device-path of Performance seems to be improved with dynamic parallelism the same as on the host, judging from the kernel execution time. But my profiler does not show in-kernel synchronization events, so it's hard to tell for certain whether synchronization is skipped or not.
This is great, thanks for the PR.
@allisonvacanti we should probably prioritize this fairly highly, as a lot of folks have been asking for this. @fkallen does this include documentation for this as well?
@brycelelbach Currently it's only code changes. No separate documentation, and no tests either except for the toy example above.
I'll do a full review soon, hopefully this week. I'm planning to include this in the next release.
We should avoid syncing for CDP +
Good catch! I think most folks are interested in using this policy with
The existing execution policies are documented here: https://github.com/NVIDIA/thrust/blob/main/thrust/execution_policy.h I think it would be sufficient to just say that
Just to be clear, we don't need to expose the
LGTM -- thanks for submitting this! I tested it out locally and it works nicely. I'll start CI and get this merged in the next week or so. DVS CL: 30735025 run tests
…from a thrust call before the kernels have completed
f980eb1 to 3dfdcff
Rebased to resolve conflicts -- this is ready to merge! Thanks again @fkallen!
Thank you!
Version 1.16 of Thrust adds the policy thrust::cuda::par_nosync, which accepts a stream argument and does not synchronize, thus avoiding a stall while the CPU waits to learn that the kernel has completed before launching its next operation. NVIDIA/thrust#1568 This feature (not blocking when an algorithm does not require it) had been removed as a breaking change in Thrust-1.9.4 to simplify error handling behavior and because a futures-based async interface had been deemed sufficient. This issue describes the history and rationale for the new par_nosync feature. NVIDIA/thrust#1515
This PR adds functionality requested in #1515:
cuda_cub::synchronize_optional(policy), which may or may not synchronize the stream, depending on the policy.
thrust::cuda::par_nosync, which does not perform optional synchronization.
Replaces cuda_cub::synchronize at the end of an algorithm by cuda_cub::synchronize_optional.
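The policy-dependent synchronization described above can be sketched in plain C++ as a compile-time dispatch on the policy type. This is only an illustrative model under stated assumptions, not Thrust's actual implementation: the names par_policy, par_nosync_policy, needs_optional_sync, fake_stream and toy_algorithm are hypothetical stand-ins, and the "stream" is a simple counter rather than a CUDA stream.

```cpp
#include <type_traits>

// Hypothetical stand-ins for execution policies (not Thrust's real types).
struct par_policy {};         // models a policy like thrust::cuda::par
struct par_nosync_policy {};  // models a policy like thrust::cuda::par_nosync

// Hypothetical stand-in for a stream: it only counts synchronizations.
struct fake_stream {
    int sync_count = 0;
};

// Trait (assumed name): does the policy require the optional sync?
// By default every policy synchronizes; the nosync policy opts out.
template <class Policy>
struct needs_optional_sync : std::true_type {};

template <>
struct needs_optional_sync<par_nosync_policy> : std::false_type {};

// Models an unconditional synchronization of the stream.
template <class Policy>
void synchronize(const Policy&, fake_stream& stream) {
    ++stream.sync_count;
}

// Models the optional synchronization: it forwards to synchronize()
// only when the policy asks for it.
template <class Policy>
void synchronize_optional(const Policy& policy, fake_stream& stream) {
    if (needs_optional_sync<Policy>::value) {
        synchronize(policy, stream);
    }
}

// Toy algorithm: work would be launched here, then the algorithm ends
// with the optional synchronization instead of an unconditional one.
template <class Policy>
void toy_algorithm(const Policy& policy, fake_stream& stream) {
    // ... kernel launch would go here ...
    synchronize_optional(policy, stream);
}
```

In this model, calling toy_algorithm with par_policy increments the stream's counter, while calling it with par_nosync_policy leaves the counter unchanged. A caller using the nosync policy therefore has to synchronize explicitly before reading results.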
Open question:
synchronize_optional currently does not skip synchronization in device code. Should it stay like this?
I profiled the following example program to verify that no optional synchronization is performed.
(edit: updated code)