This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Add execution policy thrust::cuda::par_nosync #1568

Merged: 5 commits into NVIDIA:main on Dec 14, 2021

Conversation

fkallen
Contributor

@fkallen fkallen commented Nov 12, 2021

This PR adds functionality requested in #1515

  • add cuda_cub::synchronize_optional(policy) which may or may not synchronize the stream, depending on the policy
  • add execution policy thrust::cuda::par_nosync which does not perform optional synchronization
  • replace each call to cuda_cub::synchronize at the end of an algorithm with cuda_cub::synchronize_optional

Open question: synchronize_optional currently does not skip synchronization in device code. Should it stay like this?

I profiled the following example program to verify that no optional synchronization is performed.

#include <iostream>
#include <cstdint>
#include <cassert>

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <thrust/reverse.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

template<class ExecPolicy>
void executeHost(ExecPolicy policy, std::size_t N){
    thrust::device_vector<std::size_t> d_vec(N);
    thrust::sequence(policy, d_vec.begin(), d_vec.end(), 1);
    thrust::reverse(policy, d_vec.begin(), d_vec.end());

    constexpr int numiters = 15;
    for(int i = 0; i < numiters; i++){
        thrust::for_each(policy, d_vec.begin(), d_vec.end(), [] __host__ __device__ (std::size_t& x){ x = x + 1;});
    }

    std::size_t x = thrust::reduce(policy, d_vec.begin(), d_vec.end());

    cudaDeviceSynchronize();
    assert(cudaSuccess == cudaGetLastError());
    std::size_t expected = (numiters * N) + (N * (N+1)) / 2;
    assert(x == expected);
}

__global__ 
void synckernel(std::size_t* d_vec, std::size_t N){
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    auto policy = thrust::cuda::par.on(stream);
    thrust::sequence(policy, d_vec, d_vec + N, 1);
    thrust::reverse(policy, d_vec, d_vec + N);

    constexpr int numiters = 15;
    for(int i = 0; i < numiters; i++){
        thrust::for_each(policy, d_vec, d_vec + N, [] (std::size_t& x){ x = x + 1;});
    }

    std::size_t x = thrust::reduce(policy, d_vec, d_vec + N);

    cudaDeviceSynchronize();
    assert(cudaSuccess == cudaGetLastError());
    std::size_t expected = (numiters * N) + (N * (N+1)) / 2;
    assert(x == expected);

    cudaStreamDestroy(stream);
}

__global__ 
void nosynckernel(std::size_t* d_vec, std::size_t N){
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    auto policy = thrust::cuda::par_nosync.on(stream);
    thrust::sequence(policy, d_vec, d_vec + N, 1);
    thrust::reverse(policy, d_vec, d_vec + N);

    constexpr int numiters = 15;
    for(int i = 0; i < numiters; i++){
        thrust::for_each(policy, d_vec, d_vec + N, [] (std::size_t& x){ x = x + 1;});
    }

    std::size_t x = thrust::reduce(policy, d_vec, d_vec + N);

    cudaDeviceSynchronize();
    assert(cudaSuccess == cudaGetLastError());
    std::size_t expected = (numiters * N) + (N * (N+1)) / 2;
    assert(x == expected);

    cudaStreamDestroy(stream);
}

void executesynckernel(std::size_t N, cudaStream_t stream){
    thrust::device_vector<std::size_t> d_vec(N);

    synckernel<<<1,1, 0, stream>>>(
        thrust::raw_pointer_cast(d_vec.data()),
        N
    );

    cudaDeviceSynchronize();
    assert(cudaSuccess == cudaGetLastError());
}

void executenosynckernel(std::size_t N, cudaStream_t stream){
    thrust::device_vector<std::size_t> d_vec(N);

    nosynckernel<<<1,1, 0, stream>>>(
        thrust::raw_pointer_cast(d_vec.data()),
        N
    );

    cudaDeviceSynchronize();
    assert(cudaSuccess == cudaGetLastError());
}

int main(){
    std::size_t N = 1'000'000;

    cudaStream_t stream = cudaStreamPerThread;

    auto blockingpolicy = thrust::cuda::par.on(stream);
    executeHost(blockingpolicy, N);

    auto nosyncpolicy = thrust::cuda::par_nosync.on(stream);
    executeHost(nosyncpolicy, N);

    auto nosyncpolicydefaultstream = thrust::cuda::par_nosync;
    executeHost(nosyncpolicydefaultstream, N);

    cudaDeviceSynchronize();
    assert(cudaSuccess == cudaGetLastError());

    executenosynckernel(N, stream);
    executesynckernel(N, stream);
}

(edit: updated code)
[Screenshot: Screenshot_thrust_nosync (profiler output)]

@GPUtester
Collaborator

Can one of the admins verify this patch?

@alliepiper
Collaborator

Thanks for the PR! I'll review it soon.

Open question: synchronize_optional currently does not skip synchronization in device code. Should it stay like this?

I believe it should behave the same as on host. Is there a reason that it can't?

@alliepiper alliepiper linked an issue Nov 15, 2021 that may be closed by this pull request
@alliepiper alliepiper self-assigned this Nov 15, 2021
@alliepiper alliepiper added the P1: should have Necessary, but not critical. label Nov 15, 2021
@alliepiper alliepiper added this to the 1.16.0 milestone Nov 15, 2021
@fkallen
Contributor Author

fkallen commented Nov 22, 2021

I guess it's just that I am not very familiar with dynamic parallelism. For example, I have never seen __cudaDeviceSynchronizeDeprecationAvoidance before.

However, I got it working and also found a bug. Synchronization can't be skipped when followed by a call to get_value. While stream ordering ensures on the host that the result has been calculated before the transfer, that may not be the case on the device, since get_value there only dereferences a pointer. So I reverted to the usual synchronization method for the corresponding algorithms.

This of course means that there is an "unnecessary" synchronization call in the host path. One could instead add synchronization in the device path of get_value before dereferencing, but I cannot assess the implications of this, so I am leaving it as is.
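The distinction can be illustrated with a hedged sketch (the vector, stream, and function name are assumptions for the example, not code from the PR): algorithms that only enqueue work can skip the final synchronization under par_nosync, while algorithms that return a value to the caller, such as thrust::reduce, must still wait for the result before reading it back.

```cpp
// Sketch only: shows which calls may skip synchronization under par_nosync.
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

void sketch(thrust::device_vector<int>& v, cudaStream_t stream) {
    auto policy = thrust::cuda::par_nosync.on(stream);

    // No value is returned: work is merely enqueued on `stream`, and the
    // blocking synchronization at the end of the algorithm can be skipped.
    thrust::for_each(policy, v.begin(), v.end(),
                     [] __device__ (int& x) { ++x; });

    // A value is returned to the host: the implementation must wait for the
    // result before transferring it back (the get_value path), so this call
    // synchronizes even under par_nosync.
    int sum = thrust::reduce(policy, v.begin(), v.end());
    (void)sum;
}
```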

Judging from the kernel execution times, dynamic parallelism shows the same performance improvement as the host path. However, my profiler does not show in-kernel synchronization events, so it's hard to tell for certain whether synchronization is actually skipped.

[Screenshot: thrust_nosync_dp (profiler output)]

@brycelelbach
Collaborator

This is great, thanks for the PR.

@brycelelbach
Collaborator

@allisonvacanti we should probably prioritize this fairly highly, as a lot of folks have been asking for this.

@fkallen does this include documentation for this as well?

@fkallen
Contributor Author

fkallen commented Nov 24, 2021

@brycelelbach Currently it's only code changes. No separate documentation, and no tests either except for the toy example above.
Is there documentation for the usual thrust::cuda::par which I could use as a reference?

@alliepiper
Collaborator

I'll do a full review soon, hopefully this week. I'm planning to include this in the next release.

I guess it's just that I am not very familiar with dynamic parallelism. For example, I have never seen __cudaDeviceSynchronizeDeprecationAvoidance before.

We should avoid syncing for CDP + par_nosync use cases. The "DeprecationAvoidance" function just supports some testing use cases; you can safely ignore it -- just call the cuda::detail::device_synchronize() method and it should do the right thing.

However, I got it working and also found a bug. Synchronization can't be skipped when followed by a call to get_value. While stream ordering ensures on the host that the result has been calculated before the transfer, that may not be the case on the device, since get_value there only dereferences a pointer. So I reverted to the usual synchronization method for the corresponding algorithms.

This of course means that there is an "unnecessary" synchronization call in the host path. One could instead add synchronization in the device path of get_value before dereferencing, but I cannot assess the implications of this, so I am leaving it as is.

Good catch! I think most folks are interested in using this policy with thrust::transform, so this workaround sounds good to me. We can look at further optimizations later if needed.

Is there documentation for the usual thrust::cuda::par which I could use as reference?

The existing execution policies are documented here: https://github.com/NVIDIA/thrust/blob/main/thrust/execution_policy.h

I think it would be sufficient to just say that par_nosync behaves like par, but with some exceptions, and list out the ways in which it differs. Consider adding a new example that demonstrates how to use this safely, too; that would be very useful.

@alliepiper
Collaborator

The existing execution policies are documented here

Just to be clear, we don't need to expose the par_nosync policy in thrust/execution_policy.h, it can stay in thrust/system/cuda/... and just needs a doxygen comment.
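A doxygen comment along these lines might work (a sketch only; the wording below is an assumption, not the documentation that was merged):

```cpp
/*! \p thrust::cuda::par_nosync is a parallel execution policy targeting
 *  Thrust's CUDA device backend. It behaves like \p thrust::cuda::par,
 *  except that algorithms may return to the caller before the enqueued
 *  work completes, skipping the blocking stream synchronization normally
 *  performed at the end of each algorithm. The caller is responsible for
 *  synchronizing (e.g. with \p cudaStreamSynchronize) before reading any
 *  results. Algorithms that return a value to the caller still
 *  synchronize internally.
 */
```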

@alliepiper
Collaborator

LGTM -- thanks for submitting this! I tested it out locally and it works nicely.

I'll start CI and get this merged in the next week or so.

DVS CL: 30735025

run tests

@alliepiper alliepiper added testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Dec 3, 2021
@alliepiper
Collaborator

Rebased to resolve conflicts -- this is ready to merge! Thanks again @fkallen!

@alliepiper alliepiper merged commit a462ff6 into NVIDIA:main Dec 14, 2021
@jedbrown

Thank you!

petscbot pushed a commit to petsc/petsc that referenced this pull request Jan 4, 2022
Version 1.16 of Thrust adds the policy thrust::cuda::par_nosync, which
can be combined with a stream via .on(stream) and does not synchronize,
thus preventing a stall in which the CPU waits to learn that a kernel
has completed before launching its next operation.

NVIDIA/thrust#1568

This feature (not blocking for kernels that don't need to) had been
removed (breaking change) in Thrust-1.9.4 to simplify error handling
behavior and because a futures-based async interface had been deemed
sufficient. This issue describes the history and rationale for the new
par_nosync feature.

NVIDIA/thrust#1515
@upsj upsj mentioned this pull request Aug 3, 2022