
[STF] reduce access mode #2830

Draft: wants to merge 35 commits into base: main

Conversation

caugonnet
Contributor

Description

closes

This PR intends to introduce a reduction access mode to make it much easier to write parallel_for kernels which also perform some reductions to a logical data.
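A hypothetical usage sketch of what such a reduction access mode could look like (illustrative only; the exact spelling of the API in this PR may differ, and `ctx`, `lX`, `lsum`, and `reducer::sum` are assumed names):

```cuda
// Illustrative sketch only: sum the elements of a logical data while it is
// traversed by a parallel_for, using a reduce() access mode instead of
// hand-written atomics or a separate reduction kernel.
context ctx;
auto lX   = ctx.logical_data(X);  // existing logical data
auto lsum = ctx.logical_data(shape_of<scalar_view<double>>());

ctx.parallel_for(lX.shape(), lX.read(), lsum.reduce(reducer::sum<double>{}))
    ->*[] __device__(size_t i, auto x, double& sum) {
  sum += x(i);  // each instance contributes; partial sums are combined by the runtime
};
```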

Checklist

  • [x] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.


copy-pr-bot bot commented Nov 15, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@caugonnet
Contributor Author

/ok to test

@@ -423,7 +452,7 @@ public:
     Fun&& f = mv(::std::get<2>(*p));
     const sub_shape_t& shape = ::std::get<3>(*p);

-    auto explode_coords = [&](size_t i, deps_t... data) {
+    auto explode_coords = [&](size_t i, typename deps_ops_t::first_type... data) {
Contributor Author

Need a comment here to explain what that type is (we can't introduce an alias for it because it's a parameter pack, not a tuple)

public:
// no-op operator
template <typename T>
static __host__ __device__ void apply_op(T&, const T&)
Contributor Author

We should not need this no-op operator.

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Nov 21, 2024

// arguments, or an owning local variable for reduction variables.
// extern __shared__ redux_buffer_tup_wrapper<tuple_args, tuple_ops> per_block_redux_buffer[];
extern __shared__ char dyn_buffer[];
auto* per_block_redux_buffer = (redux_buffer_tup_wrapper<tuple_args, tuple_ops>*) ((void*) dyn_buffer);
Contributor Author

This weirdness is due to the fact that extern symbols won't work when the same symbol is declared with different types, so we declare a single untyped char buffer and cast it.
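A minimal sketch of the conflict this cast avoids (illustrative only; `scale` and `raw` are assumed names): every `extern __shared__` declaration in a kernel refers to the same underlying symbol, so two template instantiations declaring it with different element types are rejected by the compiler. The usual workaround is one untyped buffer reinterpreted per instantiation.

```cuda
template <typename T>
__global__ void scale(T* out, T factor)
{
  // extern __shared__ T smem[];        // ill-formed once instantiated for two distinct Ts
  extern __shared__ char raw[];         // a single untyped symbol, shared by all instantiations
  T* smem = reinterpret_cast<T*>(raw);  // per-instantiation typed view
  smem[threadIdx.x] = factor;
  __syncthreads();
  out[threadIdx.x] = smem[threadIdx.x];
}
```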

// Write the block's result to the output array
if (tid == 0)
{
tuple_set_op<tuple_ops>(redux_buffer[blockIdx.x], per_block_redux_buffer[0].get());
Contributor Author

We should specialize this when there is only one block...
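A hypothetical sketch of that single-block specialization (`final_result` is an illustrative name, not from the diff): when `gridDim.x == 1` the per-block partial is already the final value, so the inter-block pass can be skipped.

```cuda
// Sketch only: fold the block's partial straight into the final accumulator
// when there is a single block, otherwise stage it for the inter-block pass.
if (tid == 0)
{
  if (gridDim.x == 1)
  {
    // single block: no second reduction pass needed
    tuple_set_op<tuple_ops>(final_result, per_block_redux_buffer[0].get());
  }
  else
  {
    tuple_set_op<tuple_ops>(redux_buffer[blockIdx.x], per_block_redux_buffer[0].get());
  }
}
```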


🟨 CI finished in 32m 02s: Pass: 88%/54 | Total: 10h 40m | Avg: 11m 51s | Max: 16m 04s | Hits: 90%/123
  • 🟨 cudax: Pass: 88%/54 | Total: 10h 40m | Avg: 11m 51s | Max: 16m 04s | Hits: 90%/123

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  88%/50  | Total:  9h 57m | Avg: 11m 56s | Max: 16m 04s | Hits:  90%/123   
      🟩 arm64              Pass: 100%/4   | Total: 43m 23s | Avg: 10m 50s | Max: 11m 38s
    🟨 ctk
      🟨 12.0               Pass:  84%/19  | Total:  3h 44m | Avg: 11m 48s | Max: 14m 42s
      🟩 12.5               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 15s
      🟨 12.6               Pass:  90%/33  | Total:  6h 45m | Avg: 12m 17s | Max: 16m 04s | Hits:  90%/123   
    🟨 cudacxx
      🟨 nvcc12.0           Pass:  84%/19  | Total:  3h 44m | Avg: 11m 48s | Max: 14m 42s
      🟩 nvcc12.5           Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 15s
      🟨 nvcc12.6           Pass:  90%/33  | Total:  6h 45m | Avg: 12m 17s | Max: 16m 04s | Hits:  90%/123   
    🟨 cxx
      🟩 Clang9             Pass: 100%/2   | Total: 23m 10s | Avg: 11m 35s | Max: 12m 19s
      🟩 Clang10            Pass: 100%/2   | Total: 23m 34s | Avg: 11m 47s | Max: 12m 26s
      🟩 Clang11            Pass: 100%/4   | Total: 45m 48s | Avg: 11m 27s | Max: 11m 46s
      🟩 Clang12            Pass: 100%/4   | Total: 47m 40s | Avg: 11m 55s | Max: 12m 22s
      🟩 Clang13            Pass: 100%/4   | Total: 47m 33s | Avg: 11m 53s | Max: 12m 44s
      🟨 Clang14            Pass:  75%/4   | Total: 49m 17s | Avg: 12m 19s | Max: 14m 42s
      🟩 Clang15            Pass: 100%/2   | Total: 25m 21s | Avg: 12m 40s | Max: 12m 50s
      🟩 Clang16            Pass: 100%/4   | Total: 45m 41s | Avg: 11m 25s | Max: 12m 22s
      🟩 Clang17            Pass: 100%/2   | Total: 25m 29s | Avg: 12m 44s | Max: 13m 01s
      🟨 Clang18            Pass:  50%/2   | Total: 28m 17s | Avg: 14m 08s | Max: 14m 59s
      🟩 GCC9               Pass: 100%/2   | Total: 26m 15s | Avg: 13m 07s | Max: 13m 52s
      🟩 GCC10              Pass: 100%/4   | Total: 48m 07s | Avg: 12m 01s | Max: 12m 38s
      🟩 GCC11              Pass: 100%/4   | Total: 48m 29s | Avg: 12m 07s | Max: 12m 42s
      🟨 GCC12              Pass:  57%/7   | Total:  1h 34m | Avg: 13m 33s | Max: 16m 04s
      🟩 GCC13              Pass: 100%/3   | Total: 31m 11s | Avg: 10m 23s | Max: 11m 38s
      🟥 MSVC14.36          Pass:   0%/1   | Total: 10m 59s | Avg: 10m 59s | Max: 10m 59s
      🟩 MSVC14.39          Pass: 100%/1   | Total:  8m 17s | Avg:  8m 17s | Max:  8m 17s | Hits:  90%/123   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 15s
    🟨 cxx_family
      🟨 Clang              Pass:  93%/30  | Total:  6h 01m | Avg: 12m 03s | Max: 14m 59s
      🟨 GCC                Pass:  85%/20  | Total:  4h 08m | Avg: 12m 26s | Max: 16m 04s
      🟨 MSVC               Pass:  50%/2   | Total: 19m 16s | Avg:  9m 38s | Max: 10m 59s | Hits:  90%/123   
      🟩 NVHPC              Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 15s
    🟨 cudacxx_family
      🟨 nvcc               Pass:  88%/54  | Total: 10h 40m | Avg: 11m 51s | Max: 16m 04s | Hits:  90%/123   
    🟨 gpu
      🟨 v100               Pass:  88%/54  | Total: 10h 40m | Avg: 11m 51s | Max: 16m 04s | Hits:  90%/123   
    🟨 jobs
      🟨 Build              Pass:  97%/49  | Total:  9h 25m | Avg: 11m 32s | Max: 14m 36s | Hits:  90%/123   
      🟥 Test               Pass:   0%/5   | Total:  1h 15m | Avg: 15m 00s | Max: 16m 04s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  8m 32s | Avg:  8m 32s | Max:  8m 32s
      🟩 90a                Pass: 100%/1   | Total:  9m 00s | Avg:  9m 00s | Max:  9m 00s
    🟨 std
      🟨 17                 Pass:  93%/29  | Total:  5h 40m | Avg: 11m 44s | Max: 16m 04s
      🟨 20                 Pass:  84%/25  | Total:  4h 59m | Avg: 11m 59s | Max: 15m 25s | Hits:  90%/123   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 54)

# Runner
43 linux-amd64-cpu16
5 linux-amd64-gpu-v100-latest-1
4 linux-arm64-cpu16
2 windows-amd64-cpu16

Labels
stf Sequential Task Flow programming model
Projects
Status: In Progress