
[STF] reduce access mode #2830

Draft: wants to merge 35 commits into base: main

Conversation

caugonnet
Contributor

Description

closes

This PR intends to introduce a reduction access mode to make it much easier to write parallel_for kernels which also perform some reductions to a logical data.
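A hypothetical usage sketch of what such a reduction access mode could look like (illustrative only; the exact spelling of the API in this PR may differ, and `ctx`, `lX`, `lsum`, and `reducer::sum` are assumed names):

```cuda
// Illustrative sketch only: sum the elements of a logical data while it is
// traversed by a parallel_for, using a reduce() access mode instead of
// hand-written atomics or a separate reduction kernel.
context ctx;
auto lX   = ctx.logical_data(X);  // existing logical data
auto lsum = ctx.logical_data(shape_of<scalar_view<double>>());

ctx.parallel_for(lX.shape(), lX.read(), lsum.reduce(reducer::sum<double>{}))
    ->*[] __device__(size_t i, auto x, double& sum) {
  sum += x(i);  // each instance contributes; partial sums are combined by the runtime
};
```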

Checklist

  • [x] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.


copy-pr-bot bot commented Nov 15, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@caugonnet
Contributor Author

/ok to test

@@ -423,7 +452,7 @@ public:
     Fun&& f = mv(::std::get<2>(*p));
     const sub_shape_t& shape = ::std::get<3>(*p);

-    auto explode_coords = [&](size_t i, deps_t... data) {
+    auto explode_coords = [&](size_t i, typename deps_ops_t::first_type... data) {
Contributor Author

Need a comment here to explain what that type is (we can't introduce an alias for it because it's a parameter pack, not a tuple)

public:
// no-op operator
template <typename T>
static __host__ __device__ void apply_op(T&, const T&)
Contributor Author

We should not need this no-op operator.

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Nov 21, 2024

// arguments, or an owning local variable for reduction variables.
// extern __shared__ redux_buffer_tup_wrapper<tuple_args, tuple_ops> per_block_redux_buffer[];
extern __shared__ char dyn_buffer[];
auto* per_block_redux_buffer = (redux_buffer_tup_wrapper<tuple_args, tuple_ops>*) ((void*) dyn_buffer);
Contributor Author

This weirdness is due to the fact that extern symbols won't work when the same symbol is declared with different types, so we declare a single untyped char buffer and cast it.
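A minimal sketch of the conflict this cast avoids (illustrative only; `scale` and `raw` are assumed names): every `extern __shared__` declaration in a kernel refers to the same underlying symbol, so two template instantiations declaring it with different element types are rejected by the compiler. The usual workaround is one untyped buffer reinterpreted per instantiation.

```cuda
template <typename T>
__global__ void scale(T* out, T factor)
{
  // extern __shared__ T smem[];        // ill-formed once instantiated for two distinct Ts
  extern __shared__ char raw[];         // a single untyped symbol, shared by all instantiations
  T* smem = reinterpret_cast<T*>(raw);  // per-instantiation typed view
  smem[threadIdx.x] = factor;
  __syncthreads();
  out[threadIdx.x] = smem[threadIdx.x];
}
```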

// Write the block's result to the output array
if (tid == 0)
{
tuple_set_op<tuple_ops>(redux_buffer[blockIdx.x], per_block_redux_buffer[0].get());
Contributor Author

We should specialize this when there is only one block...
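A hypothetical sketch of that single-block specialization (`final_result` is an illustrative name, not from the diff): when `gridDim.x == 1` the per-block partial is already the final value, so the inter-block pass can be skipped.

```cuda
// Sketch only: fold the block's partial straight into the final accumulator
// when there is a single block, otherwise stage it for the inter-block pass.
if (tid == 0)
{
  if (gridDim.x == 1)
  {
    // single block: no second reduction pass needed
    tuple_set_op<tuple_ops>(final_result, per_block_redux_buffer[0].get());
  }
  else
  {
    tuple_set_op<tuple_ops>(redux_buffer[blockIdx.x], per_block_redux_buffer[0].get());
  }
}
```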


🟨 CI finished in 32m 02s: Pass: 88%/54 | Total: 10h 40m | Avg: 11m 51s | Max: 16m 04s | Hits: 90%/123
  • 🟨 cudax: Pass: 88%/54 | Total: 10h 40m | Avg: 11m 51s | Max: 16m 04s | Hits: 90%/123

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  88%/50  | Total:  9h 57m | Avg: 11m 56s | Max: 16m 04s | Hits:  90%/123   
      🟩 arm64              Pass: 100%/4   | Total: 43m 23s | Avg: 10m 50s | Max: 11m 38s
    🟨 ctk
      🟨 12.0               Pass:  84%/19  | Total:  3h 44m | Avg: 11m 48s | Max: 14m 42s
      🟩 12.5               Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 15s
      🟨 12.6               Pass:  90%/33  | Total:  6h 45m | Avg: 12m 17s | Max: 16m 04s | Hits:  90%/123   
    🟨 cudacxx
      🟨 nvcc12.0           Pass:  84%/19  | Total:  3h 44m | Avg: 11m 48s | Max: 14m 42s
      🟩 nvcc12.5           Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 15s
      🟨 nvcc12.6           Pass:  90%/33  | Total:  6h 45m | Avg: 12m 17s | Max: 16m 04s | Hits:  90%/123   
    🟨 cxx
      🟩 Clang9             Pass: 100%/2   | Total: 23m 10s | Avg: 11m 35s | Max: 12m 19s
      🟩 Clang10            Pass: 100%/2   | Total: 23m 34s | Avg: 11m 47s | Max: 12m 26s
      🟩 Clang11            Pass: 100%/4   | Total: 45m 48s | Avg: 11m 27s | Max: 11m 46s
      🟩 Clang12            Pass: 100%/4   | Total: 47m 40s | Avg: 11m 55s | Max: 12m 22s
      🟩 Clang13            Pass: 100%/4   | Total: 47m 33s | Avg: 11m 53s | Max: 12m 44s
      🟨 Clang14            Pass:  75%/4   | Total: 49m 17s | Avg: 12m 19s | Max: 14m 42s
      🟩 Clang15            Pass: 100%/2   | Total: 25m 21s | Avg: 12m 40s | Max: 12m 50s
      🟩 Clang16            Pass: 100%/4   | Total: 45m 41s | Avg: 11m 25s | Max: 12m 22s
      🟩 Clang17            Pass: 100%/2   | Total: 25m 29s | Avg: 12m 44s | Max: 13m 01s
      🟨 Clang18            Pass:  50%/2   | Total: 28m 17s | Avg: 14m 08s | Max: 14m 59s
      🟩 GCC9               Pass: 100%/2   | Total: 26m 15s | Avg: 13m 07s | Max: 13m 52s
      🟩 GCC10              Pass: 100%/4   | Total: 48m 07s | Avg: 12m 01s | Max: 12m 38s
      🟩 GCC11              Pass: 100%/4   | Total: 48m 29s | Avg: 12m 07s | Max: 12m 42s
      🟨 GCC12              Pass:  57%/7   | Total:  1h 34m | Avg: 13m 33s | Max: 16m 04s
      🟩 GCC13              Pass: 100%/3   | Total: 31m 11s | Avg: 10m 23s | Max: 11m 38s
      🟥 MSVC14.36          Pass:   0%/1   | Total: 10m 59s | Avg: 10m 59s | Max: 10m 59s
      🟩 MSVC14.39          Pass: 100%/1   | Total:  8m 17s | Avg:  8m 17s | Max:  8m 17s | Hits:  90%/123   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 15s
    🟨 cxx_family
      🟨 Clang              Pass:  93%/30  | Total:  6h 01m | Avg: 12m 03s | Max: 14m 59s
      🟨 GCC                Pass:  85%/20  | Total:  4h 08m | Avg: 12m 26s | Max: 16m 04s
      🟨 MSVC               Pass:  50%/2   | Total: 19m 16s | Avg:  9m 38s | Max: 10m 59s | Hits:  90%/123   
      🟩 NVHPC              Pass: 100%/2   | Total: 10m 27s | Avg:  5m 13s | Max:  5m 15s
    🟨 cudacxx_family
      🟨 nvcc               Pass:  88%/54  | Total: 10h 40m | Avg: 11m 51s | Max: 16m 04s | Hits:  90%/123   
    🟨 gpu
      🟨 v100               Pass:  88%/54  | Total: 10h 40m | Avg: 11m 51s | Max: 16m 04s | Hits:  90%/123   
    🟨 jobs
      🟨 Build              Pass:  97%/49  | Total:  9h 25m | Avg: 11m 32s | Max: 14m 36s | Hits:  90%/123   
      🟥 Test               Pass:   0%/5   | Total:  1h 15m | Avg: 15m 00s | Max: 16m 04s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  8m 32s | Avg:  8m 32s | Max:  8m 32s
      🟩 90a                Pass: 100%/1   | Total:  9m 00s | Avg:  9m 00s | Max:  9m 00s
    🟨 std
      🟨 17                 Pass:  93%/29  | Total:  5h 40m | Avg: 11m 44s | Max: 16m 04s
      🟨 20                 Pass:  84%/25  | Total:  4h 59m | Avg: 11m 59s | Max: 15m 25s | Hits:  90%/123   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
+/- CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 54)

# Runner
43 linux-amd64-cpu16
5 linux-amd64-gpu-v100-latest-1
4 linux-arm64-cpu16
2 windows-amd64-cpu16

Labels
stf Sequential Task Flow programming model
Projects
Status: In Progress