Use P2322R6 to determine intermediate types for relevant algorithms #428

codecircuit · 2022-01-28T12:58:07Z

The reduction_op is called with a type derived from the output iterator. Example:

#include <cub/cub.cuh>
#include <iostream>

struct CustomMin {
	template <typename T>
	__device__ __host__ __forceinline__ T operator()(const T& a,
	                                                 const T& b) const {
		return (b < a) ? b : a;
	}
};

struct result_t {
	result_t() = default;
	__device__ __host__ result_t(int i) : data(i) {
	}

	double data = 0;
};

inline std::ostream& operator<<(std::ostream& os, const result_t& r) {
	os << r.data;
	return os;
}

int main() {
	int* d_offsets;
	int* d_in;
	using OT = result_t;
	//using OT = int; // compiles
	OT* d_out;
	CustomMin min_op;

	int initial_value = INT_MAX;

	const int num_segments = 3;
	const int num_values = 7;

	cudaMallocManaged(&d_offsets, sizeof(int) * (num_segments + 1));
	cudaMallocManaged(&d_in, sizeof(int) * num_values);
	cudaMallocManaged(&d_out, sizeof(OT) * num_segments);

	d_offsets[0] = 0;
	d_offsets[1] = 3;
	d_offsets[2] = 3;
	d_offsets[3] = 7;

	d_in[0] = 8;
	d_in[1] = 6;
	d_in[2] = 7;
	d_in[3] = 5;
	d_in[4] = 3;
	d_in[5] = 0;
	d_in[6] = 9;

	void* d_temp_storage = NULL;
	size_t temp_storage_bytes = 0;

	cub::DeviceSegmentedReduce::Reduce(d_temp_storage,
	                                   temp_storage_bytes,
	                                   d_in,
	                                   d_out,
	                                   num_segments,
	                                   d_offsets,
	                                   d_offsets + 1,
	                                   min_op,
	                                   initial_value);

	cudaMalloc(&d_temp_storage, temp_storage_bytes);

	cub::DeviceSegmentedReduce::Reduce(d_temp_storage,
	                                   temp_storage_bytes,
	                                   d_in,
	                                   d_out,
	                                   num_segments,
	                                   d_offsets,
	                                   d_offsets + 1,
	                                   min_op,
	                                   initial_value);
	cudaDeviceSynchronize();

	for (int i = 0; i < num_segments; ++i) {
		std::cout << "d_out[" << i << "] = " << d_out[i] << std::endl;
	}

	cudaFree(d_offsets);
	cudaFree(d_in);
	cudaFree(d_out);
	cudaFree(d_temp_storage);
}

If this snippet is compiled with CUDA 11.4.48, CUB on commit 93f26ab and Thrust on commit 0b00326becfdd7a78182b36d0752c41b341863b2, which represent the current state of the default branches of CUB and Thrust, you recieve the error:

reduction.cu(8): error: no operator "<" matches these operands
            operand types are: const cub::detail::non_void_value_t<result_t *, cub::detail::value_t<int *>> < const cub::detail::non_void_value_t<result_t *, cub::detail::value_t<int *>>
          detected during:
            instantiation of "T CustomMin::operator()(const T &, const T &) const [with T=cub::detail::non_void_value_t<result_t *, cub::detail::value_t<int *>>]" 
/home/cklein2/repositories/cub/cub/device/dispatch/../../agent/agent_reduce.cuh(297): here
            instantiation of "void cub::AgentReduce<AgentReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOp>::ConsumeTile<IS_FIRST_TILE,CAN_VECTORIZE>(cub::AgentReduce<AgentReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOp>::OutputT &, OffsetT, int, cub::Int2Type<0>, cub::Int2Type<CAN_VECTORIZE>) [with AgentReducePolicy=cub::AgentReducePolicy<256, 16, cub::detail::value_t<int *>, 4, cub::BLOCK_REDUCE_WARP_REDUCTIONS, cub::LOAD_LDG, cub::MemBoundScaling<256, 16, cub::detail::value_t<int *>>>, InputIteratorT=int *, OutputIteratorT=result_t *, OffsetT=int, ReductionOp=CustomMin, IS_FIRST_TILE=1, CAN_VECTORIZE=1]"
...

The compilation also fails for the CUB and Thrust installation, which comes with the CUDA Toolkit 11.4.48.
Clearly, the reduction operator is invoked with a type based on result_t. The documentation currently says:

ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T | [inferred] Data element type that is convertible to the value type of InputIteratorT

I think it is intuitive to assume that the reduction operator is called with a type derived from the input iterator and not from the output iterator. So maybe the documentation can state more precisely how type T is derived.

The text was updated successfully, but these errors were encountered:

alliepiper · 2022-02-09T19:16:37Z

Agreed -- this should use either the initial value type or input iterator value type for consistency with https://wg21.link/P0571. We made this change a while ago for the scan algorithms and need to update the others as well.

alliepiper · 2022-02-09T19:18:50Z

Updated the title to be more general, as we should verify that the other algorithms listed in the proposal are following these conventions, too.

alliepiper · 2022-05-16T18:50:23Z

@brycelelbach mentioned that there's an updated version of this proposal under a different name -- check with him to get the latest recommendations.

gevtushenko · 2022-06-03T11:17:16Z

The alternative approach is described in P2322R6

alliepiper · 2022-06-13T16:28:01Z

There's a newer proposal than P0571 that we'll be using (@senior-zero linked this paper above). The intermediate type is determined by the result of the operator.

gevtushenko · 2022-08-03T16:12:35Z

Relevant changes were merged in the following PRs:

alliepiper changed the title ~~cub::DeviceSegmentedReduce::Reduce with different input and output types~~ Use P0571 to determine intermediate types for relevant algorithms Feb 9, 2022

alliepiper added type: bug: functional Does not work as intended. P1: should have Necessary, but not critical. labels Feb 9, 2022

alliepiper added this to the 1.17.0 milestone Feb 9, 2022

alliepiper modified the milestones: 1.17.0, 2.0.0 Apr 25, 2022

alliepiper assigned gevtushenko May 16, 2022

This was referenced Jun 7, 2022

Update cuda::std::invoke_result to emit diagnostic when attempting to query return type of extended lambda in host code NVIDIA/libcudacxx#277

Closed

P2322R6 accumulator types for reduce #509

Merged

alliepiper changed the title ~~Use P0571 to determine intermediate types for relevant algorithms~~ Use P2322R6 to determine intermediate types for relevant algorithms Jun 13, 2022

gevtushenko mentioned this issue Jun 16, 2022

P2322R6 accumulator types for scan and reduce by key #511

Merged

gevtushenko closed this as completed Aug 3, 2022

jrhemstad added this to CCCL Aug 11, 2022

jrhemstad removed this from CCCL Aug 11, 2022

gevtushenko mentioned this issue Jun 30, 2023

[FEA]: Use P2322R6 to determine accumulator type in Thrust NVIDIA/cccl#153

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use P2322R6 to determine intermediate types for relevant algorithms #428

Use P2322R6 to determine intermediate types for relevant algorithms #428

codecircuit commented Jan 28, 2022 •

edited

Loading

alliepiper commented Feb 9, 2022

alliepiper commented Feb 9, 2022

alliepiper commented May 16, 2022

gevtushenko commented Jun 3, 2022 •

edited

Loading

alliepiper commented Jun 13, 2022

gevtushenko commented Aug 3, 2022

Use P2322R6 to determine intermediate types for relevant algorithms #428

Use P2322R6 to determine intermediate types for relevant algorithms #428

Comments

codecircuit commented Jan 28, 2022 • edited Loading

alliepiper commented Feb 9, 2022

alliepiper commented Feb 9, 2022

alliepiper commented May 16, 2022

gevtushenko commented Jun 3, 2022 • edited Loading

alliepiper commented Jun 13, 2022

gevtushenko commented Aug 3, 2022

codecircuit commented Jan 28, 2022 •

edited

Loading

gevtushenko commented Jun 3, 2022 •

edited

Loading