Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Use P2322R6 to determine intermediate types for relevant algorithms #428

Closed
codecircuit opened this issue Jan 28, 2022 · 6 comments
Closed
Assignees
Labels
P1: should have Necessary, but not critical. type: bug: functional Does not work as intended.
Milestone

Comments

@codecircuit
Copy link

codecircuit commented Jan 28, 2022

The reduction_op is called with a type derived from the output iterator. Example:

#include <cub/cub.cuh>
#include <iostream>

struct CustomMin {
	template <typename T>
	__device__ __host__ __forceinline__ T operator()(const T& a,
	                                                 const T& b) const {
		return (b < a) ? b : a;
	}
};

struct result_t {
	result_t() = default;
	__device__ __host__ result_t(int i) : data(i) {
	}

	double data = 0;
};

inline std::ostream& operator<<(std::ostream& os, const result_t& r) {
	os << r.data;
	return os;
}

int main() {
	int* d_offsets;
	int* d_in;
	using OT = result_t;
	//using OT = int; // compiles
	OT* d_out;
	CustomMin min_op;

	int initial_value = INT_MAX;

	const int num_segments = 3;
	const int num_values = 7;

	cudaMallocManaged(&d_offsets, sizeof(int) * (num_segments + 1));
	cudaMallocManaged(&d_in, sizeof(int) * num_values);
	cudaMallocManaged(&d_out, sizeof(OT) * num_segments);

	d_offsets[0] = 0;
	d_offsets[1] = 3;
	d_offsets[2] = 3;
	d_offsets[3] = 7;

	d_in[0] = 8;
	d_in[1] = 6;
	d_in[2] = 7;
	d_in[3] = 5;
	d_in[4] = 3;
	d_in[5] = 0;
	d_in[6] = 9;

	void* d_temp_storage = NULL;
	size_t temp_storage_bytes = 0;

	cub::DeviceSegmentedReduce::Reduce(d_temp_storage,
	                                   temp_storage_bytes,
	                                   d_in,
	                                   d_out,
	                                   num_segments,
	                                   d_offsets,
	                                   d_offsets + 1,
	                                   min_op,
	                                   initial_value);

	cudaMalloc(&d_temp_storage, temp_storage_bytes);

	cub::DeviceSegmentedReduce::Reduce(d_temp_storage,
	                                   temp_storage_bytes,
	                                   d_in,
	                                   d_out,
	                                   num_segments,
	                                   d_offsets,
	                                   d_offsets + 1,
	                                   min_op,
	                                   initial_value);
	cudaDeviceSynchronize();

	for (int i = 0; i < num_segments; ++i) {
		std::cout << "d_out[" << i << "] = " << d_out[i] << std::endl;
	}

	cudaFree(d_offsets);
	cudaFree(d_in);
	cudaFree(d_out);
	cudaFree(d_temp_storage);
}

If this snippet is compiled with CUDA 11.4.48, CUB on commit 93f26ab and Thrust on commit 0b00326becfdd7a78182b36d0752c41b341863b2, which represent the current state of the default branches of CUB and Thrust, you recieve the error:

reduction.cu(8): error: no operator "<" matches these operands
            operand types are: const cub::detail::non_void_value_t<result_t *, cub::detail::value_t<int *>> < const cub::detail::non_void_value_t<result_t *, cub::detail::value_t<int *>>
          detected during:
            instantiation of "T CustomMin::operator()(const T &, const T &) const [with T=cub::detail::non_void_value_t<result_t *, cub::detail::value_t<int *>>]" 
/home/cklein2/repositories/cub/cub/device/dispatch/../../agent/agent_reduce.cuh(297): here
            instantiation of "void cub::AgentReduce<AgentReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOp>::ConsumeTile<IS_FIRST_TILE,CAN_VECTORIZE>(cub::AgentReduce<AgentReducePolicy, InputIteratorT, OutputIteratorT, OffsetT, ReductionOp>::OutputT &, OffsetT, int, cub::Int2Type<0>, cub::Int2Type<CAN_VECTORIZE>) [with AgentReducePolicy=cub::AgentReducePolicy<256, 16, cub::detail::value_t<int *>, 4, cub::BLOCK_REDUCE_WARP_REDUCTIONS, cub::LOAD_LDG, cub::MemBoundScaling<256, 16, cub::detail::value_t<int *>>>, InputIteratorT=int *, OutputIteratorT=result_t *, OffsetT=int, ReductionOp=CustomMin, IS_FIRST_TILE=1, CAN_VECTORIZE=1]"
...

The compilation also fails for the CUB and Thrust installation, which comes with the CUDA Toolkit 11.4.48.
Clearly, the reduction operator is invoked with a type based on result_t. The documentation currently says:

ReductionOp | [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)
T | [inferred] Data element type that is convertible to the value type of InputIteratorT

I think it is intuitive to assume that the reduction operator is called with a type derived from the input iterator and not from the output iterator. So maybe the documentation can state more precisely how type T is derived.

@alliepiper
Copy link
Collaborator

Agreed -- this should use either the initial value type or input iterator value type for consistency with https://wg21.link/P0571. We made this change a while ago for the scan algorithms and need to update the others as well.

@alliepiper alliepiper changed the title cub::DeviceSegmentedReduce::Reduce with different input and output types Use P0571 to determine intermediate types for relevant algorithms Feb 9, 2022
@alliepiper alliepiper added type: bug: functional Does not work as intended. P1: should have Necessary, but not critical. labels Feb 9, 2022
@alliepiper alliepiper added this to the 1.17.0 milestone Feb 9, 2022
@alliepiper
Copy link
Collaborator

Updated the title to be more general, as we should verify that the other algorithms listed in the proposal are following these conventions, too.

@alliepiper
Copy link
Collaborator

@brycelelbach mentioned that there's an updated version of this proposal under a different name -- check with him to get the latest recommendations.

@gevtushenko
Copy link
Collaborator

gevtushenko commented Jun 3, 2022

The alternative approach is described in P2322R6

@alliepiper alliepiper changed the title Use P0571 to determine intermediate types for relevant algorithms Use P2322R6 to determine intermediate types for relevant algorithms Jun 13, 2022
@alliepiper
Copy link
Collaborator

There's a newer proposal than P0571 that we'll be using (@senior-zero linked this paper above). The intermediate type is determined by the result of the operator.

@gevtushenko
Copy link
Collaborator

Relevant changes were merged in the following PRs:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P1: should have Necessary, but not critical. type: bug: functional Does not work as intended.
Projects
None yet
Development

No branches or pull requests

3 participants