This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Add unique_count algorithm #1619

Merged
merged 10 commits into from
May 7, 2022

Conversation

Contributor

@upsj commented Feb 13, 2022

Add a counting equivalent to the unique_* algorithms that can be used to allocate the correct amount of storage before actually filling it.

The formatting is a bit all over the place; I just tried to match the surrounding code as closely as possible.

I'll run a few benchmarks to see how this fares performance-wise. (I'm hoping I won't need to use cub::DeviceRunLengthEncode::Encode, because that would be more work to integrate, and it does more work internally.)

I tested everything with the CUDA, OMP and CPU backends. I don't have a TBB installation on hand, but the code is equivalent to the OpenMP backend, so hopefully I didn't break anything :)

Closes #1612

@GPUtester
Collaborator

Can one of the admins verify this patch?

Contributor Author

@upsj commented Feb 13, 2022

Here's a quick comparison with plain count_if on different GPUs:

(benchmark plots omitted: unique_count1, unique_count2, unique_count3)

Code:

#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/unique.h>

#include <chrono>
#include <iostream>
#include <string>

template <typename ForwardIt>
void benchmark(std::string desc, ForwardIt begin, ForwardIt end) {
	thrust::unique_count(begin, end);
	cudaDeviceSynchronize();
	const auto reps = 100;
	auto start = std::chrono::high_resolution_clock::now();
	for (int i = 0; i < reps; i++) {
		thrust::unique_count(begin, end);
	}
	cudaDeviceSynchronize();
	auto total = std::chrono::high_resolution_clock::now() - start;
	cudaDeviceSynchronize();
	start = std::chrono::high_resolution_clock::now();
	for (int i = 0; i < reps; i++) {
		thrust::count_if(begin, end, []__host__ __device__(typename ForwardIt::value_type val) { return val >= 0; }); // always true
	}
	cudaDeviceSynchronize();
	auto total2 = std::chrono::high_resolution_clock::now() - start;
	std::cout << desc << ',' << std::chrono::duration_cast<std::chrono::nanoseconds>(total).count() / reps
	                  << ',' << std::chrono::duration_cast<std::chrono::nanoseconds>(total2).count() / reps << '\n';
}

int main() {
	{
		thrust::device_vector<int> data(1u << 30);
		thrust::sequence(data.begin(), data.end(), 0);
		for (long i = 1; i <= static_cast<long>(data.size()); i *= 2) {
			benchmark("int," + std::to_string(i), data.begin(), data.begin() + i);
		}
	}
	{
		thrust::device_vector<long> data(1u << 29);
		thrust::sequence(data.begin(), data.end(), 0);
		for (long i = 1; i <= static_cast<long>(data.size()); i *= 2) {
			benchmark("long," + std::to_string(i), data.begin(), data.begin() + i);
		}
	}
}

Is this kind of overhead acceptable?

@alliepiper added the "type: enhancement" and "P2: nice to have" labels on Feb 18, 2022
@alliepiper added this to the 1.17.0 milestone on Feb 18, 2022
Collaborator

@ericniebler left a comment

Thanks for this. Needs a few tweaks.

thrust/system/cuda/detail/unique.h (two outdated review threads, resolved)
if (first == last) {
return 0;
}
auto size = last - first;
Collaborator

This algorithm should work with forward iterators, but taking the difference of two iterators requires random access. You'll need to find a different way to implement this. @jrhemstad and @allisonvacanti suggest a possible implementation in #1619.

Contributor Author

I suppose thrust::distance would be appropriate then? I'm having a hard time imagining how you would implement this parallel algorithm without random-access iterators.

Regardless of that, I'll give DeviceRunLengthEncode::Encode a try, since it's simpler than I thought on my first look.

Contributor Author

It seems like cub doesn't like discard_iterator, since it writes and then reads back the first key of each run. So I think this is not possible without O(size) additional temporary storage.

Collaborator

Using thrust::distance is correct here.

Could you file a CUB issue describing the discard iterator bug?

Contributor Author

Turns out this has already been reported: https://github.com/NVIDIA/thrust/issues/1490

Contributor Author

@upsj Mar 11, 2022

Okay, DeviceRunLengthEncode::Encode works with cub::DiscardOutputIterator, but (at least on the Titan X I tested on) it gives significantly worse performance. Comparison (bandwidth in GB/s):

int

| n | count_if | Encode |
| --- | --- | --- |
| 16M | 179.173 | 109.111 |
| 1B | 254.523 | 160.037 |

long

| n | count_if | Encode |
| --- | --- | --- |
| 16M | 259.123 | 180.335 |
| 512M | 315.969 | 215.304 |

pair<long, int>

| n | count_if | Encode |
| --- | --- | --- |
| 16M | 351.678 | 284.543 |
| 256M | 392.726 | 331.611 |


ASSERT_EQUAL(div_10_count, 3);
}
DECLARE_INTEGRAL_VECTOR_UNITTEST(TestUniqueCountSimple);
Collaborator

unique_count claims to work with forward iterators. There should be a test using forward iterators.

Contributor Author

Is there a nice way to wrap an iterator into a forward_iterator in Thrust? I wrote a small wrapper class, and it seems to work and compile, but I suppose this problem has already been solved elsewhere?

Collaborator

@ericniebler is right, we should be testing this. Unfortunately we lack robust testing for these sorts of things.

Thanks for adding the new testing infrastructure! Please include it in this PR, ideally in the testing framework so we can reuse it from other tests later 👍

Contributor Author

I added a test to unique_copy and unique. I am not 100% sure it does what we would expect: due to the missing iterator tag, it gets executed sequentially on the CPU, using device references for access.

Collaborator

You mean, forward iterators always dispatch to the CPU? @allisonvacanti can you comment on that? I mean, it seems reasonable to me.

Collaborator

This sounded odd to me, as I've never seen any logic in Thrust that would dispatch forward iterators to serial implementations. So I dug into it, and unfortunately this is due to a pretty nasty bug in Thrust's iterator traits.

The details are gory, but I've summarized them in a comment on #902.

Until that's fixed, I'm not comfortable merging this test using the forward_iterator_wrapper, since they only "do the right thing" because the iterator framework is broken.

I hate to churn on this PR even more, but I think we should remove the iterator wrappers for now and just test that the regular iterators work. We can re-introduce the wrapper tests as part of #55, after #902 is fixed and settled.

@ericniebler Can you review the two linked issues and see if you agree with my suggestion?

Collaborator

I don't think that forward iterators actually need to be dispatched to the sequential backend. They support multipass reading and should be usable in a parallel algorithm, so long as they're only copied and incremented. Is there something in the unique_count/count_if algorithms that would break them?

Contributor Author

@upsj May 6, 2022

The main issue I see with parallel execution on forward iterators is that they introduce an essentially linear dependency chain: either every thread i starts from begin and increments it i times, or it waits until one of its predecessors j is done and has published its iterator somewhere, and then increments it i - j times. Neither really seems useful for parallel execution to me.

Collaborator

It will require more increments, but if the work per element is expensive compared to the cost of the iterator increment, it may still make sense to parallelize. I'd rather let the user make that call, since they can opt in to sequential execution by passing the sequential execution policy (thrust::seq).

More importantly, the sequential implementation executes on CPU, and some types of device memory aren't accessible from the CPU's address space, so switching to seq really needs to be opt-in rather than default.

Contributor Author

Thanks for the clarification, that makes sense! I was only thinking of simple but massively parallel cases.

@jrhemstad
Collaborator

> Is this kind of overhead acceptable?

I'm suspicious of why this would be slower than the count_if benchmark by a non-trivial amount. It should move the same amount of data from global memory. Technically each thread now loads 2 elements, but one of those is also loaded by the adjacent thread, so both loads are coalesced and the second should easily hit in L1/L2.

@upsj if you're interested, I'd be curious to see an Nsight Compute profile of those two kernels with emphasis on the memory movement summary. That would tell us if unique_count somehow has more global memory traffic.

Contributor Author

@upsj commented Mar 11, 2022

@jrhemstad Yes, I thought the same thing. I don't have Nsight Compute set up on my laptop, but I can give you some nvprof numbers on a Titan X (int with 1B elements):

| Metric | count_if | unique_count | Ratio |
| --- | --- | --- | --- |
| Global Load Transactions | 536870914 | 1073741850 | roughly 2x |
| L2 Read Transactions | 134218920 | 301989986 | a bit more than 2x |
| Device Memory Read Transactions | 134217854 | 163413796 | roughly 25% more |
| Unified Cache Throughput | 280.71 GB/s | 478.92 GB/s | significantly less than 2x |
| Global Load Throughput | 280.71 GB/s | 538.79 GB/s | slightly less than 2x |

Occupancy, IPC etc. are similar, so it seems like the decrease in achieved throughput comes from the higher number of DRAM accesses. I am not familiar enough with the details of the texture cache etc. to interpret some of the other results (especially the difference in L2 texture cache hit rate). Of course, with each element only being read once, L2 doesn't really matter for plain count_if. I'll just dump the numbers below in case you are interested:

    Kernel: _ZN3cub18DeviceReduceKernelINS_18DeviceReducePolicyIlliN6thrust4plusIlEEE9Policy600ENS2_8cuda_cub26transform_input_iterator_tIlNS2_6detail15normal_iteratorINS2_10device_ptrIiEEEEZ9benchmarkISD_EvNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEET_SL_EUliE_EEPliS4_EEvT0_T1_T2_NS_13GridEvenShareISR_EET3_
          1                             inst_per_warp                                                 Instructions per warp  1.9543e+04  1.9543e+04  1.9543e+04
          1                         branch_efficiency                                                     Branch Efficiency     100.00%     100.00%     100.00%
          1                 warp_execution_efficiency                                             Warp Execution Efficiency      99.99%      99.99%      99.99%
          1         warp_nonpred_execution_efficiency                              Warp Non-Predicated Execution Efficiency      99.95%      99.95%      99.95%
          1                      inst_replay_overhead                                           Instruction Replay Overhead    0.000073    0.000073    0.000073
          1      shared_load_transactions_per_request                           Shared Memory Load Transactions Per Request    1.750000    1.750000    1.750000
          1     shared_store_transactions_per_request                          Shared Memory Store Transactions Per Request    2.000000    2.000000    2.000000
          1       local_load_transactions_per_request                            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
          1      local_store_transactions_per_request                           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
          1              gld_transactions_per_request                                  Global Load Transactions Per Request   16.000000   16.000000   16.000000
          1              gst_transactions_per_request                                 Global Store Transactions Per Request    1.000000    1.000000    1.000000
          1                 shared_store_transactions                                             Shared Store Transactions       17920       17920       17920
          1                  shared_load_transactions                                              Shared Load Transactions        7840        7840        7840
          1                   local_load_transactions                                               Local Load Transactions           0           0           0
          1                  local_store_transactions                                              Local Store Transactions           0           0           0
          1                          gld_transactions                                              Global Load Transactions   536870914   536870914   536870914
          1                          gst_transactions                                             Global Store Transactions        1120        1120        1120
          1                  sysmem_read_transactions                                       System Memory Read Transactions           0           0           0
          1                 sysmem_write_transactions                                      System Memory Write Transactions           5           5           5
          1                      l2_read_transactions                                                  L2 Read Transactions   134218920   134218920   134218920
          1                     l2_write_transactions                                                 L2 Write Transactions        1133        1133        1133
          1                           global_hit_rate                                     Global Hit Rate in unified l1/tex       0.00%       0.00%       0.00%
          1                            local_hit_rate                                                        Local Hit Rate       0.00%       0.00%       0.00%
          1                  gld_requested_throughput                                      Requested Global Load Throughput  280.71GB/s  280.71GB/s  280.71GB/s
          1                  gst_requested_throughput                                     Requested Global Store Throughput  614.05KB/s  614.05KB/s  614.05KB/s
          1                            gld_throughput                                                Global Load Throughput  280.71GB/s  280.71GB/s  280.71GB/s
          1                            gst_throughput                                               Global Store Throughput  2.3986MB/s  2.3986MB/s  2.3986MB/s
          1                     local_memory_overhead                                                 Local Memory Overhead       0.00%       0.00%       0.00%
          1                        tex_cache_hit_rate                                                Unified Cache Hit Rate      50.00%      50.00%      50.00%
          1                      l2_tex_read_hit_rate                                           L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
          1                     l2_tex_write_hit_rate                                          L2 Hit Rate (Texture Writes)      81.25%      81.25%      81.25%
          1                      tex_cache_throughput                                              Unified Cache Throughput  280.71GB/s  280.71GB/s  280.71GB/s
          1                    l2_tex_read_throughput                                         L2 Throughput (Texture Reads)  280.71GB/s  280.71GB/s  280.71GB/s
          1                   l2_tex_write_throughput                                        L2 Throughput (Texture Writes)  2.3986MB/s  2.3986MB/s  2.3986MB/s
          1                        l2_read_throughput                                                 L2 Throughput (Reads)  280.71GB/s  280.71GB/s  280.71GB/s
          1                       l2_write_throughput                                                L2 Throughput (Writes)  2.4265MB/s  2.4265MB/s  2.4265MB/s
          1                    sysmem_read_throughput                                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   sysmem_write_throughput                                        System Memory Write Throughput  10.965KB/s  10.965KB/s  10.964KB/s
          1                     local_load_throughput                                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    local_store_throughput                                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    shared_load_throughput                                         Shared Memory Load Throughput  67.162MB/s  67.162MB/s  67.162MB/s
          1                   shared_store_throughput                                        Shared Memory Store Throughput  153.51MB/s  153.51MB/s  153.51MB/s
          1                            gld_efficiency                                         Global Memory Load Efficiency     100.00%     100.00%     100.00%
          1                            gst_efficiency                                        Global Memory Store Efficiency      25.00%      25.00%      25.00%
          1                    tex_cache_transactions                                            Unified Cache Transactions   134217728   134217728   134217728
          1                             flop_count_dp                           Floating Point Operations(Double Precision)           0           0           0
          1                         flop_count_dp_add                       Floating Point Operations(Double Precision Add)           0           0           0
          1                         flop_count_dp_fma                       Floating Point Operations(Double Precision FMA)           0           0           0
          1                         flop_count_dp_mul                       Floating Point Operations(Double Precision Mul)           0           0           0
          1                             flop_count_sp                           Floating Point Operations(Single Precision)           0           0           0
          1                         flop_count_sp_add                       Floating Point Operations(Single Precision Add)           0           0           0
          1                         flop_count_sp_fma                       Floating Point Operations(Single Precision FMA)           0           0           0
          1                         flop_count_sp_mul                        Floating Point Operation(Single Precision Mul)           0           0           0
          1                     flop_count_sp_special                   Floating Point Operations(Single Precision Special)           0           0           0
          1                             inst_executed                                                 Instructions Executed   175107104   175107104   175107104
          1                               inst_issued                                                   Instructions Issued   175121530   175121530   175121530
          1                        sysmem_utilization                                             System Memory Utilization     Low (1)     Low (1)     Low (1)
          1                          stall_inst_fetch                              Issue Stall Reasons (Instructions Fetch)       0.58%       0.58%       0.58%
          1                     stall_exec_dependency                            Issue Stall Reasons (Execution Dependency)       0.68%       0.68%       0.68%
          1                   stall_memory_dependency                                    Issue Stall Reasons (Data Request)      44.42%      44.42%      44.42%
          1                             stall_texture                                         Issue Stall Reasons (Texture)      31.05%      31.05%      31.05%
          1                                stall_sync                                 Issue Stall Reasons (Synchronization)       0.95%       0.95%       0.95%
          1                               stall_other                                           Issue Stall Reasons (Other)      22.09%      22.09%      22.09%
          1          stall_constant_memory_dependency                              Issue Stall Reasons (Immediate constant)       0.00%       0.00%       0.00%
          1                           stall_pipe_busy                                       Issue Stall Reasons (Pipe Busy)       0.12%       0.12%       0.12%
          1                         shared_efficiency                                              Shared Memory Efficiency       4.08%       4.08%       4.08%
          1                                inst_fp_32                                               FP Instructions(Single)           0           0           0
          1                                inst_fp_64                                               FP Instructions(Double)           0           0           0
          1                              inst_integer                                                  Integer Instructions  4087725856  4087725856  4087725856
          1                          inst_bit_convert                                              Bit-Convert Instructions           0           0           0
          1                              inst_control                                             Control-Flow Instructions   107949312   107949312   107949312
          1                        inst_compute_ld_st                                               Load/Store Instructions  1074043104  1074043104  1074043104
          1                                 inst_misc                                                     Misc Instructions   327860512   327860512   327860512
          1           inst_inter_thread_communication                                             Inter-Thread Instructions     2867200     2867200     2867200
          1                               issue_slots                                                           Issue Slots   168344960   168344960   168344960
          1                                 cf_issued                                      Issued Control-Flow Instructions     3402568     3402568     3402568
          1                               cf_executed                                    Executed Control-Flow Instructions     3402568     3402568     3402568
          1                               ldst_issued                                        Issued Load/Store Instructions   134366688   134366688   134366688
          1                             ldst_executed                                      Executed Load/Store Instructions    33712352    33712352    33712352
          1                       atomic_transactions                                                   Atomic Transactions           0           0           0
          1           atomic_transactions_per_request                                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
          1                      l2_atomic_throughput                                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
          1                    l2_atomic_transactions                                     L2 Transactions (Atomic requests)           0           0           0
          1                  l2_tex_read_transactions                                       L2 Transactions (Texture Reads)   134217728   134217728   134217728
          1                     stall_memory_throttle                                 Issue Stall Reasons (Memory Throttle)       0.00%       0.00%       0.00%
          1                        stall_not_selected                                    Issue Stall Reasons (Not Selected)       0.10%       0.10%       0.10%
          1                 l2_tex_write_transactions                                      L2 Transactions (Texture Writes)        1120        1120        1120
          1                             flop_count_hp                             Floating Point Operations(Half Precision)           0           0           0
          1                         flop_count_hp_add                         Floating Point Operations(Half Precision Add)           0           0           0
          1                         flop_count_hp_mul                          Floating Point Operation(Half Precision Mul)           0           0           0
          1                         flop_count_hp_fma                         Floating Point Operations(Half Precision FMA)           0           0           0
          1                                inst_fp_16                                                 HP Instructions(Half)           0           0           0
          1                   sysmem_read_utilization                                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
          1                  sysmem_write_utilization                                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
          1               pcie_total_data_transmitted                                           PCIe Total Data Transmitted         512         512         512
          1                  pcie_total_data_received                                              PCIe Total Data Received           0           0           0
          1                inst_executed_global_loads                              Warp level instructions for global loads    33554432    33554432    33554432
          1                 inst_executed_local_loads                               Warp level instructions for local loads           0           0           0
          1                inst_executed_shared_loads                              Warp level instructions for shared loads        4480        4480        4480
          1               inst_executed_surface_loads                             Warp level instructions for surface loads           0           0           0
          1               inst_executed_global_stores                             Warp level instructions for global stores        1120        1120        1120
          1                inst_executed_local_stores                              Warp level instructions for local stores           0           0           0
          1               inst_executed_shared_stores                             Warp level instructions for shared stores        8960        8960        8960
          1              inst_executed_surface_stores                            Warp level instructions for surface stores           0           0           0
          1              inst_executed_global_atomics                  Warp level instructions for global atom and atom cas           0           0           0
          1           inst_executed_global_reductions                         Warp level instructions for global reductions           0           0           0
          1             inst_executed_surface_atomics                 Warp level instructions for surface atom and atom cas           0           0           0
          1          inst_executed_surface_reductions                        Warp level instructions for surface reductions           0           0           0
          1              inst_executed_shared_atomics                  Warp level shared instructions for atom and atom CAS           0           0           0
          1                     inst_executed_tex_ops                                   Warp level instructions for texture           0           0           0
          1                      l2_global_load_bytes       Bytes read from L2 for misses in Unified Cache for global loads  4294967296  4294967296  4294967296
          1                       l2_local_load_bytes        Bytes read from L2 for misses in Unified Cache for local loads           0           0           0
          1                     l2_surface_load_bytes      Bytes read from L2 for misses in Unified Cache for surface loads           0           0           0
          1               l2_local_global_store_bytes   Bytes written to L2 from Unified Cache for local and global stores.       35840       35840       35840
          1                 l2_global_reduction_bytes          Bytes written to L2 from Unified cache for global reductions           0           0           0
          1              l2_global_atomic_store_bytes             Bytes written to L2 from Unified cache for global atomics           0           0           0
          1                    l2_surface_store_bytes            Bytes written to L2 from Unified Cache for surface stores.           0           0           0
          1                l2_surface_reduction_bytes         Bytes written to L2 from Unified Cache for surface reductions           0           0           0
          1             l2_surface_atomic_store_bytes    Bytes transferred between Unified Cache and L2 for surface atomics           0           0           0
          1                      global_load_requests              Total number of global load requests from Multiprocessor   134217728   134217728   134217728
          1                       local_load_requests               Total number of local load requests from Multiprocessor           0           0           0
          1                     surface_load_requests             Total number of surface load requests from Multiprocessor           0           0           0
          1                     global_store_requests             Total number of global store requests from Multiprocessor        1120        1120        1120
          1                      local_store_requests              Total number of local store requests from Multiprocessor           0           0           0
          1                    surface_store_requests            Total number of surface store requests from Multiprocessor           0           0           0
          1                    global_atomic_requests            Total number of global atomic requests from Multiprocessor           0           0           0
          1                 global_reduction_requests         Total number of global reduction requests from Multiprocessor           0           0           0
          1                   surface_atomic_requests           Total number of surface atomic requests from Multiprocessor           0           0           0
          1                surface_reduction_requests        Total number of surface reduction requests from Multiprocessor           0           0           0
          1                         sysmem_read_bytes                                              System Memory Read Bytes           0           0           0
          1                        sysmem_write_bytes                                             System Memory Write Bytes         160         160         160
          1                           l2_tex_hit_rate                                                     L2 Cache Hit Rate       0.00%       0.00%       0.00%
          1                     texture_load_requests             Total number of texture Load requests from Multiprocessor           0           0           0
          1                     unique_warps_launched                                              Number of warps launched        8960        8960        8960
          1                             sm_efficiency                                               Multiprocessor Activity      99.55%      99.55%      99.55%
          1                        achieved_occupancy                                                    Achieved Occupancy    0.998635    0.998635    0.998635
          1                                       ipc                                                          Executed IPC    0.241273    0.241273    0.241273
          1                                issued_ipc                                                            Issued IPC    0.240266    0.240266    0.240266
          1                    issue_slot_utilization                                                Issue Slot Utilization       5.77%       5.77%       5.77%
          1                  eligible_warps_per_cycle                                       Eligible Warps Per Active Cycle    0.293894    0.293894    0.293894
          1                           tex_utilization                                             Unified Cache Utilization     Low (1)     Low (1)     Low (1)
          1                            l2_utilization                                                  L2 Cache Utilization     Low (2)     Low (2)     Low (2)
          1                        shared_utilization                                             Shared Memory Utilization     Low (1)     Low (1)     Low (1)
          1                       ldst_fu_utilization                                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                         cf_fu_utilization                                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                    special_fu_utilization                                     Special Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        tex_fu_utilization                                     Texture Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1           single_precision_fu_utilization                            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1           double_precision_fu_utilization                            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        flop_hp_efficiency                                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
          1                        flop_sp_efficiency                                          FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_dp_efficiency                                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
          1                    dram_read_transactions                                       Device Memory Read Transactions   134217854   134217854   134217854
          1                   dram_write_transactions                                      Device Memory Write Transactions        1431        1431        1431
          1                      dram_read_throughput                                         Device Memory Read Throughput  280.71GB/s  280.71GB/s  280.71GB/s
          1                     dram_write_throughput                                        Device Memory Write Throughput  3.0647MB/s  3.0647MB/s  3.0647MB/s
          1                          dram_utilization                                             Device Memory Utilization    High (7)    High (7)    High (7)
          1             half_precision_fu_utilization                              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                          ecc_transactions                                                      ECC Transactions           0           0           0
          1                            ecc_throughput                                                        ECC Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                           dram_read_bytes                                Total bytes read from DRAM to L2 cache  4294971328  4294971328  4294971328
          1                          dram_write_bytes                             Total bytes written from L2 cache to DRAM       45792       45792       45792
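
If I'm reading these counters right, the DRAM byte and transaction counters in the dump above are mutually consistent under the usual 32-byte DRAM sector size on this architecture (an assumption on my part, not something the profiler states). A quick sanity check using the numbers from the table:

```python
# Values copied from the profiler table above (first kernel, avg column).
dram_read_transactions = 134217854
dram_read_bytes = 4294971328

# Assumption: each DRAM transaction moves one 32-byte sector, so
# transactions * 32 should reproduce the byte counter exactly.
print(dram_read_transactions * 32 == dram_read_bytes)  # prints: True
```

The read volume (~4 GiB) also matches a single pass over 2^30 `int` elements, which is consistent with this being a memory-bound single-pass reduction.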

    Kernel: void cub::DeviceReduceKernel<cub::DeviceReducePolicy<long, long, int, thrust::plus<long>>::Policy600, thrust::cuda_cub::transform_input_iterator_t<long, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::zip_adj_not_predicate<thrust::equal_to<int>>>, long*, int, thrust::plus<long>>(long, int, long, cub::GridEvenShare<int>, thrust::plus<long>)
          1                             inst_per_warp                                                 Instructions per warp  2.4790e+04  2.4790e+04  2.4790e+04
          1                         branch_efficiency                                                     Branch Efficiency     100.00%     100.00%     100.00%
          1                 warp_execution_efficiency                                             Warp Execution Efficiency      99.99%      99.99%      99.99%
          1         warp_nonpred_execution_efficiency                              Warp Non-Predicated Execution Efficiency      99.96%      99.96%      99.96%
          1                      inst_replay_overhead                                           Instruction Replay Overhead    0.000072    0.000074    0.000073
          1      shared_load_transactions_per_request                           Shared Memory Load Transactions Per Request    1.750000    1.750000    1.750000
          1     shared_store_transactions_per_request                          Shared Memory Store Transactions Per Request    2.000000    2.000000    2.000000
          1       local_load_transactions_per_request                            Local Memory Load Transactions Per Request    0.000000    0.000000    0.000000
          1      local_store_transactions_per_request                           Local Memory Store Transactions Per Request    0.000000    0.000000    0.000000
          1              gld_transactions_per_request                                  Global Load Transactions Per Request   15.999999   15.999999   15.999999
          1              gst_transactions_per_request                                 Global Store Transactions Per Request    1.000000    1.000000    1.000000
          1                 shared_store_transactions                                             Shared Store Transactions       17920       17920       17920
          1                  shared_load_transactions                                              Shared Load Transactions        7840        7840        7840
          1                   local_load_transactions                                               Local Load Transactions           0           0           0
          1                  local_store_transactions                                              Local Store Transactions           0           0           0
          1                          gld_transactions                                              Global Load Transactions  1073741850  1073741850  1073741850
          1                          gst_transactions                                             Global Store Transactions        1120        1120        1120
          1                  sysmem_read_transactions                                       System Memory Read Transactions           0           0           0
          1                 sysmem_write_transactions                                      System Memory Write Transactions           5           5           5
          1                      l2_read_transactions                                                  L2 Read Transactions   301989986   301990962   301990474
          1                     l2_write_transactions                                                 L2 Write Transactions        1133        1133        1133
          1                           global_hit_rate                                     Global Hit Rate in unified l1/tex       0.00%       0.00%       0.00%
          1                            local_hit_rate                                                        Local Hit Rate       0.00%       0.00%       0.00%
          1                  gld_requested_throughput                                      Requested Global Load Throughput  478.92GB/s  478.93GB/s  478.93GB/s
          1                  gst_requested_throughput                                     Requested Global Store Throughput  523.82KB/s  523.83KB/s  523.82KB/s
          1                            gld_throughput                                                Global Load Throughput  538.78GB/s  538.80GB/s  538.79GB/s
          1                            gst_throughput                                               Global Store Throughput  2.0462MB/s  2.0462MB/s  2.0462MB/s
          1                     local_memory_overhead                                                 Local Memory Overhead       0.00%       0.00%       0.00%
          1                        tex_cache_hit_rate                                                Unified Cache Hit Rate      50.00%      50.00%      50.00%
          1                      l2_tex_read_hit_rate                                           L2 Hit Rate (Texture Reads)      45.92%      45.93%      45.92%
          1                     l2_tex_write_hit_rate                                          L2 Hit Rate (Texture Writes)      55.09%      55.54%      55.31%
          1                      tex_cache_throughput                                              Unified Cache Throughput  478.92GB/s  478.93GB/s  478.93GB/s
          1                    l2_tex_read_throughput                                         L2 Throughput (Texture Reads)  538.78GB/s  538.80GB/s  538.79GB/s
          1                   l2_tex_write_throughput                                        L2 Throughput (Texture Writes)  2.0462MB/s  2.0462MB/s  2.0462MB/s
          1                        l2_read_throughput                                                 L2 Throughput (Reads)  538.78GB/s  538.80GB/s  538.79GB/s
          1                       l2_write_throughput                                                L2 Throughput (Writes)  2.0699MB/s  2.0700MB/s  2.0699MB/s
          1                    sysmem_read_throughput                                         System Memory Read Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                   sysmem_write_throughput                                        System Memory Write Throughput  9.3535KB/s  9.3535KB/s  9.3525KB/s
          1                     local_load_throughput                                          Local Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    local_store_throughput                                         Local Memory Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                    shared_load_throughput                                         Shared Memory Load Throughput  57.292MB/s  57.294MB/s  57.293MB/s
          1                   shared_store_throughput                                        Shared Memory Store Throughput  130.95MB/s  130.96MB/s  130.96MB/s
          1                            gld_efficiency                                         Global Memory Load Efficiency      88.89%      88.89%      88.89%
          1                            gst_efficiency                                        Global Memory Store Efficiency      25.00%      25.00%      25.00%
          1                    tex_cache_transactions                                            Unified Cache Transactions   268435462   268435462   268435462
          1                             flop_count_dp                           Floating Point Operations(Double Precision)           0           0           0
          1                         flop_count_dp_add                       Floating Point Operations(Double Precision Add)           0           0           0
          1                         flop_count_dp_fma                       Floating Point Operations(Double Precision FMA)           0           0           0
          1                         flop_count_dp_mul                       Floating Point Operations(Double Precision Mul)           0           0           0
          1                             flop_count_sp                           Floating Point Operations(Single Precision)           0           0           0
          1                         flop_count_sp_add                       Floating Point Operations(Single Precision Add)           0           0           0
          1                         flop_count_sp_fma                       Floating Point Operations(Single Precision FMA)           0           0           0
          1                         flop_count_sp_mul                        Floating Point Operation(Single Precision Mul)           0           0           0
          1                     flop_count_sp_special                   Floating Point Operations(Single Precision Special)           0           0           0
          1                             inst_executed                                                 Instructions Executed   222119213   222119213   222119213
          1                               inst_issued                                                   Instructions Issued   222135122   222135633   222135377
          1                        sysmem_utilization                                             System Memory Utilization     Low (1)     Low (1)     Low (1)
          1                          stall_inst_fetch                              Issue Stall Reasons (Instructions Fetch)       0.39%       0.39%       0.39%
          1                     stall_exec_dependency                            Issue Stall Reasons (Execution Dependency)       8.02%       8.04%       8.03%
          1                   stall_memory_dependency                                    Issue Stall Reasons (Data Request)      24.95%      24.97%      24.96%
          1                             stall_texture                                         Issue Stall Reasons (Texture)      27.00%      27.05%      27.03%
          1                                stall_sync                                 Issue Stall Reasons (Synchronization)       2.24%       2.36%       2.30%
          1                               stall_other                                           Issue Stall Reasons (Other)      37.12%      37.19%      37.16%
          1          stall_constant_memory_dependency                              Issue Stall Reasons (Immediate constant)       0.00%       0.00%       0.00%
          1                           stall_pipe_busy                                       Issue Stall Reasons (Pipe Busy)       0.08%       0.08%       0.08%
          1                         shared_efficiency                                              Shared Memory Efficiency       4.08%       4.08%       4.08%
          1                                inst_fp_32                                               FP Instructions(Single)           0           0           0
          1                                inst_fp_64                                               FP Instructions(Double)           0           0           0
          1                              inst_integer                                                  Integer Instructions  4518082863  4518082863  4518082863
          1                          inst_bit_convert                                              Bit-Convert Instructions           0           0           0
          1                              inst_control                                             Control-Flow Instructions   107949310   107949310   107949310
          1                        inst_compute_ld_st                                               Load/Store Instructions  2147784926  2147784926  2147784926
          1                                 inst_misc                                                     Misc Instructions   328147242   328147242   328147242
          1           inst_inter_thread_communication                                             Inter-Thread Instructions     2867200     2867200     2867200
          1                               issue_slots                                                           Issue Slots   198607801   198608243   198608022
          1                                 cf_issued                                      Issued Control-Flow Instructions     3402572     3402572     3402572
          1                               cf_executed                                    Executed Control-Flow Instructions     3402572     3402572     3402572
          1                               ldst_issued                                        Issued Load/Store Instructions   268584422   268584422   268584422
          1                             ldst_executed                                      Executed Load/Store Instructions    67266790    67266790    67266790
          1                       atomic_transactions                                                   Atomic Transactions           0           0           0
          1           atomic_transactions_per_request                                       Atomic Transactions Per Request    0.000000    0.000000    0.000000
          1                      l2_atomic_throughput                                       L2 Throughput (Atomic requests)  0.00000B/s  0.00000B/s  0.00000B/s
          1                    l2_atomic_transactions                                     L2 Transactions (Atomic requests)           0           0           0
          1                  l2_tex_read_transactions                                       L2 Transactions (Texture Reads)   301989890   301989890   301989890
          1                     stall_memory_throttle                                 Issue Stall Reasons (Memory Throttle)       0.00%       0.00%       0.00%
          1                        stall_not_selected                                    Issue Stall Reasons (Not Selected)       0.06%       0.06%       0.06%
          1                 l2_tex_write_transactions                                      L2 Transactions (Texture Writes)        1120        1120        1120
          1                             flop_count_hp                             Floating Point Operations(Half Precision)           0           0           0
          1                         flop_count_hp_add                         Floating Point Operations(Half Precision Add)           0           0           0
          1                         flop_count_hp_mul                          Floating Point Operation(Half Precision Mul)           0           0           0
          1                         flop_count_hp_fma                         Floating Point Operations(Half Precision FMA)           0           0           0
          1                                inst_fp_16                                                 HP Instructions(Half)           0           0           0
          1                   sysmem_read_utilization                                        System Memory Read Utilization    Idle (0)    Idle (0)    Idle (0)
          1                  sysmem_write_utilization                                       System Memory Write Utilization     Low (1)     Low (1)     Low (1)
          1               pcie_total_data_transmitted                                           PCIe Total Data Transmitted           0         512         256
          1                  pcie_total_data_received                                              PCIe Total Data Received           0           0           0
          1                inst_executed_global_loads                              Warp level instructions for global loads    67108870    67108870    67108870
          1                 inst_executed_local_loads                               Warp level instructions for local loads           0           0           0
          1                inst_executed_shared_loads                              Warp level instructions for shared loads        4480        4480        4480
          1               inst_executed_surface_loads                             Warp level instructions for surface loads           0           0           0
          1               inst_executed_global_stores                             Warp level instructions for global stores        1120        1120        1120
          1                inst_executed_local_stores                              Warp level instructions for local stores           0           0           0
          1               inst_executed_shared_stores                             Warp level instructions for shared stores        8960        8960        8960
          1              inst_executed_surface_stores                            Warp level instructions for surface stores           0           0           0
          1              inst_executed_global_atomics                  Warp level instructions for global atom and atom cas           0           0           0
          1           inst_executed_global_reductions                         Warp level instructions for global reductions           0           0           0
          1             inst_executed_surface_atomics                 Warp level instructions for surface atom and atom cas           0           0           0
          1          inst_executed_surface_reductions                        Warp level instructions for surface reductions           0           0           0
          1              inst_executed_shared_atomics                  Warp level shared instructions for atom and atom CAS           0           0           0
          1                     inst_executed_tex_ops                                   Warp level instructions for texture           0           0           0
          1                      l2_global_load_bytes       Bytes read from L2 for misses in Unified Cache for global loads  9663676480  9663676480  9663676480
          1                       l2_local_load_bytes        Bytes read from L2 for misses in Unified Cache for local loads           0           0           0
          1                     l2_surface_load_bytes      Bytes read from L2 for misses in Unified Cache for surface loads           0           0           0
          1               l2_local_global_store_bytes   Bytes written to L2 from Unified Cache for local and global stores.       35840       35840       35840
          1                 l2_global_reduction_bytes          Bytes written to L2 from Unified cache for global reductions           0           0           0
          1              l2_global_atomic_store_bytes             Bytes written to L2 from Unified cache for global atomics           0           0           0
          1                    l2_surface_store_bytes            Bytes written to L2 from Unified Cache for surface stores.           0           0           0
          1                l2_surface_reduction_bytes         Bytes written to L2 from Unified Cache for surface reductions           0           0           0
          1             l2_surface_atomic_store_bytes    Bytes transferred between Unified Cache and L2 for surface atomics           0           0           0
          1                      global_load_requests              Total number of global load requests from Multiprocessor   268435462   268435462   268435462
          1                       local_load_requests               Total number of local load requests from Multiprocessor           0           0           0
          1                     surface_load_requests             Total number of surface load requests from Multiprocessor           0           0           0
          1                     global_store_requests             Total number of global store requests from Multiprocessor        1120        1120        1120
          1                      local_store_requests              Total number of local store requests from Multiprocessor           0           0           0
          1                    surface_store_requests            Total number of surface store requests from Multiprocessor           0           0           0
          1                    global_atomic_requests            Total number of global atomic requests from Multiprocessor           0           0           0
          1                 global_reduction_requests         Total number of global reduction requests from Multiprocessor           0           0           0
          1                   surface_atomic_requests           Total number of surface atomic requests from Multiprocessor           0           0           0
          1                surface_reduction_requests        Total number of surface reduction requests from Multiprocessor           0           0           0
          1                         sysmem_read_bytes                                              System Memory Read Bytes           0           0           0
          1                        sysmem_write_bytes                                             System Memory Write Bytes         160         160         160
          1                           l2_tex_hit_rate                                                     L2 Cache Hit Rate      45.92%      45.93%      45.92%
          1                     texture_load_requests             Total number of texture Load requests from Multiprocessor           0           0           0
          1                     unique_warps_launched                                              Number of warps launched        8960        8960        8960
          1                             sm_efficiency                                               Multiprocessor Activity      99.89%      99.90%      99.90%
          1                        achieved_occupancy                                                    Achieved Occupancy    0.994128    0.994404    0.994266
          1                                       ipc                                                          Executed IPC    0.259407    0.259439    0.259423
          1                                issued_ipc                                                            Issued IPC    0.259458    0.259723    0.259590
          1                    issue_slot_utilization                                                Issue Slot Utilization       5.80%       5.81%       5.80%
          1                  eligible_warps_per_cycle                                       Eligible Warps Per Active Cycle    0.266970    0.267043    0.267007
          1                           tex_utilization                                             Unified Cache Utilization     Low (2)     Low (2)     Low (2)
          1                            l2_utilization                                                  L2 Cache Utilization     Low (3)     Low (3)     Low (3)
          1                        shared_utilization                                             Shared Memory Utilization     Low (1)     Low (1)     Low (1)
          1                       ldst_fu_utilization                                  Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                         cf_fu_utilization                                Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1                    special_fu_utilization                                     Special Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        tex_fu_utilization                                     Texture Function Unit Utilization     Low (2)     Low (2)     Low (2)
          1           single_precision_fu_utilization                            Single-Precision Function Unit Utilization     Low (1)     Low (1)     Low (1)
          1           double_precision_fu_utilization                            Double-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                        flop_hp_efficiency                                            FLOP Efficiency(Peak Half)       0.00%       0.00%       0.00%
          1                        flop_sp_efficiency                                          FLOP Efficiency(Peak Single)       0.00%       0.00%       0.00%
          1                        flop_dp_efficiency                                          FLOP Efficiency(Peak Double)       0.00%       0.00%       0.00%
          1                    dram_read_transactions                                       Device Memory Read Transactions   163413796   163478281   163446038
          1                   dram_write_transactions                                      Device Memory Write Transactions        1910       23114       12512
          1                      dram_read_throughput                                         Device Memory Read Throughput  291.55GB/s  291.67GB/s  291.61GB/s
          1                     dram_write_throughput                                        Device Memory Write Throughput  3.4894MB/s  42.229MB/s  22.859MB/s
          1                          dram_utilization                                             Device Memory Utilization    High (7)    High (7)    High (7)
          1             half_precision_fu_utilization                              Half-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
          1                          ecc_transactions                                                      ECC Transactions           0           0           0
          1                            ecc_throughput                                                        ECC Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1                           dram_read_bytes                                Total bytes read from DRAM to L2 cache  5229241472  5231304992  5230273232
          1                          dram_write_bytes                             Total bytes written from L2 cache to DRAM       61120      739648      400384

@upsj
Contributor Author

upsj commented Mar 11, 2022

This is definitely an issue specific to the implementation of count_if. With a quick-and-dirty hand-written implementation, I get performance on par with plain count_if for int and somewhere between count_if and unique_count for long.

#include "thrust/unique.h"
#include "thrust/count.h"
#include "thrust/reduce.h"
#include "thrust/device_vector.h"
#include "thrust/iterator/zip_iterator.h"
#include "thrust/tuple.h"
#include "thrust/sequence.h"

#include <chrono>
#include <iostream>


template <typename T>
__global__ __launch_bounds__(1024) void unique_count_raw(const T* data, int size, int* out_counts) {
	const auto tidx = threadIdx.x + blockIdx.x * blockDim.x;
	const auto num_threads = gridDim.x * blockDim.x;
	__shared__ int counts[32];
	int count{};
	const auto lane = tidx % 32;
	for (auto i = tidx; i < size - 1; i += num_threads) {
		count += data[i] != data[i + 1];
	}
	#pragma unroll
	for (int step = 1; step < 32; step *= 2) {
		count += __shfl_xor_sync(0xFFFFFFFFu, count, step);
	}
	if (lane == 0) {
		counts[threadIdx.x / 32] = count;
	}
	__syncthreads();
	if (threadIdx.x >= 32) {
		return;
	}
	count = counts[lane];
	#pragma unroll
	for (int step = 1; step < 32; step *= 2) {
		count += __shfl_xor_sync(0xFFFFFFFFu, count, step);
	}
	if (lane == 0) {
		out_counts[blockIdx.x] = count;
	}
}


__device__ __host__ long get(int a) { return a; }
__device__ __host__ long get(long a) { return a; }
template <typename A, typename B>
__device__ __host__ long get(thrust::tuple<A,B> a) { return thrust::get<0>(a) + thrust::get<1>(a); }

template <typename ForwardIt>
void benchmark(std::string desc, ForwardIt begin, ForwardIt end) {
	thrust::unique_count(begin, end);
	cudaDeviceSynchronize();
	const auto reps = 100;
	auto start = std::chrono::high_resolution_clock::now();
	int count1;
	int count2;
	for (int i = 0; i < reps; i++) {
		count1 = thrust::unique_count(begin, end);
	}
	cudaDeviceSynchronize();
	auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count() / reps;
	cudaDeviceSynchronize();
	start = std::chrono::high_resolution_clock::now();
	for (int i = 0; i < reps; i++) {
		thrust::count_if(begin, end, []__host__ __device__(typename ForwardIt::value_type val) { return get(val) >= 0; });
	}
	cudaDeviceSynchronize();
	auto total2 = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count() / reps;
	cudaDeviceSynchronize();
	start = std::chrono::high_resolution_clock::now();
	thrust::device_vector<int> tmp_storage(1024);
	for (int i = 0; i < reps; i++) {
		unique_count_raw<<<1024, 1024>>>((&begin[0]).get(), end - begin, tmp_storage.data().get());
		count2 = 1 + thrust::reduce(tmp_storage.begin(), tmp_storage.end());
	}
	cudaDeviceSynchronize();
	auto total3 = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count() / reps;
	if (count1 != count2) {
		std::cerr << "Invalid: " << count1 << ',' << count2 << '\n';
	}
	auto memory = sizeof(typename ForwardIt::value_type) * (end - begin) * 1.0;
	std::cout << desc << ',' << (memory / total) << ',' << (memory / total2) << ',' << (memory / total3) << '\n';
}

int main() {
	{
		thrust::device_vector<int> data(1u << 30);
		thrust::sequence(data.begin(), data.end(), 0);
		for (long i = 1; i <= data.size(); i *= 2) {
			benchmark("int," + std::to_string(i), data.begin(), data.begin() + i);
		}
	}
	{
		thrust::device_vector<long> data(1u << 29);
		thrust::sequence(data.begin(), data.end(), 0);
		for (long i = 1; i <= data.size(); i *= 2) {
			benchmark("long," + std::to_string(i), data.begin(), data.begin() + i);
		}
	}
	/*{
		thrust::device_vector<int> data1(1u << 28);
		thrust::device_vector<long> data2(1u << 28);
		thrust::sequence(data1.begin(), data1.end(), 0);
		thrust::sequence(data2.begin(), data2.end(), 0);
		auto it = thrust::make_zip_iterator(thrust::make_tuple(data1.begin(), data2.begin()));
		for (long i = data1.size(); i <= data1.size(); i *= 2) {
			benchmark("pair<int long>," + std::to_string(i), it, it + i);
		}
	}*/
}

@upsj
Contributor Author

upsj commented Mar 13, 2022

After taking a brief look over cub's Reduce implementation, I would guess that it is very well optimized for a particular setting: pointer-like iterators that read exactly as much memory as their value_type and don't need to worry about caching at all. That is true for count_if, but not for unique_count. This imbalance, together with the fact that cub reads multiple entries in each thread at a threads_per_block stride, might lead to elements being evicted from L1 cache earlier than would be ideal.

Collaborator

@ericniebler ericniebler left a comment


Thanks for adding the test! I have a few more suggested changes to the new forward_iterator_wrapper class, and a question for @allisonvacanti.

Review threads on testing/unittest/iterator_helpers.h (outdated, resolved).

ASSERT_EQUAL(div_10_count, 3);
}
DECLARE_INTEGRAL_VECTOR_UNITTEST(TestUniqueCountSimple);
Collaborator


You mean, forward iterators always dispatch to the CPU? @allisonvacanti can you comment on that? I mean, it seems reasonable to me.

Review thread on testing/unittest/iterator_helpers.h (outdated, resolved).
@upsj
Contributor Author

upsj commented Mar 25, 2022

@ericniebler Thanks for the feedback, it's been a while since I've written truly generic code!

@ericniebler
Collaborator

run tests

@jrhemstad
Collaborator

@senior-zero any ideas about why this unique_count implementation is seeing 2x more global loads?

I suspect something is going on with strided accesses as a result of the CUB reduction implementation.


ASSERT_EQUAL(div_10_count, 3);
}
DECLARE_INTEGRAL_VECTOR_UNITTEST(TestUniqueCountSimple);
Collaborator


This should be possible. Thrust packs some extra info into its iterator category. See thrust/iterator/iterator_categories.h for more info.

The iterator wrapper will need to test the base iterator using thrust::iterator_system&lt;BaseIterator&gt;::type, and define its iterator category to be forward_host_iterator_tag if the result is host_system_tag, or forward_device_iterator_tag if the result is device_system_tag.

@gevtushenko
Collaborator

@senior-zero any ideas about why this unique_count implementation is seeing 2x more global loads?

I suspect something is going on with strided accesses as a result of the CUB reduction implementation.

@jrhemstad I've profiled the latest version:

    const int n = 128 * 1024 * 1024;
    thrust::device_vector<std::size_t> a(n);
    thrust::sequence(a.begin(), a.end());
    thrust::unique_count(a.begin(), a.end());

Here are the results:

         expected   actual
int      0.5 GB     0.56 GB
size_t   1 GB       1.09 GB

Am I missing the reproducer that leads to 2x global loads?

@jrhemstad
Collaborator

@senior-zero good question! I haven't tried running it myself. I was going off of @upsj's comment here #1619 (comment)

@upsj are you not seeing the perf regression anymore?

upsj and others added 6 commits April 17, 2022 21:23
Add a counting equivalent to unique_* algorithms
that can be used to allocate the correct amount of data
before actually filling it.

Addresses issue NVIDIA#1612
The interface specifies ForwardIterator,
not InputIterator
* use iterator traits
* use hidden friend operators
* fix member access operator

Co-authored-by: Eric Niebler <eniebler@boost.org>
@upsj
Contributor Author

upsj commented Apr 17, 2022

@senior-zero @jrhemstad I was talking about load instructions executed, not the amount of memory served by DRAM (i.e. gld_transactions, not dram_read_transactions in the nvprof output). The 2x number of global loads is to be expected, because each thread always needs to load two adjacent elements. Let me visualize the issue some more with Nsight Compute results (for int):
[Nsight Compute memory chart for inputs of type int]
I'm comparing the DeviceReduceKernel from unique_count with the hand-written unique_count_raw from #1619 (comment); the rest is your sample code with size 512M instead of 128M.
We see a noticeably worse L1 hit rate and a higher total DRAM access volume compared to the simple hand-written implementation; as for the reasons, I speculated about them in #1619 (comment).
At the same time, this is only on my RTX 2060, where the hand-written kernel gives worse performance overall, probably due to its naive launch dimension setup. On the V100 and Titan X I tested, unique_count_raw beats unique_count consistently and by a clear margin.

@ericniebler
Collaborator

run tests

1 similar comment
@alliepiper
Collaborator

run tests

Collaborator

@alliepiper alliepiper left a comment


Requesting changes to block merging until the situation in https://github.com/NVIDIA/thrust/pull/1619/files#r866368806 is resolved.

@upsj
Contributor Author

upsj commented May 6, 2022

@allisonvacanti I reverted the changes to still have them available in case you need them later. Does that work for you?

@alliepiper alliepiper dismissed their stale review May 6, 2022 16:27

Changes addressed.

@alliepiper
Collaborator

I reverted the changes to still have them available in case you need them later. Does that work for you?

That's perfect, I'll note this in NVIDIA/cccl#679. Thanks!

run tests

@alliepiper
Collaborator

Tests look good, merging. Thanks for the PR!

Labels
P2: nice to have, type: enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add unique_count algorithm
6 participants