
ARROW-9605: [C++] Speed up aggregate min/max compute kernels on integer types #7871

Closed · 8 commits

Conversation


@frankdjx frankdjx commented Jul 31, 2020

  1. Use BitBlockCounter to speed up performance on data with a typical 0.01% null probability.
  2. Enable compiler auto-vectorization (SIMD) for the no-nulls path on integer types. Float/Double use fmin/fmax to handle NaN, which the compiler cannot auto-vectorize.
  3. Add test cases covering different null probabilities.
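As a rough illustration of points 1 and 2 (a simplified sketch, not the actual Arrow implementation; the function name, bitmap layout, and block size here are illustrative, and BitBlockCounter's real API differs), the idea is to scan the validity bitmap in fixed-size blocks: an all-valid block takes a tight loop the compiler can auto-vectorize, while a block containing nulls falls back to per-bit checks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Simplified sketch, NOT the actual Arrow code: scan validity one
// 64-bit word at a time. An all-valid word takes the branch-free fast
// path (friendly to auto-vectorization); a word containing nulls falls
// back to checking each validity bit individually.
template <typename T>
std::pair<T, T> MinMaxWithNulls(const std::vector<T>& values,
                                const std::vector<std::uint64_t>& validity) {
  T min = std::numeric_limits<T>::max();
  T max = std::numeric_limits<T>::lowest();
  for (std::size_t word = 0; word * 64 < values.size(); ++word) {
    const std::size_t base = word * 64;
    const std::size_t len = std::min<std::size_t>(64, values.size() - base);
    const std::uint64_t bits = validity[word];
    if (len == 64 && bits == ~std::uint64_t{0}) {
      // No-nulls fast path: simple branch-free loop over the full word.
      for (std::size_t i = 0; i < 64; ++i) {
        const T v = values[base + i];
        min = v < min ? v : min;
        max = v > max ? v : max;
      }
    } else {
      // Slow path: consult the validity bit for each element.
      for (std::size_t i = 0; i < len; ++i) {
        if (bits & (std::uint64_t{1} << i)) {
          const T v = values[base + i];
          min = std::min(min, v);
          max = std::max(max, v);
        }
      }
    }
  }
  return {min, max};
}
```

Arrow's BitBlockCounter reports, per block, how many validity bits are set, which is what lets the kernel choose the fast path when a block is entirely valid; the sketch approximates that with one 64-bit word per block.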

@frankdjx

I can trigger a benchmark action once #7870 gets merged.

Below is the BM data for int types on my setup:

Before:
MinMaxKernelInt8/1048576/10000          847 us          845 us          828 bytes_per_second=1.15586G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt8/1048576/0             43.9 us         43.8 us        15738 bytes_per_second=22.294G/s null_percent=0 size=1048.58k
MinMaxKernelInt16/1048576/10000         429 us          428 us         1637 bytes_per_second=2.28348G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt16/1048576/0            42.4 us         42.4 us        15878 bytes_per_second=23.0572G/s null_percent=0 size=1048.58k
MinMaxKernelInt32/1048576/10000         295 us          294 us         2383 bytes_per_second=3.31751G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0            42.1 us         42.0 us        16620 bytes_per_second=23.2245G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000         112 us          112 us         6309 bytes_per_second=8.70966G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0            82.2 us         82.1 us         8537 bytes_per_second=11.8992G/s null_percent=0 size=1048.58k

After(AVX2):
MinMaxKernelInt8/1048576/10000         92.9 us         92.6 us         7568 bytes_per_second=10.5421G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt8/1048576/0             31.3 us         31.2 us        21832 bytes_per_second=31.2619G/s null_percent=0 size=1048.58k
MinMaxKernelInt16/1048576/10000        60.7 us         60.5 us        11501 bytes_per_second=16.1388G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt16/1048576/0            31.5 us         31.4 us        22316 bytes_per_second=31.1085G/s null_percent=0 size=1048.58k
MinMaxKernelInt32/1048576/10000        51.0 us         50.9 us        13841 bytes_per_second=19.1853G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0            31.8 us         31.7 us        22111 bytes_per_second=30.8189G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000        61.1 us         61.0 us        11610 bytes_per_second=16.016G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0            54.2 us         54.1 us        12935 bytes_per_second=18.0651G/s null_percent=0 size=1048.58k

AVX512:
MinMaxKernelInt32/1048576/10000       40.9 us         40.8 us        17151 bytes_per_second=23.9207G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0           25.6 us         25.6 us        26669 bytes_per_second=38.2196G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000       34.5 us         34.4 us        20137 bytes_per_second=28.396G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0           23.7 us         23.7 us        25949 bytes_per_second=41.2537G/s null_percent=0 size=1048.58k

@frankdjx frankdjx marked this pull request as draft August 5, 2020 08:05
@frankdjx frankdjx marked this pull request as ready for review August 10, 2020 00:53
@frankdjx

Ping. @wesm @pitrou

Could you help review this? It takes a similar approach to the sum kernel: let the compiler vectorize the no-nulls part, and use BitBlockCounter for the 0.01%-null data. #7870 adds the benchmark items for the MinMax kernel.

Thanks.

@ursabot

ursabot commented Aug 13, 2020

no such option: --benchmark_filter

@frankdjx

@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark --benchmark-filter=MinMax

@frankdjx

@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark --benchmark-filter=MinMax

Below are the results for null_percent 0.01% and 0% from https://ci.ursalabs.org/#/builders/73/builds/101

                           benchmark         baseline         contender  change %                                           counters
3     MinMaxKernelInt8/1048576/10000  812.254 MiB/sec     7.952 GiB/sec   902.442  {'run_name': 'MinMaxKernelInt8/1048576/10000',...
31   MinMaxKernelInt16/1048576/10000    1.583 GiB/sec    12.895 GiB/sec   714.512  {'run_name': 'MinMaxKernelInt16/1048576/10000'...
16   MinMaxKernelInt32/1048576/10000    3.152 GiB/sec    16.605 GiB/sec   426.876  {'run_name': 'MinMaxKernelInt32/1048576/10000'...
2        MinMaxKernelInt64/1048576/0    5.289 GiB/sec    11.092 GiB/sec   109.708  {'run_name': 'MinMaxKernelInt64/1048576/0', 'r...
14   MinMaxKernelInt64/1048576/10000    6.222 GiB/sec    10.055 GiB/sec    61.610  {'run_name': 'MinMaxKernelInt64/1048576/10000'...
1        MinMaxKernelInt32/1048576/0   18.103 GiB/sec    26.301 GiB/sec    45.282  {'run_name': 'MinMaxKernelInt32/1048576/0', 'r...
15       MinMaxKernelInt16/1048576/0   18.086 GiB/sec    26.274 GiB/sec    45.269  {'run_name': 'MinMaxKernelInt16/1048576/0', 'r...
7         MinMaxKernelInt8/1048576/0   18.112 GiB/sec    26.210 GiB/sec    44.708  {'run_name': 'MinMaxKernelInt8/1048576/0', 'ru...
26  MinMaxKernelDouble/1048576/10000    1.063 GiB/sec     1.315 GiB/sec    23.759  {'run_name': 'MinMaxKernelDouble/1048576/10000...
23   MinMaxKernelFloat/1048576/10000  551.756 MiB/sec   674.455 MiB/sec    22.238  {'run_name': 'MinMaxKernelFloat/1048576/10000'...
0       MinMaxKernelDouble/1048576/0    1.205 GiB/sec     1.332 GiB/sec    10.600  {'run_name': 'MinMaxKernelDouble/1048576/0', '...
12       MinMaxKernelFloat/1048576/0  621.824 MiB/sec   607.146 MiB/sec    -2.361  {'run_name': 'MinMaxKernelFloat/1048576/0', 'r...

@pitrou

pitrou commented Aug 25, 2020

@jianxind Sorry for the delay. Could you please rebase this PR? It looks like there are some conflicts now.

@frankdjx

> @jianxind Sorry for the delay. Could you please rebase this PR? It looks like there are some conflicts now.

No problem at all. Rebased now. Thanks.


@pitrou pitrou left a comment


Thanks for the updates. A few comments remain.

Review threads on:
- cpp/src/arrow/compute/api_aggregate.h
- cpp/src/arrow/compute/kernels/aggregate_test.cc (outdated)
@pitrou

pitrou commented Sep 2, 2020

Passing ARROW_USER_SIMD_LEVEL=none doesn't seem to impact the results. Is something amiss?

@frankdjx

frankdjx commented Sep 2, 2020

> ARROW_USER_SIMD_LEVEL=none

Below are the commands I used; compiler vectorization happens only on the integer types.

ARROW_USER_SIMD_LEVEL=avx2 ./release/arrow-compute-aggregate-benchmark --benchmark_filter=MinMaxKernelInt64
ARROW_USER_SIMD_LEVEL=none ./release/arrow-compute-aggregate-benchmark --benchmark_filter=MinMaxKernelInt64

@pitrou

pitrou commented Sep 2, 2020

Ah, I also had -DARROW_SIMD_LEVEL=AVX2 in CMake. Without it I do see a difference.

frankdjx and others added 8 commits September 2, 2020 13:21
Signed-off-by: Frank Du <frank.du@intel.com>
@pitrou

pitrou commented Sep 2, 2020

Rebased.


@pitrou pitrou left a comment


+1


@pitrou

pitrou commented Sep 2, 2020

Test failures are unrelated, will merge.

@pitrou pitrou closed this in 5a3291c Sep 2, 2020
@frankdjx frankdjx deleted the kernel_min_max branch September 3, 2020 00:43
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
ARROW-9605: [C++] Speed up aggregate min/max compute kernels on integer types

1. Use BitBlockCounter to speed up performance on data with a typical 0.01% null probability.
2. Enable compiler auto-vectorization (SIMD) for the no-nulls path on integer types. Float/Double use fmin/fmax to handle NaN, which the compiler cannot auto-vectorize.
3. Add test cases covering different null probabilities.

Closes apache#7871 from jianxind/kernel_min_max

Lead-authored-by: Frank Du <frank.du@intel.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
@felipecrv

> Ah, I also had -DARROW_SIMD_LEVEL=AVX2 in CMake. Without it I do see a difference.

I came here after looking at the code and was confused. It sounds like there was never a need to instantiate templates with a SimdLevel parameter, given that the SIMD comes from compiler auto-vectorization rather than the kernel code doing anything different.

I might be wrong; in that case, I would love a pointer to the specialized code.

@pitrou

pitrou commented Aug 15, 2024

@felipecrv I may be misunderstanding your question, but the SimdLevel template parameter is used to disambiguate against ODR issues (otherwise you would get the same function multiple times with different generated code, due to different compiler options).
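This disambiguation pattern can be sketched as follows (a hypothetical simplification, not Arrow's actual code; `MinOf` and the exact enum shape are illustrative): a non-type template parameter tags each instantiation, so the source is identical for every level but the mangled symbols are distinct.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: the tag does not change the source at all, but
// it makes each instantiation a distinct symbol. A translation unit
// compiled with -mavx2 and one compiled without it therefore never
// define the *same* symbol with different machine code, which would
// violate the one-definition rule.
enum class SimdLevel { NONE, AVX2, AVX512 };

template <SimdLevel Level, typename T>
T MinOf(const T* data, std::size_t n) {
  T m = data[0];
  for (std::size_t i = 1; i < n; ++i) {
    m = data[i] < m ? data[i] : m;  // simple loop the compiler can vectorize
  }
  return m;
}
// A scalar TU would instantiate MinOf<SimdLevel::NONE, T>; an AVX2 TU,
// built with -mavx2, would instantiate MinOf<SimdLevel::AVX2, T>.
```

The generated code differs only because the two translation units are compiled with different flags, not because the template body branches on `Level`.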

@felipecrv

> @felipecrv I may be misunderstanding your question, but the SimdLevel template parameter is used as a disambiguation against ODR issues (you would get multiple times the same function with a different generated code due to different compiler options).

I understand that, but if no specialization exists for different SIMD levels, we don't need more than the SimdLevel::NONE variation of the kernel. There is no need to populate the kernel array for SimdLevel::AVX2 or AVX512 if we are not performing any sort of runtime dispatching based on runtime CPU capability checks.

@pitrou

pitrou commented Aug 15, 2024

> I understand that, but if no specialization exists for different SIMD levels, we don't need more than the SimdLevel::NONE variation of the kernel.

The source code is usually the same for all variations, but the generated code (which matters for ODR) varies thanks to different compiler options.

macro(append_runtime_avx512_src SRCS SRC)
  if(ARROW_HAVE_RUNTIME_AVX512)
    list(APPEND ${SRCS} ${SRC})
    set_source_files_properties(${SRC} PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
    set_source_files_properties(${SRC} PROPERTIES COMPILE_FLAGS ${ARROW_AVX512_FLAG})
  endif()
endmacro()

> There is no need to populate the kernel array for SimdLevel::AVX2 or AVX512 if we are not performing any sort of runtime dispatching based on runtime CPU capability checks.

We do:

// Dispatch as the CPU feature
#if defined(ARROW_HAVE_RUNTIME_AVX512) || defined(ARROW_HAVE_RUNTIME_AVX2)
  auto cpu_info = arrow::internal::CpuInfo::GetInstance();
#endif
#if defined(ARROW_HAVE_RUNTIME_AVX512)
  if (cpu_info->IsSupported(arrow::internal::CpuInfo::AVX512)) {
    if (kernel_matches[SimdLevel::AVX512]) {
      return kernel_matches[SimdLevel::AVX512];
    }
  }
#endif
#if defined(ARROW_HAVE_RUNTIME_AVX2)
  if (cpu_info->IsSupported(arrow::internal::CpuInfo::AVX2)) {
    if (kernel_matches[SimdLevel::AVX2]) {
      return kernel_matches[SimdLevel::AVX2];
    }
  }
#endif

@felipecrv

> The source code is usually the same for all variations, but the generated code (which matters for ODR) varies thanks to different compiler options.

OK, now I get it. The compiler options are source-file specific, not global to the entire build.

@pitrou

pitrou commented Aug 15, 2024

Right :-) I agree it's a bit difficult to follow.

pitrou added a commit that referenced this pull request Sep 3, 2024
…e same code in different compilation units (#43720)

### Rationale for this change

More than once I've been confused about how the `SimdLevel` template parameters on these kernel classes affect dispatching of kernels based on SIMD support detection at runtime [1] given that nothing in the code changes based on the parameters.

What matters is the compilation unit in which the templates are instantiated. Different compilation units get different compilation parameters. The SimdLevel parameters don't really affect the code that gets generated (!), they only serve as a way to avoid duplication of symbols in the compiled objects.

This PR organizes the code to make this more explicit.

[1] #7871 (comment)

### What changes are included in this PR?

 - Introduction of aggregate_basic-inl.h
 - Moving of the impls in `aggregate_basic-inl.h` to an anonymous namespace
 - Grouping of code based on the function they implement (`Sum`, `Mean`, and `MinMax`)
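The anonymous-namespace part of the change can be approximated in a sketch (a single-file stand-in for the multi-TU pattern; `SumImpl` and the surrounding comments are illustrative, not the real Arrow implementation):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the aggregate_basic-inl.h idea: implementations live in an
// anonymous namespace and therefore get internal linkage. Every .cc
// file that includes the -inl.h header owns its own private copies, so
// TUs compiled with different SIMD flags cannot clash on a shared
// symbol even though they compile the very same source.
namespace {

template <typename T>
T SumImpl(const T* data, std::size_t n) {
  T total = T{};
  for (std::size_t i = 0; i < n; ++i) {
    total += data[i];  // plain loop; per-TU compiler flags decide the codegen
  }
  return total;
}

}  // namespace
```

With internal linkage, the ODR question never arises across translation units, which is what makes the per-file compile flags safe.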

### Are these changes tested?

By the compilation process, existing tests, and benchmarks.

* GitHub Issue: #43719

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Sep 6, 2024
…rom the same code in different compilation units (apache#43720)
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…rom the same code in different compilation units (apache#43720)