
ARROW-9605: [C++] Speed up aggregate min/max compute kernels on integer types #7871

Closed · 8 commits

Conversation


@frankdjx frankdjx commented Jul 31, 2020

  1. Use BitBlockCounter to speed up performance on data with a typical 0.01% null probability.
  2. Enable compiler auto-vectorization (SIMD) for the no-nulls path on integer types. Float/Double use fmin/fmax to handle NaN, which the compiler cannot auto-vectorize.
  3. Add test cases covering different null probabilities.
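As a rough illustration of points 1 and 2 (a simplified sketch, not the actual Arrow implementation; the function name, bitmap layout, and block size here are illustrative, and BitBlockCounter's real API differs), the idea is to scan the validity bitmap in fixed-size blocks: an all-valid block takes a tight loop the compiler can auto-vectorize, while a block containing nulls falls back to per-bit checks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Simplified sketch, NOT the actual Arrow code: scan validity one
// 64-bit word at a time. An all-valid word takes the branch-free fast
// path (friendly to auto-vectorization); a word containing nulls falls
// back to checking each validity bit individually.
template <typename T>
std::pair<T, T> MinMaxWithNulls(const std::vector<T>& values,
                                const std::vector<std::uint64_t>& validity) {
  T min = std::numeric_limits<T>::max();
  T max = std::numeric_limits<T>::lowest();
  for (std::size_t word = 0; word * 64 < values.size(); ++word) {
    const std::size_t base = word * 64;
    const std::size_t len = std::min<std::size_t>(64, values.size() - base);
    const std::uint64_t bits = validity[word];
    if (len == 64 && bits == ~std::uint64_t{0}) {
      // No-nulls fast path: simple branch-free loop over the full word.
      for (std::size_t i = 0; i < 64; ++i) {
        const T v = values[base + i];
        min = v < min ? v : min;
        max = v > max ? v : max;
      }
    } else {
      // Slow path: consult the validity bit for each element.
      for (std::size_t i = 0; i < len; ++i) {
        if (bits & (std::uint64_t{1} << i)) {
          const T v = values[base + i];
          min = std::min(min, v);
          max = std::max(max, v);
        }
      }
    }
  }
  return {min, max};
}
```

Arrow's BitBlockCounter reports, per block, how many validity bits are set, which is what lets the kernel choose the fast path when a block is entirely valid; the sketch approximates that with one 64-bit word per block.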

@frankdjx

I can trigger a benchmark action once #7870 gets merged.

Below is the BM data for int types on my setup:

Before:
MinMaxKernelInt8/1048576/10000          847 us          845 us          828 bytes_per_second=1.15586G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt8/1048576/0             43.9 us         43.8 us        15738 bytes_per_second=22.294G/s null_percent=0 size=1048.58k
MinMaxKernelInt16/1048576/10000         429 us          428 us         1637 bytes_per_second=2.28348G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt16/1048576/0            42.4 us         42.4 us        15878 bytes_per_second=23.0572G/s null_percent=0 size=1048.58k
MinMaxKernelInt32/1048576/10000         295 us          294 us         2383 bytes_per_second=3.31751G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0            42.1 us         42.0 us        16620 bytes_per_second=23.2245G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000         112 us          112 us         6309 bytes_per_second=8.70966G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0            82.2 us         82.1 us         8537 bytes_per_second=11.8992G/s null_percent=0 size=1048.58k

After(AVX2):
MinMaxKernelInt8/1048576/10000         92.9 us         92.6 us         7568 bytes_per_second=10.5421G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt8/1048576/0             31.3 us         31.2 us        21832 bytes_per_second=31.2619G/s null_percent=0 size=1048.58k
MinMaxKernelInt16/1048576/10000        60.7 us         60.5 us        11501 bytes_per_second=16.1388G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt16/1048576/0            31.5 us         31.4 us        22316 bytes_per_second=31.1085G/s null_percent=0 size=1048.58k
MinMaxKernelInt32/1048576/10000        51.0 us         50.9 us        13841 bytes_per_second=19.1853G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0            31.8 us         31.7 us        22111 bytes_per_second=30.8189G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000        61.1 us         61.0 us        11610 bytes_per_second=16.016G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0            54.2 us         54.1 us        12935 bytes_per_second=18.0651G/s null_percent=0 size=1048.58k

AVX512:
MinMaxKernelInt32/1048576/10000       40.9 us         40.8 us        17151 bytes_per_second=23.9207G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/0           25.6 us         25.6 us        26669 bytes_per_second=38.2196G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000       34.5 us         34.4 us        20137 bytes_per_second=28.396G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/0           23.7 us         23.7 us        25949 bytes_per_second=41.2537G/s null_percent=0 size=1048.58k

@frankdjx frankdjx marked this pull request as draft August 5, 2020 08:05
@frankdjx frankdjx marked this pull request as ready for review August 10, 2020 00:53
@frankdjx

Ping. @wesm @pitrou

Could you help review this? It takes a similar approach to the sum kernel: let the compiler vectorize the no-nulls part, and use BitBlockCounter for the 0.01%-null data. #7870 adds the benchmark items for the MinMax kernel.

Thanks.

@ursabot

ursabot commented Aug 13, 2020

no such option: --benchmark_filter

@frankdjx

@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark --benchmark-filter=MinMax

@frankdjx

@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark --benchmark-filter=MinMax

Below are the results for null_percent 0.01% and 0% from https://ci.ursalabs.org/#/builders/73/builds/101

                           benchmark         baseline         contender  change %                                           counters
3     MinMaxKernelInt8/1048576/10000  812.254 MiB/sec     7.952 GiB/sec   902.442  {'run_name': 'MinMaxKernelInt8/1048576/10000',...
31   MinMaxKernelInt16/1048576/10000    1.583 GiB/sec    12.895 GiB/sec   714.512  {'run_name': 'MinMaxKernelInt16/1048576/10000'...
16   MinMaxKernelInt32/1048576/10000    3.152 GiB/sec    16.605 GiB/sec   426.876  {'run_name': 'MinMaxKernelInt32/1048576/10000'...
2        MinMaxKernelInt64/1048576/0    5.289 GiB/sec    11.092 GiB/sec   109.708  {'run_name': 'MinMaxKernelInt64/1048576/0', 'r...
14   MinMaxKernelInt64/1048576/10000    6.222 GiB/sec    10.055 GiB/sec    61.610  {'run_name': 'MinMaxKernelInt64/1048576/10000'...
1        MinMaxKernelInt32/1048576/0   18.103 GiB/sec    26.301 GiB/sec    45.282  {'run_name': 'MinMaxKernelInt32/1048576/0', 'r...
15       MinMaxKernelInt16/1048576/0   18.086 GiB/sec    26.274 GiB/sec    45.269  {'run_name': 'MinMaxKernelInt16/1048576/0', 'r...
7         MinMaxKernelInt8/1048576/0   18.112 GiB/sec    26.210 GiB/sec    44.708  {'run_name': 'MinMaxKernelInt8/1048576/0', 'ru...
26  MinMaxKernelDouble/1048576/10000    1.063 GiB/sec     1.315 GiB/sec    23.759  {'run_name': 'MinMaxKernelDouble/1048576/10000...
23   MinMaxKernelFloat/1048576/10000  551.756 MiB/sec   674.455 MiB/sec    22.238  {'run_name': 'MinMaxKernelFloat/1048576/10000'...
0       MinMaxKernelDouble/1048576/0    1.205 GiB/sec     1.332 GiB/sec    10.600  {'run_name': 'MinMaxKernelDouble/1048576/0', '...
12       MinMaxKernelFloat/1048576/0  621.824 MiB/sec   607.146 MiB/sec    -2.361  {'run_name': 'MinMaxKernelFloat/1048576/0', 'r...

@pitrou

pitrou commented Aug 25, 2020

@jianxind Sorry for the delay. Could you please rebase this PR? It looks like there are some conflicts now.

@frankdjx

> @jianxind Sorry for the delay. Could you please rebase this PR? It looks like there are some conflicts now.

No problem at all. Rebased now. Thanks.


@pitrou pitrou left a comment


Thanks for the updates. A few comments remain.

Review threads on:
- cpp/src/arrow/compute/api_aggregate.h
- cpp/src/arrow/compute/kernels/aggregate_test.cc (outdated)
@pitrou

pitrou commented Sep 2, 2020

Passing ARROW_USER_SIMD_LEVEL=none doesn't seem to impact the results. Is something amiss?

@frankdjx

frankdjx commented Sep 2, 2020

> ARROW_USER_SIMD_LEVEL=none

Below are the commands I used; compiler vectorization happens only on the integer types.

ARROW_USER_SIMD_LEVEL=avx2 ./release/arrow-compute-aggregate-benchmark --benchmark_filter=MinMaxKernelInt64
ARROW_USER_SIMD_LEVEL=none ./release/arrow-compute-aggregate-benchmark --benchmark_filter=MinMaxKernelInt64

@pitrou

pitrou commented Sep 2, 2020

Ah, I also had -DARROW_SIMD_LEVEL=AVX2 in CMake. Without it I do see a difference.

frankdjx and others added 8 commits September 2, 2020 13:21
Signed-off-by: Frank Du <frank.du@intel.com>
@pitrou

pitrou commented Sep 2, 2020

Rebased.


@pitrou pitrou left a comment


+1


@pitrou

pitrou commented Sep 2, 2020

Test failures are unrelated, will merge.

@pitrou pitrou closed this in 5a3291c Sep 2, 2020
@frankdjx frankdjx deleted the kernel_min_max branch September 3, 2020 00:43
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
ARROW-9605: [C++] Speed up aggregate min/max compute kernels on integer types

1. Use BitBlockCounter to speed up performance on data with a typical 0.01% null probability.
2. Enable compiler auto-vectorization (SIMD) for the no-nulls path on integer types. Float/Double use fmin/fmax to handle NaN, which the compiler cannot auto-vectorize.
3. Add test cases covering different null probabilities.

Closes apache#7871 from jianxind/kernel_min_max

Lead-authored-by: Frank Du <frank.du@intel.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
@felipecrv

> Ah, I also had -DARROW_SIMD_LEVEL=AVX2 in CMake. Without it I do see a difference.

I came here after looking at the code and was confused. It sounds like there was never a need to instantiate templates with a SimdLevel parameter, given that the SIMD comes from compiler auto-vectorization rather than the kernel code doing anything different.

I might be wrong; in that case, I would love a pointer to the specialized code.

@pitrou

pitrou commented Aug 15, 2024

@felipecrv I may be misunderstanding your question, but the SimdLevel template parameter is used to disambiguate against ODR issues (otherwise you would get the same function multiple times with different generated code, due to different compiler options).
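This disambiguation pattern can be sketched as follows (a hypothetical simplification, not Arrow's actual code; `MinOf` and the exact enum shape are illustrative): a non-type template parameter tags each instantiation, so the source is identical for every level but the mangled symbols are distinct.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: the tag does not change the source at all, but
// it makes each instantiation a distinct symbol. A translation unit
// compiled with -mavx2 and one compiled without it therefore never
// define the *same* symbol with different machine code, which would
// violate the one-definition rule.
enum class SimdLevel { NONE, AVX2, AVX512 };

template <SimdLevel Level, typename T>
T MinOf(const T* data, std::size_t n) {
  T m = data[0];
  for (std::size_t i = 1; i < n; ++i) {
    m = data[i] < m ? data[i] : m;  // simple loop the compiler can vectorize
  }
  return m;
}
// A scalar TU would instantiate MinOf<SimdLevel::NONE, T>; an AVX2 TU,
// built with -mavx2, would instantiate MinOf<SimdLevel::AVX2, T>.
```

The generated code differs only because the two translation units are compiled with different flags, not because the template body branches on `Level`.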

@felipecrv

> @felipecrv I may be misunderstanding your question, but the SimdLevel template parameter is used as a disambiguation against ODR issues (you would get multiple times the same function with a different generated code due to different compiler options).

I understand that, but if no specialization exists for different SIMD levels, we don't need more than the SimdLevel::NONE variation of the kernel. There is no need to populate the kernel array for SimdLevel::AVX2 or AVX512 if we are not performing any sort of runtime dispatching based on runtime CPU capability checks.

@pitrou

pitrou commented Aug 15, 2024

> I understand that, but if no specialization exists for different SIMD levels, we don't need more than the SimdLevel::NONE variation of the kernel.

The source code is usually the same for all variations, but the generated code (which matters for ODR) varies thanks to different compiler options.

macro(append_runtime_avx512_src SRCS SRC)
  if(ARROW_HAVE_RUNTIME_AVX512)
    list(APPEND ${SRCS} ${SRC})
    set_source_files_properties(${SRC} PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
    set_source_files_properties(${SRC} PROPERTIES COMPILE_FLAGS ${ARROW_AVX512_FLAG})
  endif()
endmacro()

> There is no need to populate the kernel array for SimdLevel::AVX2 or AVX512 if we are not performing any sort of runtime dispatching based on runtime CPU capability checks.

We do:

// Dispatch as the CPU feature
#if defined(ARROW_HAVE_RUNTIME_AVX512) || defined(ARROW_HAVE_RUNTIME_AVX2)
  auto cpu_info = arrow::internal::CpuInfo::GetInstance();
#endif
#if defined(ARROW_HAVE_RUNTIME_AVX512)
  if (cpu_info->IsSupported(arrow::internal::CpuInfo::AVX512)) {
    if (kernel_matches[SimdLevel::AVX512]) {
      return kernel_matches[SimdLevel::AVX512];
    }
  }
#endif
#if defined(ARROW_HAVE_RUNTIME_AVX2)
  if (cpu_info->IsSupported(arrow::internal::CpuInfo::AVX2)) {
    if (kernel_matches[SimdLevel::AVX2]) {
      return kernel_matches[SimdLevel::AVX2];
    }
  }
#endif

@felipecrv

> The source code is usually the same for all variations, but the generated code (which matters for ODR) varies thanks to different compiler options.

OK, now I get it. The compiler options are source-file specific, not global to the entire build.

@pitrou

pitrou commented Aug 15, 2024

Right :-) I agree it's a bit difficult to follow.

pitrou added a commit that referenced this pull request Sep 3, 2024
…e same code in different compilation units (#43720)

### Rationale for this change

More than once I've been confused about how the `SimdLevel` template parameters on these kernel classes affect dispatching of kernels based on SIMD support detection at runtime [1] given that nothing in the code changes based on the parameters.

What matters is the compilation unit in which the templates are instantiated. Different compilation units get different compilation parameters. The SimdLevel parameters don't really affect the code that gets generated (!), they only serve as a way to avoid duplication of symbols in the compiled objects.

This PR organizes the code to make this more explicit.

[1] #7871 (comment)

### What changes are included in this PR?

 - Introduction of aggregate_basic-inl.h
 - Moving of the impls in `aggregate_basic-inl.h` to an anonymous namespace
 - Grouping of code based on the function they implement (`Sum`, `Mean`, and `MinMax`)
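The anonymous-namespace part of the change can be approximated in a sketch (a single-file stand-in for the multi-TU pattern; `SumImpl` and the surrounding comments are illustrative, not the real Arrow implementation):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the aggregate_basic-inl.h idea: implementations live in an
// anonymous namespace and therefore get internal linkage. Every .cc
// file that includes the -inl.h header owns its own private copies, so
// TUs compiled with different SIMD flags cannot clash on a shared
// symbol even though they compile the very same source.
namespace {

template <typename T>
T SumImpl(const T* data, std::size_t n) {
  T total = T{};
  for (std::size_t i = 0; i < n; ++i) {
    total += data[i];  // plain loop; per-TU compiler flags decide the codegen
  }
  return total;
}

}  // namespace
```

With internal linkage, the ODR question never arises across translation units, which is what makes the per-file compile flags safe.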

### Are these changes tested?

By the compilation process, existing tests, and benchmarks.

* GitHub Issue: #43719

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Sep 6, 2024
…rom the same code in different compilation units (apache#43720)
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…rom the same code in different compilation units (apache#43720)