Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43693: [C++][Acero] Support AVX2 swiss join decoding #43832

Open
wants to merge 47 commits into
base: main
Choose a base branch
from

Conversation

zanmato1984
Copy link
Contributor

@zanmato1984 zanmato1984 commented Aug 26, 2024

Rationale for this change

You can find the background in #43693.

By looking at how Visit_avx2/VisitNulls_avx2's non-simd counterparts (Visit/VisitNulls) are used, I found they are solely for decoding rows from the build side of the join. So I added AVX2 versions for those decoding methods and wired Visit_avx2/VisitNulls_avx2.

What changes are included in this PR?

  1. Split the decoding methods into smaller pieces to make each of them able to cooperate with the AVX2 version.
  2. Concrete AVX2 specialized functions utilizing the Visit*_avx2 functions to decode fixed-length/offsets/var-length/nulls of the row table.
  3. Fix some bugs in the original Visit*_avx2 functions.
  4. Related benchmarks.

Are these changes tested?

No new tests needed.

The benchmarking result is a bit complicated, I put them in comment #43832 (comment).

Are there any user-facing changes?

No changes other than positive performance improvement. Users can expect such improvement for hash joins related workload. Nevertheless the improvement degree highly depends on not only the workload, but also the CPU models. For Intel CPUs from Skylake to Icelake/Tigerlake, which suffer the performance degradation of AVX2 gather because of an vulnerability mitigation of Intel's (detailed in #43832 (comment)), the improvement is less significant - single digit percent. Other models, e.g. AMD, and the most recent Intel, can achieve better improvement up to 30%.

@zanmato1984
Copy link
Contributor Author

@github-actions crossbow submit -g cpp

@zanmato1984
Copy link
Contributor Author

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Aug 26, 2024

Benchmark runs are scheduled for commit e2af277. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Copy link

Revision: e2af277

Submitted crossbow builds: ursacomputing/crossbow @ actions-1db8e52faf

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

Copy link

Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit e2af277.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@zanmato1984 zanmato1984 changed the title GH-43693: [C++][Acero] Support AVX swiss join decoding GH-43693: [C++][Acero] Support AVX2 swiss join decoding Aug 29, 2024
@zanmato1984
Copy link
Contributor Author

The merge/rebase has now been fixed.

@zanmato1984
Copy link
Contributor Author

Hi @pitrou , do you get sometime to take a look at this? Thank you.

cc @westonpace @felipecrv @mapleFU

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments and questions below. Are we sure these are tested thoroughly enough?

cpp/src/arrow/acero/swiss_join_avx2.cc Show resolved Hide resolved
cpp/src/arrow/acero/swiss_join.cc Outdated Show resolved Hide resolved
Comment on lines +329 to +336
uint64_t* dst = reinterpret_cast<uint64_t*>(
output->mutable_data(1) + num_bytes * (output_start_row + i));
const uint64_t* src = reinterpret_cast<const uint64_t*>(ptr);
for (uint32_t word_id = 0;
word_id < bit_util::CeilDiv(num_bytes, sizeof(uint64_t)); ++word_id) {
arrow::util::SafeStore<uint64_t>(dst + word_id,
arrow::util::SafeLoad(src + word_id));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a crude hand-written memcpy that overshoots the copy length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is.

Comment on lines 365 to 373
uint64_t* dst = reinterpret_cast<uint64_t*>(
output->mutable_data(2) + reinterpret_cast<const uint32_t*>(
output->mutable_data(1))[output_start_row + i]);
const uint64_t* src = reinterpret_cast<const uint64_t*>(ptr);
for (uint32_t word_id = 0;
word_id < bit_util::CeilDiv(num_bytes, sizeof(uint64_t)); ++word_id) {
arrow::util::SafeStore<uint64_t>(dst + word_id,
arrow::util::SafeLoad(src + word_id));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question re memcpy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is too.

const uint32_t* row_ids) const {
RowArrayAccessor::VisitNulls(
rows_, column_id, num_rows_to_append, row_ids, [&](int i, uint8_t value) {
bit_util::SetBitTo(output->mutable_data(0), output_start_row + i, value == 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er, so the convention is that the null bytes in the RowArray store a 1 for a null value, and 0 for a non-null value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

cpp/src/arrow/acero/swiss_join_avx2.cc Show resolved Hide resolved
cpp/src/arrow/acero/swiss_join_avx2.cc Outdated Show resolved Hide resolved
cpp/src/arrow/acero/swiss_join_avx2.cc Show resolved Hide resolved
@@ -194,17 +202,19 @@ int RowArrayAccessor::Visit_avx2(const RowTableImpl& rows, int column_id, int nu
//
const uint8_t* row_ptr_base = rows.data(2);
const RowTableImpl::offset_type* row_offsets = rows.offsets();
auto row_offsets_i64 =
reinterpret_cast<const arrow::util::int64_for_gather_t*>(row_offsets);
for (int i = 0; i < num_rows / unroll; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we take the opportunity to rename all these unroll constants to kUnroll?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

cpp/src/arrow/acero/swiss_join_avx2.cc Show resolved Hide resolved
@zanmato1984
Copy link
Contributor Author

Some comments and questions below. Are we sure these are tested thoroughly enough?

Just a quick answer to this specific question: yes the code changed are in a very common path that almost every swiss join case will run into, and my experience of developing/debugging this feature told me so too. So I'm positive that existing tests exercise them well.

And thank you for the rest of the thorough comments. Will get to them later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants