GH-43693: [C++][Acero] Support AVX2 swiss join decoding #43832
Conversation
@github-actions crossbow submit -g cpp
@ursabot please benchmark
Benchmark runs are scheduled for commit e2af277. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Revision: e2af277 Submitted crossbow builds: ursacomputing/crossbow @ actions-1db8e52faf
Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit e2af277. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
The merge/rebase has now been fixed.
Hi @pitrou, did you get some time to take a look at this? Thank you.
Some comments and questions below. Are we sure these are tested thoroughly enough?
uint64_t* dst = reinterpret_cast<uint64_t*>(
    output->mutable_data(1) + num_bytes * (output_start_row + i));
const uint64_t* src = reinterpret_cast<const uint64_t*>(ptr);
for (uint32_t word_id = 0;
     word_id < bit_util::CeilDiv(num_bytes, sizeof(uint64_t)); ++word_id) {
  arrow::util::SafeStore<uint64_t>(dst + word_id,
                                   arrow::util::SafeLoad(src + word_id));
}
So this is a crude hand-written memcpy that overshoots the copy length?
Yes, it is.
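For context on the overshooting: below is a minimal, self-contained sketch of the same pattern (not the Arrow implementation; the function name is made up). It assumes both source and destination buffers are padded so that writing and reading past num_bytes, up to the next 8-byte boundary, is safe.

```cpp
#include <cstdint>
#include <cstring>

// Word-wise copy: the length is rounded up to whole 64-bit words, so the copy
// may touch up to 7 bytes past `num_bytes`. This is only safe because the
// buffers are assumed to have enough trailing padding.
void CopyWordsOvershooting(uint8_t* dst, const uint8_t* src, uint32_t num_bytes) {
  const size_t num_words = (num_bytes + sizeof(uint64_t) - 1) / sizeof(uint64_t);
  for (size_t word_id = 0; word_id < num_words; ++word_id) {
    uint64_t word;
    // Per-word memcpy sidesteps unaligned-access and aliasing UB, which is the
    // role arrow::util::SafeLoad/SafeStore play in the snippet above.
    std::memcpy(&word, src + word_id * sizeof(uint64_t), sizeof(uint64_t));
    std::memcpy(dst + word_id * sizeof(uint64_t), &word, sizeof(uint64_t));
  }
}
```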
cpp/src/arrow/acero/swiss_join.cc
uint64_t* dst = reinterpret_cast<uint64_t*>(
    output->mutable_data(2) + reinterpret_cast<const uint32_t*>(
        output->mutable_data(1))[output_start_row + i]);
const uint64_t* src = reinterpret_cast<const uint64_t*>(ptr);
for (uint32_t word_id = 0;
     word_id < bit_util::CeilDiv(num_bytes, sizeof(uint64_t)); ++word_id) {
  arrow::util::SafeStore<uint64_t>(dst + word_id,
                                   arrow::util::SafeLoad(src + word_id));
}
Same question re memcpy.
It is, too.
const uint32_t* row_ids) const {
  RowArrayAccessor::VisitNulls(
      rows_, column_id, num_rows_to_append, row_ids, [&](int i, uint8_t value) {
        bit_util::SetBitTo(output->mutable_data(0), output_start_row + i, value == 0);
Er, so the convention is that the null bytes in the RowArray store a 1 for a null value, and 0 for a non-null value?
Right.
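To make the convention concrete, here is a tiny illustrative sketch (the helper names are hypothetical, not Arrow's API) of how a row-table null byte (1 = null) is inverted into an Arrow-style validity bit (1 = valid), which is exactly what the `value == 0` in the snippet above does.

```cpp
#include <cstdint>

// Set bit `i` of a validity bitmap to `bit` (1 = valid); similar in spirit to
// arrow::bit_util::SetBitTo.
inline void SetValidityBit(uint8_t* validity, int64_t i, bool bit) {
  const uint8_t mask = static_cast<uint8_t>(1u << (i % 8));
  if (bit) {
    validity[i / 8] |= mask;
  } else {
    validity[i / 8] &= static_cast<uint8_t>(~mask);
  }
}

// The row table stores 1 for null, so the validity bit is the inverse.
void DecodeNullByte(uint8_t* validity, int64_t row, uint8_t null_byte) {
  SetValidityBit(validity, row, null_byte == 0);
}
```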
@@ -194,17 +202,19 @@ int RowArrayAccessor::Visit_avx2(const RowTableImpl& rows, int column_id, int nu
//
const uint8_t* row_ptr_base = rows.data(2);
const RowTableImpl::offset_type* row_offsets = rows.offsets();
auto row_offsets_i64 =
    reinterpret_cast<const arrow::util::int64_for_gather_t*>(row_offsets);
for (int i = 0; i < num_rows / unroll; ++i) {
Can we take the opportunity to rename all these `unroll` constants to `kUnroll`?
Done.
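For readers following along, below is a minimal sketch of why the offsets pointer gets reinterpreted before the gather and what the unroll constant of 4 means. It assumes AVX2 is available and that offsets are stored as 64-bit integers; the function name and surrounding types are illustrative, not the actual Arrow code.

```cpp
#include <immintrin.h>  // requires compiling with AVX2 enabled (e.g. -mavx2)
#include <cstdint>

constexpr int kUnroll = 4;  // one AVX2 gather produces 4 x 64-bit offsets

// Gather row offsets for a list of 32-bit row ids. _mm256_i32gather_epi64
// takes a `long long const*` base pointer, which is why code like the hunk
// above reinterprets the offsets pointer (arrow::util::int64_for_gather_t
// serves that purpose in Arrow).
void GatherRowOffsets(const int64_t* row_offsets, const uint32_t* row_ids,
                      int64_t* out, int num_rows) {
  const auto* base = reinterpret_cast<const long long*>(row_offsets);
  int i = 0;
  for (; i + kUnroll <= num_rows; i += kUnroll) {
    // Load 4 x 32-bit row ids, then gather the 4 corresponding 64-bit offsets.
    __m128i ids = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row_ids + i));
    __m256i offsets = _mm256_i32gather_epi64(base, ids, /*scale=*/8);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), offsets);
  }
  for (; i < num_rows; ++i) {
    out[i] = row_offsets[row_ids[i]];  // scalar tail
  }
}
```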
Just a quick answer to this specific question: yes, the changed code is in a very common path that almost every swiss join case will run into, and my experience developing/debugging this feature tells me the same. So I'm positive that the existing tests exercise it well. And thank you for the rest of the thorough comments; I will get to them later.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Rationale for this change
You can find the background in #43693.
By looking at how the non-SIMD counterparts (`Visit`/`VisitNulls`) of `Visit_avx2`/`VisitNulls_avx2` are used, I found they are used solely for decoding rows from the build side of the join. So I added AVX2 versions of those decoding methods and wired up `Visit_avx2`/`VisitNulls_avx2`.
What changes are included in this PR?
- Add `Visit*_avx2` functions to decode the fixed-length/offsets/var-length/null parts of the row table.
- Wire up the `Visit*_avx2` functions.
Are these changes tested?
No new tests needed.
The benchmarking results are a bit complicated, so I put them in comment #43832 (comment).
Are there any user-facing changes?
No changes other than a positive performance improvement. Users can expect this improvement for hash-join-related workloads. Nevertheless, the degree of improvement depends heavily not only on the workload but also on the CPU model. For Intel CPUs from Skylake to Ice Lake/Tiger Lake, which suffer degraded AVX2 gather performance because of one of Intel's vulnerability mitigations (detailed in #43832 (comment)), the improvement is less significant: single-digit percentages. Other models, e.g. AMD and the most recent Intel CPUs, can see improvements of up to 30%.