-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hash Join Vectorized collision checking #6724
Conversation
indices_right: UInt32Array, | ||
left_arrays: &[ArrayRef], | ||
right_arrays: &[ArrayRef], | ||
null_equals_null: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK null_equals_null
is not supported in the arrow kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed - working on implementing this now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about folding?
pub fn equal_rows_arr(
indices_left: UInt64Array,
indices_right: UInt32Array,
left_arrays: &[ArrayRef],
right_arrays: &[ArrayRef],
null_equals_null: bool,
) -> Result<(UInt64Array, UInt32Array)> {
let mut iter = left_arrays.iter().zip(right_arrays.iter());
let (first_left, first_right) = iter.next().ok_or_else(|| {
DataFusionError::Internal("At least one array should be provided for both left and right".to_string())
})?;
let arr_left = take(first_left.as_ref(), &indices_left, None)?;
let arr_right = take(first_right.as_ref(), &indices_right, None)?;
let mut equal = eq_dyn_null(&arr_left, &arr_right, null_equals_null)?;
// Use map and try_fold to iterate over the remaining pairs of arrays.
// In each iteration, take is used on the pair of arrays and their equality is determined.
// The results are then folded (combined) using the and function to get a final equality result.
equal = iter
.map(|(left, right)| {
let arr_left = take(left.as_ref(), &indices_left, None)?;
let arr_right = take(right.as_ref(), &indices_right, None)?;
eq_dyn_null(arr_left.as_ref(), arr_right.as_ref(), null_equals_null)
})
.try_fold(equal, |acc, res| {
let equal2 = res?;
and(&acc, &equal2)
})?;
// Continue
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, we do not need to take ownership of indices_left
and indices_right
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -1097,6 +1098,49 @@ pub fn equal_rows( | |||
err.unwrap_or(Ok(res)) | |||
} | |||
|
|||
fn eq_dyn_null( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A null element can occur in an Int32Array
as well. Maybe we should insert an option to arrow kernel, in arrow-rs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can open an issue on arrow-rs for this functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a significant improvement in the equality check, nice work. Maybe then the |
LGTM, nice work @Dandandan thanks for the effort. |
Thanks for the feedback! |
Btw I think this can be applied to the symmetric hash join as well without changes. |
Which issue does this PR close?
Closes #50
Rationale for this change
Speed up for hash joins:
What changes are included in this PR?
Changes the collision checking to be vectorized.
Are these changes tested?
Are there any user-facing changes?