perf: Remove redundant copying of batches after FilterExec #835
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #835      +/-   ##
============================================
+ Coverage     33.83%   33.95%   +0.12%
- Complexity      870      880      +10
============================================
  Files           112      112
  Lines         42970    42993      +23
  Branches       9466     9473       +7
============================================
+ Hits          14538    14598      +60
+ Misses        25446    25404      -42
- Partials       2986     2991       +5

☔ View full report in Codecov by Sentry.
// BEGIN Comet changes
pub fn filter_record_batch(
    record_batch: &RecordBatch,
    predicate: &BooleanArray,
) -> std::result::Result<RecordBatch, ArrowError> {
    // turn predicate into selection vector
    let mut sv = Int32Builder::with_capacity(predicate.true_count());
    for i in 0..predicate.len() {
        if !predicate.is_null(i) && predicate.value(i) {
            sv.append_value(i as i32);
        }
    }
    let sv = sv.finish();
    // note that this does not unpack dictionary-encoded arrays
    take_record_batch(record_batch, &sv)
}
// END Comet changes
This is the modification to FilterExec.
@mbutrovich you may be interested in reviewing this PR.
    predicate: &BooleanArray,
) -> std::result::Result<RecordBatch, ArrowError> {
    // turn predicate into selection vector
    let mut sv = Int32Builder::with_capacity(predicate.true_count());
Probably special-casing the edge case where predicate.true_count() == record_batch.num_rows(), and defaulting to filter otherwise, would avoid regressing "normal filtering".
    }
    let sv = sv.finish();
    // note that this does not unpack dictionary-encoded arrays
    take_record_batch(record_batch, &sv)
Probably a faster way to copy is to use MutableArrayData, like done in concatenate:
https://docs.rs/arrow-select/52.2.0/src/arrow_select/concat.rs.html#180
Ah, this is done in CopyExec already - this should be faster.
The goal of this PR is to no longer need to use CopyExec after a FilterExec (to avoid copying twice in some cases)
Yes, got it. It might be possible though to reuse the same code from CopyExec in here for copying the RecordBatch?
    }
    let sv = sv.finish();
    // note that this does not unpack dictionary-encoded arrays
    take_record_batch(record_batch, &sv)
I think take_record_batch actually copies values based on indices. So even if the batch is not filtered (all rows selected), it will copy the batch. Is it different from CopyExec?
That is correct. The goal is to ensure that FilterExec always copies instead of only copying most of the time, so that we avoid also having a CopyExec that performs yet another copy.
I have implemented the suggestion from @Dandandan to specialize for the case where all rows are selected
Thanks @Dandandan @viirya for the feedback so far. I have now added criterion benchmarks before I start addressing the feedback. Here are the current results.
    // turn predicate into selection vector
    let mut sv = Int32Builder::with_capacity(predicate.true_count());
    for i in 0..predicate.len() {
        if !predicate.is_null(i) && predicate.value(i) {
Another possible special case that we should measure the impact of: we could elide !predicate.is_null(i) if we know there are no nulls in the predicate array. You end up with duplicated loops (one with and one without the null check), but depending on the branch predictor this may be worth it.
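A minimal sketch of that idea, assuming the selection-vector loop shown above is split on predicate.null_count() (the helper name build_selection_vector is illustrative, not part of the PR):

use arrow::array::{Array, BooleanArray, Int32Array, Int32Builder};

// Hypothetical helper: build the selection vector with or without the
// per-element null check, depending on whether the predicate has any nulls.
fn build_selection_vector(predicate: &BooleanArray) -> Int32Array {
    let mut sv = Int32Builder::with_capacity(predicate.true_count());
    if predicate.null_count() == 0 {
        // no nulls: skip the is_null check in the hot loop
        for i in 0..predicate.len() {
            if predicate.value(i) {
                sv.append_value(i as i32);
            }
        }
    } else {
        for i in 0..predicate.len() {
            if !predicate.is_null(i) && predicate.value(i) {
                sv.append_value(i as i32);
            }
        }
    }
    sv.finish()
}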
// BEGIN Comet changes
pub fn comet_filter_record_batch(
-pub fn comet_filter_record_batch(
+fn comet_filter_record_batch(
I think it has to be public so that we can access it from the criterion benchmark but I will double check this.
Okay.
After addressing the first round of feedback, we now have:

pub fn comet_filter_record_batch(
    record_batch: &RecordBatch,
    predicate: &BooleanArray,
) -> std::result::Result<RecordBatch, ArrowError> {
    if predicate.true_count() == record_batch.num_rows() {
        // special case where we just make an exact copy
        let arrays: Vec<ArrayRef> = record_batch
            .columns()
            .iter()
            .map(|array| {
                let capacity = array.len();
                let data = array.to_data();
                let mut mutable = MutableArrayData::new(vec![&data], false, capacity);
                mutable.extend(0, 0, capacity);
                make_array(mutable.freeze())
            })
            .collect();
        let options = RecordBatchOptions::new().with_row_count(Some(record_batch.num_rows()));
        RecordBatch::try_new_with_options(record_batch.schema().clone(), arrays, &options)
    } else {
        filter_record_batch(record_batch, predicate)
    }
}

New benchmark results:

This certainly looks a lot better. I am running TPC-DS again to make sure this really is always copying. I had tried an approach like this in the past but ran into data corruption issues.
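For reference, a minimal sketch (not from the PR) of exercising the all-rows-selected path of the function above; it assumes comet_filter_record_batch is in scope, and the column name and values are illustrative:

use std::sync::Arc;
use arrow::array::{Array, BooleanArray, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn example() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(Int32Array::from(vec![1, 2, 3]))])?;
    // all rows selected, so the special case above should make a deep copy
    let predicate = BooleanArray::from(vec![true, true, true]);
    let copied = comet_filter_record_batch(&batch, &predicate)?;
    assert_eq!(copied.num_rows(), batch.num_rows());
    // the copied column should be backed by different buffers than the input
    let before = batch.column(0).to_data();
    let after = copied.column(0).to_data();
    assert_ne!(before.buffers()[0].as_ptr(), after.buffers()[0].as_ptr());
    Ok(())
}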
        let options = RecordBatchOptions::new().with_row_count(Some(record_batch.num_rows()));
        RecordBatch::try_new_with_options(record_batch.schema().clone(), arrays, &options)
    } else {
        filter_record_batch(record_batch, predicate)
Yea, it is good to use this arrow-rs kernel instead of calling the take kernel, as filter_record_batch has some optimizations based on filter selectivity.
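For context, a minimal example (values are illustrative) of calling the arrow-rs kernel directly; internally it chooses a strategy based on how selective the predicate is:

use std::sync::Arc;
use arrow::array::{BooleanArray, Int32Array};
use arrow::compute::filter_record_batch;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn example() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(Int32Array::from(vec![1, 2, 3, 4]))])?;
    // keep rows 0 and 2 only
    let predicate = BooleanArray::from(vec![true, false, true, false]);
    let filtered = filter_record_batch(&batch, &predicate)?;
    assert_eq!(filtered.num_rows(), 2);
    Ok(())
}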
@@ -44,6 +44,7 @@ datafusion = { default-features = false, git = "https://github.com/apache/datafu
 datafusion-functions = { git = "https://github.com/apache/datafusion.git", rev = "41.0.0-rc1", features = ["crypto_expressions"] }
 datafusion-functions-nested = { git = "https://github.com/apache/datafusion.git", rev = "41.0.0-rc1", default-features = false }
 datafusion-expr = { git = "https://github.com/apache/datafusion.git", rev = "41.0.0-rc1", default-features = false }
+datafusion-execution = { git = "https://github.com/apache/datafusion.git", rev = "41.0.0-rc1", default-features = false }
We need datafusion-execution?
We use datafusion::execution::TaskContext. I guess we were just pulling this in transitively before via the datafusion crate rather than being explicit.
We may want to avoid bringing in the core datafusion crate and just depend directly on the crates that we need.
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
            .map(|array| {
                let capacity = array.len();
                let data = array.to_data();
                let mut mutable = MutableArrayData::new(vec![&data], false, capacity);
More of a Rust question for myself to dig into: is there a reason we can't do array.to_data().clone() and then call make_array on that? It seems inefficient to allocate a resizable data structure, truncate it with extend(), then freeze() it, but I am still very new to this. Thankfully there are good benchmarks in this PR for me to explore that on my own. :)
clone will make a shallow copy where the buffers are still shared, but we need to actually copy the data.
I thought copy() would make a shallow copy, while clone() would deep copy the data? If that were the case, we could possibly clone() the whole RecordBatch.
The buffer implementation under the arrays holds references. clone will just copy the references without copying data.
So RecordBatch stores a Vec of reference-counted arrays, columns: Vec<Arc<dyn Array>>? That makes sense to me: if we clone() a RecordBatch we just bump ref counts, and we don't get the desired result. But the underlying ArrayData should be safe to clone() because its underlying Vec will deep copy its buffer.
Ah, I see the reference-counted stuff all the way down inside Buffer. I understand why it won't work now. Thanks!
This is the snippet of the Buffer implementation in arrow-rs:

pub struct Buffer {
    /// the internal byte buffer.
    data: Arc<Bytes>,
    ...
}

And ArrayData:

pub struct ArrayData {
    ...
    buffers: Vec<Buffer>,
    ...
}

When you clone it, Buffer won't copy the data for you; you only get a clone of the Arc, i.e., a new reference to the existing data.
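A small sketch that makes the shallow-copy behaviour observable, assuming the arrow crate (the pointer check compares the underlying value buffers):

use arrow::array::{Array, Int32Array};

fn demo_shallow_clone() {
    let array = Int32Array::from(vec![1, 2, 3]);
    let data = array.to_data();
    let cloned = data.clone();
    // cloning ArrayData (and the Buffers inside it) only bumps Arc reference
    // counts; both values still point at the same underlying bytes
    assert_eq!(data.buffers()[0].as_ptr(), cloned.buffers()[0].as_ptr());
}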
It was necessary to have the CopyExec because FilterExec can pass through input batches in the case where the predicate evaluates to true for all rows, and this is not safe in Comet because we re-use arrays in the ScanExec.
I wonder if you can use the machinery with unary_mut to know when it is safe to reuse the arrays and when a copy is required.
So that would look like the scan using something like unary_mut, which can check if there are other existing references or not. For example, see the examples here: https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut
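For reference, a minimal sketch of the unary_mut pattern from the linked docs (the values are illustrative); the Err branch is the signal that the buffers are shared and a copy would be needed before mutating:

use arrow::array::Int32Array;

fn demo_unary_mut() {
    let array = Int32Array::from(vec![1, 2, 3]);
    // unary_mut consumes the array and mutates its buffer in place only if
    // nothing else holds a reference to it; otherwise it hands the array back
    match array.unary_mut(|x| x * 2) {
        Ok(mutated) => {
            // buffers were uniquely owned, so no copy was made
            assert_eq!(mutated.value(0), 2);
        }
        Err(original) => {
            // buffers are shared elsewhere; a deep copy (CopyExec-style) would
            // be required before reusing the arrays
            assert_eq!(original.value(0), 1);
        }
    }
}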
I created a PR against DataFusion to add a flag to FilterExec so that we can switch back to using the upstream version, assuming this PR gets accepted.
Thanks for the reviews @viirya @Dandandan @mbutrovich
* Use custom FilterExec that always uses take with a selection vector
* Remove CopyExec around FilterExec
* remove CopyExec on FilterExec inputs to joins
* remove copy before sort in some cases
* add comments
* cargo fmt
* bug fix: check for null when building selection vector
* revert
* use arrow kernel
* remove unused imports
* add criterion benchmark
* address initial feedback
* add ASF header
* fix missing imports
* Update native/core/src/execution/operators/filter.rs

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Which issue does this PR close?
Closes #757
Rationale for this change
We were often creating a filtered batch in FilterExec and then making a copy of that batch in CopyExec, resulting in redundant copying of data in the case where FilterExec had already created a new batch with the filtered data. It was necessary to have the CopyExec because FilterExec can pass through input batches in the case where the predicate evaluates to true for all rows, and this is not safe in Comet because we re-use arrays in the ScanExec.

This PR introduces a customized version of FilterExec that uses Arrow's take_record_batch kernel to always create new batches, even in the case where the predicate evaluates to true for all rows. This removes the need to wrap FilterExec in a CopyExec.

This reduces our TPC-DS time by ~25 seconds.
The majority of queries are faster with this change:
What changes are included in this PR?
How are these changes tested?