What is the problem the feature request solves?
Because FilterExec can sometimes return its input vectors without copying them (in the case where the predicate evaluates to true for all rows in the batch), we have to wrap this exec in a CopyExec when using this as the input to a join:
    // DataFusion Join operators keep the input batch internally. We need
    // to copy the input batch to avoid the data corruption from reusing the input
    // batch.
    let left = if can_reuse_input_batch(&left) {
        Arc::new(CopyExec::new(left))
    } else {
        left
    };
In the case where the filter does not select all rows in the batch, it will make a copy of the selected rows, and then we copy them again in CopyExec. Perhaps we could avoid this redundant copy.
Describe the potential solution
One idea would be to modify FilterExec to add some metadata to the returned batch to indicate whether it is returning any original vectors and then have CopyExec avoid a copy when this metadata is set.
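The idea above can be sketched with standard-library types only. This is a minimal model under stated assumptions, not DataFusion or Comet code: `Batch`, its `reuses_input` flag, `filter`, and `copy_if_needed` are hypothetical names, and `Arc<Vec<i64>>` stands in for an Arrow array whose buffer may be shared with the input.

```rust
use std::sync::Arc;

/// Simplified stand-in for an Arrow column: a shared, immutable buffer.
type Column = Arc<Vec<i64>>;

/// Hypothetical batch type carrying the proposed metadata flag.
#[derive(Debug, Clone)]
struct Batch {
    columns: Vec<Column>,
    /// True when the columns still share their buffers with the filter's input.
    reuses_input: bool,
}

/// Filter sketch: passes the input columns through untouched when every row
/// is selected, otherwise materializes the selected rows into new buffers.
fn filter(input: &Batch, mask: &[bool]) -> Batch {
    if mask.iter().all(|&keep| keep) {
        // Cloning an Arc only bumps the reference count; the data is shared.
        return Batch { columns: input.columns.clone(), reuses_input: true };
    }
    let columns = input
        .columns
        .iter()
        .map(|col| {
            let selected: Vec<i64> = col
                .iter()
                .zip(mask)
                .filter(|pair| *pair.1)
                .map(|(value, _)| *value)
                .collect();
            Arc::new(selected)
        })
        .collect();
    Batch { columns, reuses_input: false }
}

/// Copy sketch: deep-copies only when the flag says the buffers are shared,
/// and hands already-materialized batches through without a second copy.
fn copy_if_needed(batch: Batch) -> Batch {
    if !batch.reuses_input {
        return batch;
    }
    let columns = batch.columns.iter().map(|col| Arc::new(col.to_vec())).collect();
    Batch { columns, reuses_input: false }
}
```

In this model, a batch that the filter already materialized goes through `copy_if_needed` without touching its data, which is the redundant copy the proposal aims to remove.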
Additional context
No response
Related to this, there is a plan to integrate the CoalesceBatches logic into FilterExec upstream, which may complicate this further, since it may become harder to track when original arrays are being returned.
I experimented with copying FilterExec into the Comet code base and adding some extra code to ensure that we never return the input arrays. I then removed the CopyExec around join inputs and saw a small performance improvement.
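The approach in this experiment can be illustrated with a rough, standard-library-only sketch. It is not the actual Comet change: `filter_never_reuse` is a hypothetical name and `Arc<Vec<i64>>` stands in for an Arrow array. The point is that the filter always builds fresh buffers, even when every row passes, so a downstream join can hold on to its input without a separate copy step.

```rust
use std::sync::Arc;

/// Simplified stand-in for an Arrow column: a shared, immutable buffer.
type Column = Arc<Vec<i64>>;

/// Filter sketch that never hands back the input buffers: the selected rows
/// are always collected into newly allocated vectors, with no all-true
/// fast path that would return the original columns.
fn filter_never_reuse(columns: &[Column], mask: &[bool]) -> Vec<Column> {
    columns
        .iter()
        .map(|col| {
            let selected: Vec<i64> = col
                .iter()
                .zip(mask)
                .filter(|pair| *pair.1)
                .map(|(value, _)| *value)
                .collect();
            Arc::new(selected)
        })
        .collect()
}
```

The trade-off in this model is that an all-true batch now pays for one copy inside the filter, while a partially selected batch is copied once instead of twice.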