Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid copying batches twice in scan->filter->join #757

Closed
andygrove opened this issue Aug 1, 2024 · 2 comments · Fixed by #835
Closed

Avoid copying batches twice in scan->filter->join #757

andygrove opened this issue Aug 1, 2024 · 2 comments · Fixed by #835
Assignees
Labels
enhancement New feature or request performance
Milestone

Comments

@andygrove
Copy link
Member

What is the problem the feature request solves?

Because FilterExec can sometimes return its input vectors without copying them (in the case where the predicate evaluates to true for all rows in the batch), we have to wrap this exec in a CopyExec when using this as the input to a join:

// DataFusion Join operators keep the input batch internally. We need
// to copy the input batch to avoid the data corruption from reusing the input
// batch.
let left = if can_reuse_input_batch(&left) {
    Arc::new(CopyExec::new(left))
} else {
    left
};

In the case where the filter does not select all rows in the batch, it will make a copy of the selected rows, and then we copy them again in CopyExec. Perhaps we could avoid this redundant copy.

Describe the potential solution

One idea would be to modify FilterExec to add some metadata to the returned batch to indicate whether it is returning any original vectors and then have CopyExec avoid a copy when this metadata is set.

Additional context

No response

@andygrove andygrove added enhancement New feature or request performance labels Aug 1, 2024
@andygrove andygrove added this to the 0.2.0 milestone Aug 1, 2024
@andygrove
Copy link
Member Author

Related to this, there is a plan to integrate CoalesceBatches logic within FilterExec upstream, which may complicate this more (it may be harder to track when original arrays are being returned)

@andygrove andygrove self-assigned this Aug 2, 2024
@andygrove
Copy link
Member Author

I experimented with copying FilterExec into the Comet code base and adding some extra code to ensure that we never return the input arrays, and removed the CopyExec around join inputs and saw a small performance improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment