Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Avoid redundant copying of arrays in scan->filter->join #762

Closed
wants to merge 10 commits into from

Conversation

andygrove
Copy link
Member

Which issue does this PR close?

Closes #757

Rationale for this change

Improve performance of joins by removing some redundant copying of arrays for join inputs.

What changes are included in this PR?

This PR adds a new CometFilterExec which is a copy of DataFusion's FilterExec with one small change to ensure that input arrays are never emitted without copying:

// BEGIN HACK: this is the only modification to the DataFusion version
// we ensure that we never return the original arrays
let filtered_batch = if filtered_batch.num_rows() == batch.num_rows() {
    RecordBatch::try_new(
        filtered_batch.schema(),
        filtered_batch
            .columns()
            .iter()
            .map(|array| copy_array(array))
            .collect(),
    )?
} else {
    filtered_batch
};
// END HACK

In planner.rs we have this check for join inputs:

let left = if can_reuse_input_batch(&left) {
    Arc::new(CopyExec::new(left))
} else {
    left
};

Because can_reuse_input_batch returns false for CometFilterExec, we are longer creating the CopyExec.

How are these changes tested?

Existing tests

@andygrove andygrove marked this pull request as draft August 2, 2024 02:19
@andygrove
Copy link
Member Author

I'm not seeing any performance improvement so far

Before

OpenJDK 64-Bit Server VM 11.0.24+8-post-Ubuntu-1ubuntu322.04 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
join_filtered_scans                                 596            696         115         24.4          41.1       1.0X
join_filtered_scans: Comet (Scan, Exec)            1162           1224          93         12.5          80.1       0.5X

After

OpenJDK 64-Bit Server VM 11.0.24+8-post-Ubuntu-1ubuntu322.04 on Linux 6.5.0-41-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
join_filtered_scans                                 574            688         129         25.3          39.6       1.0X
join_filtered_scans: Comet (Scan, Exec)            1181           1231          65         12.3          81.5       0.5X

@andygrove andygrove closed this Aug 12, 2024
@andygrove andygrove deleted the join-no-copy-exec branch December 3, 2024 04:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid copying batches twice in scan->filter->join
1 participant