Enable Parquet Row and Page Filtering by default (WIP) #3828

Closed
wants to merge 2 commits into from


alamb
Contributor

@alamb alamb commented Oct 13, 2022

Draft until

Which issue does this PR close?

Closes #3463
Closes #4085
re #3462

Rationale for this change

This PR turns on parquet scan predicate pushdown (see #3462) by default. I am putting it up early as part of the testing process, so we can work through any issues it may uncover.

This feature promises to be one of the most significant performance improvements for DataFusion reading from parquet in a while. All the hard work was done by @Ted-Jiang @thinkharderdev and @tustvold

What changes are included in this PR?

Enable pushing filters into the scan directly

Note: this feature can be disabled by setting the datafusion.execution.parquet.pushdown_filters configuration option to false.
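
For readers unfamiliar with the idea, here is a minimal, self-contained sketch of what page-level filtering inside the scan means conceptually. It is not DataFusion's implementation: `PageStats`, `Predicate`, and `pages_to_read` are hypothetical names invented for this illustration, and it assumes each page may carry min/max statistics for a single integer column. A page is skipped only when its statistics prove that no row can match; a page without statistics is always read.

```rust
/// Min/max statistics for one page of an i64 column (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: i64,
    max: i64,
}

/// Two simple predicate shapes, enough for the illustration.
enum Predicate {
    /// column > value
    Gt(i64),
    /// column = value
    Eq(i64),
}

impl Predicate {
    /// Returns true if some row in the page could satisfy the predicate.
    /// Missing statistics give no proof either way, so the page is kept.
    fn may_match(&self, stats: Option<PageStats>) -> bool {
        match (self, stats) {
            (_, None) => true,
            (Predicate::Gt(v), Some(s)) => s.max > *v,
            (Predicate::Eq(v), Some(s)) => s.min <= *v && *v <= s.max,
        }
    }
}

/// Returns the indices of the pages that still have to be decoded.
fn pages_to_read(pages: &[Option<PageStats>], pred: &Predicate) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, stats)| pred.may_match(**stats))
        .map(|(index, _)| index)
        .collect()
}

fn main() {
    let pages = [
        Some(PageStats { min: 0, max: 10 }),
        Some(PageStats { min: 11, max: 20 }),
        None, // the writer recorded no statistics for this page
        Some(PageStats { min: 21, max: 30 }),
    ];

    // column > 15: page 0 is skipped, the rest are read.
    println!("{:?}", pages_to_read(&pages, &Predicate::Gt(15)));
    // column = 5: only page 0 can match, plus the page without statistics.
    println!("{:?}", pages_to_read(&pages, &Predicate::Eq(5)));
}
```

The surviving pages still need the row-level filter applied, since statistics can only rule pages out, never confirm that every row matches.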

Are there any user-facing changes?

Hopefully faster performance

@github-actions github-actions bot added the core Core DataFusion crate label Oct 13, 2022
@alamb alamb force-pushed the alamb/enable_parquet_by_default branch from a0cb27c to e37e7b9 Compare November 5, 2022 11:17
@alamb alamb changed the title Enable Parquet Row Filtering by default (WIP) Enable Parquet Row and Page Filtering by default (WIP) Nov 26, 2022
@alamb alamb force-pushed the alamb/enable_parquet_by_default branch from e37e7b9 to 9359dfa Compare November 26, 2022 10:56
@alamb
Contributor Author

alamb commented Nov 26, 2022

A small update: when I ran the TPC-H benchmarks against the default parquet files created by the benchmark, I did not see any improvement. There was also an error in the page index code, which I still need to track down.

@alamb
Contributor Author

alamb commented Nov 26, 2022

Specifically, I made the parquet files like this:

RUSTFLAGS="-C target-cpu=native" cargo run --release --bin tpch -- convert --input ~/tpch_data/data_SF1 --output ~/tpch_data/parquet_data_SF1 --format=parquet

And then ran

RUSTFLAGS="-C target-cpu=native" cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ~/tpch_data/parquet_data_SF1 --format parquet --batch-size 4096          

    Finished release [optimized] target(s) in 0.28s
     Running `target/release/tpch benchmark datafusion --iterations 3 --path /home/alamb/tpch_data/parquet_data_SF1 --format parquet --batch-size 4096`
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: None, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/alamb/tpch_data/parquet_data_SF1", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, enable_scheduler: false }
Query 1 iteration 0 took 1511.2 ms and returned 4 rows
Query 1 iteration 1 took 1372.2 ms and returned 4 rows
Query 1 iteration 2 took 1419.7 ms and returned 4 rows
Query 1 avg time: 1434.38 ms
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:129:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: ArrowError(ExternalError(ArrowError(ExternalError("Arrow error: External error: Execution error: Arrow error: External error: Arrow error: External error: Execution error: Arrow error: External error: Execution error: Join Error: task 218 panicked"))))
alamb@aal-dev:~/arrow-datafusion$ 

FYI @Ted-Jiang -- haven't had a chance to file this as a ticket or look more carefully into it

@Ted-Jiang
Member

Ted-Jiang commented Nov 26, 2022

Thanks for testing this. I will try to figure it out tomorrow.

@Ted-Jiang
Member

@alamb I think this is fixed by #4387. I ran:

(venv) yangjiang@LM-SHC-15009782 benchmarks % OPT_PARQUET_ENABLE_PAGE_INDEX=true  cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ~/tpch-parquet  --format parquet --batch-size 4096                 

and it completed without error.

@alamb alamb force-pushed the alamb/enable_parquet_by_default branch from 9359dfa to c249b07 Compare November 30, 2022 18:57
@alamb alamb closed this Mar 18, 2023
@alamb alamb deleted the alamb/enable_parquet_by_default branch August 8, 2023 20:14
Successfully merging this pull request may close these issues.

Enable parquet filter pushdown by default
Enable parquet page level skipping (page index pruning) by default