Support "A column is known to be entirely NULL" in `PruningPredicate` #9223

appletreeisyellow · 2024-02-13T22:14:26Z

Part of #9171

Rationale for this change

What changes are included in this PR?

Add a new method, PruningStatistics::row_counts(), that returns the total row count of each container.
Use PruningStatistics::row_counts() together with PruningStatistics::null_counts() to decide whether to prune containers in which a column is entirely NULL. This is done by wrapping the rewritten pruning predicate in a CASE expression:
```
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE <current_pruning_predicate>
END
```
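
To make the effect of the CASE wrapper concrete, here is a minimal Rust sketch of how it evaluates for a single container. It is not DataFusion code: `keep_container` is a hypothetical helper, `None` stands in for an unknown (SQL NULL) statistic, and the sketch assumes that an unknown result keeps the container, since pruning has to be conservative.

```rust
/// Evaluate `CASE WHEN null_count = row_count THEN false ELSE inner END`
/// for one container. `None` models an unknown (SQL NULL) value.
/// Hypothetical helper for illustration only, not part of DataFusion.
fn keep_container(
    null_count: Option<u64>,
    row_count: Option<u64>,
    inner: Option<bool>,
) -> bool {
    // The WHEN branch fires only if both counts are known and equal.
    let all_null = matches!((null_count, row_count), (Some(n), Some(r)) if n == r);
    if all_null {
        // Every value is NULL, so no row can satisfy the predicate.
        false
    } else {
        // Assumption: an unknown inner result cannot rule the container out.
        inner.unwrap_or(true)
    }
}
```

For example, `keep_container(Some(4), Some(4), Some(true))` is `false` because the column is known to be entirely NULL, while `keep_container(None, Some(4), Some(true))` is `true` because the guard cannot fire without a null count.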

Example 1

If a query has a predicate like:

```
x = 10
```

instead of re-writing to

```
x_min <= 10 AND 10 <= x_max
```

we want the re-written expression to be

```
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_min <= 10 AND 10 <= x_max
END
```

Example 2

Another more complicated example:

```
x < 5 AND x > 0 OR y = 10
```

instead of re-writing to

```
x_max < 5 AND 0 < x_min OR (y_min <= 10 AND 10 <= y_max)
```

we want the re-written expression to be

```
# x < 5
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_max < 5
END
AND
# x > 0
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_min
END
OR
# y = 10
CASE
  WHEN y_null_count = y_row_count THEN false
  ELSE y_min <= 10 AND 10 <= y_max
END
```
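
The per-comparison wrapping in Example 2 can be sketched as a small recursive rewrite. This is an illustration only: the `Pred` enum and `rewrite` function are hypothetical, the leaves carry the min/max expressions exactly as written in the example, and the output is a string rather than a physical expression. The sketch also adds explicit parentheses around AND / OR, which the example above leaves implicit.

```rust
/// A tiny predicate tree (hypothetical, for illustration only).
enum Pred {
    /// An already rewritten min/max expression for `column`, e.g. `x_max < 5`.
    Leaf {
        column: &'static str,
        stats_expr: &'static str,
    },
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// Render the predicate with every leaf wrapped in the all-NULL guard.
fn rewrite(pred: &Pred) -> String {
    match pred {
        Pred::Leaf { column, stats_expr } => format!(
            "CASE WHEN {c}_null_count = {c}_row_count THEN false ELSE {e} END",
            c = column,
            e = stats_expr
        ),
        // AND / OR nodes are kept as they are; only the leaves are guarded.
        Pred::And(l, r) => format!("({} AND {})", rewrite(l), rewrite(r)),
        Pred::Or(l, r) => format!("({} OR {})", rewrite(l), rewrite(r)),
    }
}
```

Building the tree for `x < 5 AND x > 0 OR y = 10` and calling `rewrite` on it yields three CASE expressions, one per comparison, joined by the original AND and OR.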

Are these changes tested?

Yes, updated and added more test coverage

Are there any user-facing changes?

Yes, there is a new API on the PruningStatistics trait: PruningStatistics::row_counts()
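
As a rough picture of what an implementor now provides, here is a simplified sketch. It is an assumption-laden stand-in, not the real trait: `ContainerStats`, `FileStats`, and `all_null_containers` are hypothetical names, and plain `Vec<Option<u64>>` values replace the Arrow arrays that the real PruningStatistics methods work with.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for the statistics trait: one entry per
/// container, `None` for an unknown value.
trait ContainerStats {
    fn num_containers(&self) -> usize;
    fn null_counts(&self, column: &str) -> Option<Vec<Option<u64>>>;
    /// The new piece of information: total rows in each container.
    fn row_counts(&self, column: &str) -> Option<Vec<Option<u64>>>;
}

/// Hypothetical per-file statistics.
struct FileStats {
    rows_per_file: Vec<u64>,
    nulls: HashMap<String, Vec<Option<u64>>>,
}

impl ContainerStats for FileStats {
    fn num_containers(&self) -> usize {
        self.rows_per_file.len()
    }
    fn null_counts(&self, column: &str) -> Option<Vec<Option<u64>>> {
        self.nulls.get(column).cloned()
    }
    fn row_counts(&self, _column: &str) -> Option<Vec<Option<u64>>> {
        // Every column in a file has the same number of rows.
        Some(self.rows_per_file.iter().copied().map(Some).collect())
    }
}

/// Containers in which `column` is known to be entirely NULL.
fn all_null_containers(stats: &dyn ContainerStats, column: &str) -> Vec<bool> {
    match (stats.null_counts(column), stats.row_counts(column)) {
        (Some(nulls), Some(rows)) => nulls
            .iter()
            .zip(rows.iter())
            .map(|(n, r)| matches!((n, r), (Some(n), Some(r)) if n == r))
            .collect(),
        // Without both statistics nothing is known to be all NULL.
        _ => vec![false; stats.num_containers()],
    }
}
```

An implementor that cannot supply row counts would return `None`, in which case no container is treated as all NULL and pruning falls back to the min/max checks.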

appletreeisyellow · 2024-02-26T15:49:19Z

The major feature is implemented, but there are two things I plan to do before I open it for review:

- Add more doc
- Test against IOx

I plan to continue the work during the week of March 4, 2024.

appletreeisyellow · 2024-03-13T21:17:55Z

Picking up this PR again. I'm able to test this branch against InfluxDB IOx by using the new PruningStatistics::row_counts() and verifying that PruningPredicate is able to prune out containers that have columns with all NULL.

alamb · 2024-03-14T13:25:25Z

I plan to review this PR later today

alamb

Thank you @appletreeisyellow -- I reviewed this PR quite carefully and I think it looks really nice.

I had a few minor comment suggestions that I view as optional.

I think there is one more important test to add (provide row counts, but not null counts), but otherwise this PR looks ready to go to me.

Also, thank you for very diligently updating the comments to reflect the new field

cc @viirya as I believe you have previously been interested in the pruning logic

alamb · 2024-03-14T16:55:48Z

datafusion/core/src/physical_optimizer/pruning.rs

+            END \
+        AND (\
+                CASE \
+                    WHEN c2_null_count@5 = c2_row_count@6 THEN false \


It would be nice in the future to combine these clauses (so we don't have the repeated CASE expression), but for now I think this is good enough

👍 Tracked in #9171 (comment)

alamb · 2024-03-14T16:56:45Z

datafusion/sqllogictest/test_files/repartition_scan.slt

@@ -138,7 +138,7 @@ physical_plan
 SortPreservingMergeExec: [column1@0 ASC NULLS LAST]
 --CoalesceBatchesExec: target_batch_size=8192
 ----FilterExec: column1@0 != 42
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
+------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=CASE WHEN column1_null_count@2 = column1_row_count@3 THEN false ELSE column1_min@0 != 42 OR 42 != column1_max@1 END, required_guarantees=[column1 not in (42)]


🤔 now that the pruning predicate is getting more complex, perhaps we should not display it by default anymore in explain plans. Maybe we can add a config option (as a follow on PR) that is disabled by default 🤔 to control if it is displayed

👍 Tracked in #9171 (comment)

By default, we could show part of the pruning predicate (i.e., a truncated version) so users can still see that a pruning predicate exists.

I like the idea of showing a truncated pruning predicate by default. I added this idea to #9171 (comment)

alamb · 2024-03-14T16:59:01Z

datafusion/core/src/physical_optimizer/pruning.rs

+                Some(0), // no nulls
+                Some(1), // 1 null
+                None,    // unknown nulls
+                Some(4), // 4 nulls, which is the same as the row counts, i.e. this column is all null (don't keep)


alamb · 2024-03-14T16:59:46Z

datafusion/core/src/physical_optimizer/pruning.rs

+        //  i  | [-11,-1]     | Unknown    | Unknown     | ==> All rows must pass (must keep)
+        //  i  | [NULL, NULL] | 4          | 4           | ==> The column is all null (not keep)
+        //  i  | [1, NULL]    | 10         | 0           | ==> No rows can pass (not keep)
+        let expected_ret = &[true, false, true, false, false];


The 4th element in this array is false, which differs from the 4th element when null counts aren't known. 👍

alamb

This looks great to me -- thank you very much @appletreeisyellow

alamb · 2024-03-16T12:06:45Z

I plan to merge this PR on Monday unless anyone else would like more time to comment

viirya · 2024-03-16T17:17:25Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -1320,14 +1457,56 @@ fn build_statistics_expr(
            );
        }
    };
+    let statistics_expr = wrap_case_expr(statistics_expr, expr_builder)?;


If there is no null_count or row_count statistics, do we still need to rewrite it?

Theoretically, if there are no null_count or row_count statistics, we don't need to rewrite it. Practically, we don't know whether null_count or row_count statistics exist when the rewrite takes place, because the rewrite happens first in PruningPredicate, and the null_count and row_count statistics only come in later through PruningStatistics.

Hmm, then I think the order of the rules could be changed? It doesn't seem to make sense to rewrite predicates based on statistics before the statistics are ready.

My bad, I think the doc I wrote was not clear. I updated the doc in a8639fb. What do you think now?

Thank you for updating it. Looks better now.
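
The ordering discussed in this thread can be sketched as follows: the guarded predicate is built once, before any statistics are available, and is only evaluated later against whatever statistics are supplied. This is a hedged illustration, not the DataFusion implementation; `Stats` and `build_eq_predicate` are hypothetical names, and `None` models a statistic that was never provided.

```rust
/// Statistics that arrive after the predicate has been built (hypothetical).
struct Stats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<u64>,
    row_count: Option<u64>,
}

/// Build the guarded form of `x = value` up front, with no statistics in hand.
fn build_eq_predicate(value: i64) -> impl Fn(&Stats) -> bool {
    move |s: &Stats| match (s.null_count, s.row_count) {
        // WHEN x_null_count = x_row_count THEN false
        (Some(n), Some(r)) if n == r => false,
        // ELSE x_min <= value AND value <= x_max.
        // A missing count or bound never rules the container out.
        _ => s.min.map_or(true, |m| m <= value) && s.max.map_or(true, |m| value <= m),
    }
}
```

When the counts are missing the guard simply never fires, so wrapping unconditionally costs nothing in correctness: the predicate behaves exactly like the unwrapped min/max check.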

alamb · 2024-03-18T22:53:55Z

Thank you @appletreeisyellow and @viirya for the review

Ted-Jiang · 2024-04-07T07:42:04Z

datafusion/core/src/physical_optimizer/pruning.rs

+/// `x = 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END`
+/// `x < 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END`
+/// `x = 5 AND y = 10` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END AND CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END`
+/// `x IS NULL`  | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_null_count > 0 END`


@appletreeisyellow @alamb Sorry, I am confused here. I think this row should just be `x IS NULL` | `x_null_count > 0` 🤔

https://github.com/apache/arrow-datafusion/blob/fa7ca27c15328247dbf98b2f8773c19398b8a745/datafusion/core/src/physical_optimizer/pruning.rs#L1285-L1288

I think you are right @Ted-Jiang. Nicely spotted -- I double checked and indeed all that is done is null_count > 0. I'll open a PR to fix.

```
❯ explain select duration_nano from traces where duration_nano IS NULL;
| | ParquetExec: file_groups={16 groups: [...]}, projection=[duration_nano], predicate=duration_nano@1 IS NULL, pruning_predicate=duration_nano_null_count@0 > 0, required_guarantees=[] |
```

update: #9986
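
A short sketch shows why `x IS NULL` should not carry the all-NULL guard. Both functions below are hypothetical illustrations, with `None` standing for an unknown statistic: the first mirrors the `x_null_count > 0` rewrite confirmed above, and the second shows what the guarded form from the doc comment would do if it were actually applied.

```rust
/// `x IS NULL` rewritten as `x_null_count > 0`.
/// An unknown null count keeps the container.
fn may_contain_null(null_count: Option<u64>) -> bool {
    null_count.map_or(true, |n| n > 0)
}

/// The guarded form that appeared in the doc comment by mistake:
/// `CASE WHEN x_null_count = x_row_count THEN false ELSE x_null_count > 0 END`.
fn guarded_may_contain_null(null_count: Option<u64>, row_count: Option<u64>) -> bool {
    match (null_count, row_count) {
        (Some(n), Some(r)) if n == r => false,
        _ => may_contain_null(null_count),
    }
}
```

For a container with 4 rows that are all NULL, `may_contain_null(Some(4))` keeps the container, while `guarded_may_contain_null(Some(4), Some(4))` would prune it even though every row in it satisfies `x IS NULL`.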

github-actions bot added the core Core DataFusion crate label Feb 13, 2024

appletreeisyellow changed the title ~~Support A column is known to be entirely NULL in PruningPredicate~~ Support "A column is known to be entirely NULL" in PruningPredicate Feb 13, 2024

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 16, 2024

appletreeisyellow force-pushed the chunchun/pruning-predicate-column-known-tobe-null branch 2 times, most recently from 69bb592 to db95018 Compare February 16, 2024 22:45

appletreeisyellow added 6 commits February 18, 2024 16:21

feat: add row_counts() to PruningStatistics trait

fa48bb5

chore: remove comments

ee84c83

feat(pruning): add predicate rewrite for `CASE WHEN x_null_count = x_row_count THEN false ELSE ... END`

9cd3a06

chore: clippy and update pruning predicates in tests

46278e2

chore(pruning): fix data type and column expression for null and row counts

40749ea

chore: fix pruning_predicate in slt tests
chore: clippy

doc: add examples in doc

750cb16

appletreeisyellow force-pushed the chunchun/pruning-predicate-column-known-tobe-null branch from f652f74 to 750cb16 Compare February 18, 2024 22:22

appletreeisyellow mentioned this pull request Feb 26, 2024

Support "A column is known to be entirely NULL" in PruningPredicate #9171

Closed

appletreeisyellow added 2 commits March 13, 2024 09:27

Merge branch 'main' into chunchun/pruning-predicate-column-known-tobe-null

7253a3a

chore: update comments

600788e

appletreeisyellow marked this pull request as ready for review March 13, 2024 21:26

alamb mentioned this pull request Mar 14, 2024

DataFusion weekly project plan (Andrew Lamb) - March 11, 2024 #9555

Closed

alamb reviewed Mar 14, 2024

docs: use feedback

d69202a

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> docs: take more feedback

appletreeisyellow force-pushed the chunchun/pruning-predicate-column-known-tobe-null branch from 0977d1b to d69202a Compare March 14, 2024 21:22

test: add test

a385bff

alamb approved these changes Mar 15, 2024

viirya reviewed Mar 16, 2024

appletreeisyellow added 2 commits March 18, 2024 09:59

docs: update comments

275ddc6

docs: update comments to put rewritten predicate first

a8639fb

viirya approved these changes Mar 18, 2024

alamb merged commit fa7ca27 into apache:main Mar 18, 2024
23 checks passed

appletreeisyellow deleted the chunchun/pruning-predicate-column-known-tobe-null branch March 19, 2024 01:53

alamb mentioned this pull request Apr 5, 2024

Prune columns / pages that are all null in ParquetExec by connecting up row_counts in pruning statistics #9961

Closed

Ted-Jiang reviewed Apr 7, 2024

lasantosr mentioned this pull request Apr 7, 2024

chore(rust): bump arrow v51 and datafusion v37.1 delta-io/delta-rs#2395

Merged

alamb mentioned this pull request Apr 7, 2024

Minor: fix bug in pruning predicate doc #9986

Merged
