Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support "A column is known to be entirely NULL" in PruningPredicate #9223

Conversation

appletreeisyellow
Copy link
Contributor

@appletreeisyellow appletreeisyellow commented Feb 13, 2024

Part of #9171

Rationale for this change

What changes are included in this PR?

  1. Add new method PruningStatistics::row_counts() to get the total row counts in each container.
  2. Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine whether to prune containers that have columns with all NULL. This is done by wrapping a CASE expression around the re-written pruning predicate expression:
    CASE
      WHEN x_null_count = x_row_count THEN false
      ELSE <current_pruning_predicate>
    END

Example 1

If a query has a predicate like:

x = 10

instead of re-writing to

x_min <= 10 AND 10 <= x_max

we want the re-written expression to be

CASE
	WHEN x_null_count = x_row_count THEN false
	ELSE x_min <= 10 AND 10 <= x_max
END

Example 2

Another more complicated example:

x < 5 AND x > 0 OR y = 10

instead of re-writing to

x_max < 5 AND 0 < x_min OR (y_min <= 10 AND 10 <= y_max)

we want the re-written expression to be

# x < 5
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_max < 5 
END
AND
#  x > 0
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_min
END
OR
# y = 10
CASE
  WHEN y_null_count = y_row_count THEN false
  ELSE y_min <= 10 AND 10 <= y_max
END

Are these changes tested?

Yes, updated and added more test coverage

Are there any user-facing changes?

Yes, there is a new API for PruningPredicate called PruningPredicate::row_counts()

@github-actions github-actions bot added the core Core DataFusion crate label Feb 13, 2024
@appletreeisyellow appletreeisyellow changed the title Support A column is known to be entirely NULL in PruningPredicate Support "A column is known to be entirely NULL" in PruningPredicate Feb 13, 2024
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 16, 2024
@appletreeisyellow appletreeisyellow force-pushed the chunchun/pruning-predicate-column-known-tobe-null branch 2 times, most recently from 69bb592 to db95018 Compare February 16, 2024 22:45
@appletreeisyellow appletreeisyellow force-pushed the chunchun/pruning-predicate-column-known-tobe-null branch from f652f74 to 750cb16 Compare February 18, 2024 22:22
@appletreeisyellow
Copy link
Contributor Author

The major feature is implemented, but there are two things I plan to do before I open it for review:

  1. Add more doc
  2. Test against IOx

I plan to continue the work on the week of March 4, 2024

@appletreeisyellow
Copy link
Contributor Author

appletreeisyellow commented Mar 13, 2024

Picking up this PR again. I'm able to test this branch against InfluxDB IOx by using the new PruningStatistics::row_counts() and verifying that PruningPredicate is able to prune out containers that have columns with all NULL.

@appletreeisyellow appletreeisyellow marked this pull request as ready for review March 13, 2024 21:26
@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

I plan to review this PR later today

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @appletreeisyellow -- I reviewed this PR quite carefully and I think it looks really nice.

I had a few minor comment suggestions that I view as optional.

I think there is one more important test to add(provide row counts, but not null counts) but otherwise this PR looks ready to go to me.

Also, thank you for very dilligently updating the comments to reflect the new field

cc @viirya as I believe you have previously been interested in the pruning logic

datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
END \
AND (\
CASE \
WHEN c2_null_count@5 = c2_row_count@6 THEN false \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice in the future to combine these clauses (so we didn't have the repeated CASE expression) but for now I think this is good enought

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Tracked in #9171 (comment)

@@ -138,7 +138,7 @@ physical_plan
SortPreservingMergeExec: [column1@0 ASC NULLS LAST]
--CoalesceBatchesExec: target_batch_size=8192
----FilterExec: column1@0 != 42
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=CASE WHEN column1_null_count@2 = column1_row_count@3 THEN false ELSE column1_min@0 != 42 OR 42 != column1_max@1 END, required_guarantees=[column1 not in (42)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 now that the pruning predicate is getting more complex, perhaps we should not display it by default anymore in explain plans. Maybe we can add a config option (as a follow on PR) that is disabled by default 🤔 to control if it is displayed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Tracked in #9171 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, we can show part of pruning predicate (i.e., truncated one) so users can know there is pruning predicate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of showing truncated pruning predicate by default. I updated this idea in #9171 (comment)

Some(0), // no nulls
Some(1), // 1 null
None, // unknown nulls
Some(4), // 4 nulls, which is the same as the row counts, i.e. this column is all null (don't keep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// i | [-11,-1] | Unknown | Unknown | ==> All rows must pass (must keep)
// i | [NULL, NULL] | 4 | 4 | ==> The column is all null (not keep)
// i | [1, NULL] | 10 | 0 | ==> No rows can pass (not keep)
let expected_ret = &[true, false, true, false, false];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 4th element in this array is false which is different than the 4th element when null counts aren't known . 👍

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

docs: take more feedback
@appletreeisyellow appletreeisyellow force-pushed the chunchun/pruning-predicate-column-known-tobe-null branch from 0977d1b to d69202a Compare March 14, 2024 21:22
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me -- thank you vevry much @appletreeisyellow

@alamb
Copy link
Contributor

alamb commented Mar 16, 2024

I plan to merge this PR on Monday unless anyone else would like more time to comment

@@ -1320,14 +1457,56 @@ fn build_statistics_expr(
);
}
};
let statistics_expr = wrap_case_expr(statistics_expr, expr_builder)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no null_count or row_count statistics, do we still need to rewrite it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, if there is no null_count or row_count statistics, we don't need to rewrite it. Practically, we don't know about null_count or row_count statistics when the rewrite takes place, because the rewrite happens in PruningPredicate first and then null_count and row_count statistics come in later in PruningStatistics

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, then I think the rules order could be changed? It sounds not making sense to rewrite predicates based on statistics before the statistics are ready.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I think the doc I wrote was not clear, I updated the doc in a8639fb. What do you think now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating it. Looks better now.

@alamb alamb merged commit fa7ca27 into apache:main Mar 18, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 18, 2024

Thank you @appletreeisyellow and @viirya for the review

/// `x = 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END`
/// `x < 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END`
/// `x = 5 AND y = 10` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END AND CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END`
/// `x IS NULL` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_null_count > 0 END`
Copy link
Member

@Ted-Jiang Ted-Jiang Apr 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@appletreeisyellow @alamb Sorry i am confused here, i think here just need x IS NULL | x_null_count > 0 END 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right @Ted-Jiang. Nicely spotted -- I double checked and indeed all that is done is null_count > 0. I'll a PR to fix.

❯ explain select duration_nano from traces where duration_nano IS NULL;

|               |     ParquetExec: file_groups={16 groups: [...]}, projection=[duration_nano], predicate=duration_nano@1 IS NULL, pruning_predicate=duration_nano_null_count@0 > 0, required_guarantees=[] |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: #9986

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants