Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: is the combination of limit and predicate push-down safe in ParquetExec? #900

Closed
andygrove opened this issue Aug 16, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@andygrove
Copy link
Member

andygrove commented Aug 16, 2021

Describe the bug
We push pruning predicates and limits down to ParquetExec but it seems like the combination could be unsafe, or perhaps I am not comprehending the logic?

Given the predicate x > 10 and a limit of 10, suppose we have the following 2 partitions:

  • Partition 0 has 100 rows and 5 rows match x > 10
  • Partition 1 has 100 rows and 5 rows match x > 10

As we iterate over row groups we have

for row_group_meta in meta_data.row_groups() {
    num_rows += row_group_meta.num_rows();

we break out of this loop once hitting the limit, based on num_rows

if limit.map(|x| num_rows >= x as i64).unwrap_or(false) {
    limit_exhausted = true;
    break;
}

This stops processing the file once the limit is reached, without considering how many rows the predicate would match.

Finally we stop processing partitions as well, here:

// remove files that are not needed in case of limit
filenames.truncate(total_files);
partitions.push(ParquetPartition::new(filenames, statistics));
if limit_exhausted {
    break;
}

To Reproduce
When I have time will write a test to see if there is an issue here.

Expected behavior
Perhaps we should not apply the limit when we are pushing predicates down?

Additional context
N/A

@andygrove andygrove added the bug Something isn't working label Aug 16, 2021
@Dandandan
Copy link
Contributor

I think in actually enabling the pushdown, it shouldn't push down the limit when it has a filter. So it also will not be pushed to the parquet exec.

Doesn't hurt to have some extra tests around this, or maybe some code in the parquet exec to disable / do limit after pruning when the ParquetExec has a predicate.

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

I believe this is a dupe of the recently fixed issue:

@alamb alamb closed this as completed Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants