Describe the bug
We push both pruning predicates and limits down to ParquetExec, but it seems like the combination could be unsafe, or perhaps I am misunderstanding the logic?
Given the predicate x > 10 and a limit of 10, suppose we have the following 2 partitions:
Partition 0 has 100 rows and 5 rows match x > 10
Partition 1 has 100 rows and 5 rows match x > 10
As we iterate over row groups we have
```rust
for row_group_meta in meta_data.row_groups() {
    num_rows += row_group_meta.num_rows();
```

We break out of this loop once the limit is hit, based on num_rows:

```rust
    if limit.map(|x| num_rows >= x as i64).unwrap_or(false) {
        limit_exhausted = true;
        break;
    }
}
```
This stops processing the file once the limit is reached, without considering how many rows the predicate would match.
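One way to avoid this would be to only treat the limit as exhausted when no pruning predicate was pushed down, since pre-filter row counts overcount what the scan will actually produce. A hypothetical sketch of that guard (the function name and signature are illustrative, not DataFusion's API):

```rust
// Only allow limit-based early termination when there is no predicate:
// with a predicate, `num_rows` counts rows before filtering, so it cannot
// be compared against the limit safely.
fn limit_exhausted(limit: Option<usize>, has_predicate: bool, num_rows: usize) -> bool {
    if has_predicate {
        return false;
    }
    limit.map(|l| num_rows >= l).unwrap_or(false)
}

fn main() {
    // No predicate, 100 rows seen against a limit of 10: safe to stop.
    assert!(limit_exhausted(Some(10), false, 100));
    // With a predicate, never stop early based on raw row counts.
    assert!(!limit_exhausted(Some(10), true, 100));
    // No limit at all: never stop early.
    assert!(!limit_exhausted(None, false, 100));
    println!("ok");
}
```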
Finally we stop processing partitions as well, here:
```rust
// remove files that are not needed in case of limit
filenames.truncate(total_files);
partitions.push(ParquetPartition::new(filenames, statistics));
if limit_exhausted {
    break;
}
```
To Reproduce
When I have time, I will write a test to see whether there is an issue here.
Expected behavior
Perhaps we should not apply the limit when we are pushing predicates down?
Additional context
N/A
I think that when enabling predicate pushdown, the planner shouldn't push the limit down when there is a filter, so the limit would never reach the ParquetExec in that case.
It doesn't hurt to have some extra tests around this, or perhaps some code in ParquetExec to disable the limit, or apply it only after pruning, when the ParquetExec has a predicate.
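The planner-side rule suggested here could be sketched as follows (hypothetical types and names, purely illustrative of the decision, not DataFusion's planner API):

```rust
// Decide where a LIMIT may be enforced relative to a scan:
// push it into the scan only when no filter sits between them.
#[derive(Debug, PartialEq)]
enum ScanLimit {
    Push(usize), // safe to enforce inside the scan
    Keep(usize), // must stay above the filter
}

fn plan_limit(limit: usize, has_filter: bool) -> ScanLimit {
    if has_filter {
        ScanLimit::Keep(limit)
    } else {
        ScanLimit::Push(limit)
    }
}

fn main() {
    // With a filter, the limit stays above the filter operator.
    assert_eq!(plan_limit(10, true), ScanLimit::Keep(10));
    // Without a filter, the limit can be pushed into the scan.
    assert_eq!(plan_limit(10, false), ScanLimit::Push(10));
    println!("ok");
}
```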