fix: Throw bigidx error for Parquet row-count #18154
Conversation
Codecov Report. Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##             main   #18154      +/-   ##
==========================================
- Coverage   80.32%   80.29%   -0.04%
==========================================
  Files        1496     1498       +2
  Lines      197751   198688     +937
  Branches     2822     2831       +9
==========================================
+ Hits       158848   159539     +691
- Misses      38381    38622     +241
- Partials      522      527       +5

☔ View full report in Codecov by Sentry.
crates/polars-core/src/utils/mod.rs
Outdated
acc_df.vstack_mut(&df)?;
}

if acc_height > IdxSize::MAX as usize {
I don't think we should check this here. This should fail deeper down when constructing the ChunkedArray. The append operation should do the height check.
We still need some checks for the Parquet reader because of the row_offset, but I added the checks to append and extend.
The following now also throws a proper error:
import polars as pl
import numpy as np

# Two boolean Series whose combined length (5 billion rows) exceeds
# IdxSize::MAX (2**32 - 1 without the bigidx feature).
t1 = pl.Series(np.zeros(4_000_000_000), dtype=bool)
t2 = pl.Series(np.zeros(1_000_000_000), dtype=bool)
t1.append(t2)  # now raises a proper error instead of silently overflowing
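For illustration, here is a minimal standalone Rust sketch of the kind of height check described for append/extend. It is not the actual Polars code: IDX_MAX and checked_append_len are hypothetical stand-ins for IdxSize::MAX and the internal append path.

// Illustrative stand-in for IdxSize::MAX (u32 without the bigidx feature).
const IDX_MAX: usize = u32::MAX as usize;

// Returns the combined length, or an error if appending would exceed IDX_MAX.
fn checked_append_len(current_len: usize, other_len: usize) -> Result<usize, String> {
    let combined = current_len
        .checked_add(other_len)
        .ok_or_else(|| "length overflowed usize".to_string())?;
    if combined > IDX_MAX {
        return Err(format!(
            "combined length {combined} exceeds IdxSize::MAX ({IDX_MAX}); \
             compile with the bigidx feature to allow this"
        ));
    }
    Ok(combined)
}

fn main() {
    // 4 billion + 1 billion rows exceeds u32::MAX, so the append is refused.
    match checked_append_len(4_000_000_000, 1_000_000_000) {
        Ok(len) => println!("new length: {len}"),
        Err(e) => println!("append error: {e}"),
    }
}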
Yes, that's better.
Why do we still need it in Parquet? Processing datasets that exceed that limit is fine; we just shouldn't create morsels that are larger.
The problem is that previous_row_count will silently overflow, which might produce wrong results. If you read a file with 5 billion False values, it will overflow there instead of ever reaching the accumulate step.
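As a sketch of the guard this describes (illustrative names, not the reader's real code, and assuming the default u32 IdxSize), the row offset can be advanced with a checked add so that a file with more rows than IdxSize::MAX errors instead of wrapping:

// Illustrative: advance a u32 row offset by one row group's length, erroring
// instead of wrapping when the running total no longer fits in IdxSize.
fn advance_row_count(previous_row_count: u32, rg_num_rows: usize) -> Result<u32, String> {
    // Reject a single row group that is itself too large for IdxSize.
    let rows: u32 = rg_num_rows
        .try_into()
        .map_err(|_| "row group larger than IdxSize::MAX".to_string())?;
    // Checked add: returns an error instead of silently wrapping the offset.
    previous_row_count.checked_add(rows).ok_or_else(|| {
        "parquet row count exceeds IdxSize::MAX; consider the bigidx feature".to_string()
    })
}

fn main() {
    // Five row groups of one billion rows each: the fifth add pushes the
    // running total past u32::MAX and surfaces an error.
    let mut offset: u32 = 0;
    for _ in 0..5 {
        match advance_row_count(offset, 1_000_000_000) {
            Ok(next) => offset = next,
            Err(e) => {
                println!("error: {e}");
                return;
            },
        }
    }
    println!("total rows: {offset}");
}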
And what if we use u64 here, would that work?
It is quite a big change and the overflow will still happen somewhere.
Alright. Not worth it for now. 👍
@@ -353,7 +354,10 @@ fn rg_to_dfs_prefiltered(
});

for (_, df) in &dfs {
I wonder if the check is needed here as well, if we can catch it at the source, i.e. ChunkedArray construction.
Fixes #18149