fix: Throw bigidx error for Parquet row-count #18154
Conversation
Codecov Report. Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##             main   #18154      +/-   ##
==========================================
- Coverage   80.32%   80.29%   -0.04%
==========================================
  Files        1496     1498       +2
  Lines      197751   198688     +937
  Branches     2822     2831       +9
==========================================
+ Hits       158848   159539     +691
- Misses      38381    38622     +241
- Partials      522      527       +5

☔ View full report in Codecov by Sentry.
crates/polars-core/src/utils/mod.rs
Outdated
acc_df.vstack_mut(&df)?;
}

if acc_height > IdxSize::MAX as usize {
I don't think we should check this here. This should fail deeper down when constructing the ChunkedArray. The append operation should do the height check.
We still need some checks for the Parquet reader because of the row_offset, but I added the checks to append and extend.
The following now also throws a proper error:
import polars as pl
import numpy as np

# Two boolean Series whose combined length (5 billion rows) exceeds
# IdxSize::MAX (2**32 - 1 without the bigidx feature).
t1 = pl.Series(np.zeros(4_000_000_000), dtype=bool)
t2 = pl.Series(np.zeros(1_000_000_000), dtype=bool)
t1.append(t2)  # now raises a proper error instead of silently overflowing
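For illustration, here is a minimal standalone Rust sketch of the kind of height check described for append/extend. It is not the actual Polars code: IDX_MAX and checked_append_len are hypothetical stand-ins for IdxSize::MAX and the internal append path.

// Illustrative stand-in for IdxSize::MAX (u32 without the bigidx feature).
const IDX_MAX: usize = u32::MAX as usize;

// Returns the combined length, or an error if appending would exceed IDX_MAX.
fn checked_append_len(current_len: usize, other_len: usize) -> Result<usize, String> {
    let combined = current_len
        .checked_add(other_len)
        .ok_or_else(|| "length overflowed usize".to_string())?;
    if combined > IDX_MAX {
        return Err(format!(
            "combined length {combined} exceeds IdxSize::MAX ({IDX_MAX}); \
             compile with the bigidx feature to allow this"
        ));
    }
    Ok(combined)
}

fn main() {
    // 4 billion + 1 billion rows exceeds u32::MAX, so the append is refused.
    match checked_append_len(4_000_000_000, 1_000_000_000) {
        Ok(len) => println!("new length: {len}"),
        Err(e) => println!("append error: {e}"),
    }
}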
Yes, that's better.
Why do we still need it in Parquet? Processing datasets that exceed that limit is fine; we just shouldn't create morsels that are larger.
The problem is that previous_row_count will silently overflow, which might produce wrong results. If you read a file with 5 billion False values, it will overflow there instead of ever reaching the accumulate step.
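As a sketch of the guard this describes (illustrative names, not the reader's real code, and assuming the default u32 IdxSize), the row offset can be advanced with a checked add so that a file with more rows than IdxSize::MAX errors instead of wrapping:

// Illustrative: advance a u32 row offset by one row group's length, erroring
// instead of wrapping when the running total no longer fits in IdxSize.
fn advance_row_count(previous_row_count: u32, rg_num_rows: usize) -> Result<u32, String> {
    // Reject a single row group that is itself too large for IdxSize.
    let rows: u32 = rg_num_rows
        .try_into()
        .map_err(|_| "row group larger than IdxSize::MAX".to_string())?;
    // Checked add: returns an error instead of silently wrapping the offset.
    previous_row_count.checked_add(rows).ok_or_else(|| {
        "parquet row count exceeds IdxSize::MAX; consider the bigidx feature".to_string()
    })
}

fn main() {
    // Five row groups of one billion rows each: the fifth add pushes the
    // running total past u32::MAX and surfaces an error.
    let mut offset: u32 = 0;
    for _ in 0..5 {
        match advance_row_count(offset, 1_000_000_000) {
            Ok(next) => offset = next,
            Err(e) => {
                println!("error: {e}");
                return;
            },
        }
    }
    println!("total rows: {offset}");
}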
And what if we use u64 here, would that work?
It is quite a big change and the overflow will still happen somewhere.
Alright. Not worth it for now. 👍
@@ -353,7 +354,10 @@ fn rg_to_dfs_prefiltered(
});

for (_, df) in &dfs {
I wonder if the check is needed here as well, if we can catch it at the source, i.e. ChunkedArray construction.
Fixes #18149