feat: Conserve Parquet SortingColumns for ints #19251

Merged: 3 commits, Oct 16, 2024

@coastalwhite (Collaborator) commented Oct 15, 2024

This PR uses the Parquet `SortingColumns` metadata to preserve the sorted flag
when reading a file into Polars. Currently, this is only enabled for integer
columns, as other types might require additional considerations. With this
change in place, however, enabling it for other types is trivial.

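```python
import io

import polars as pl
import pyarrow.parquet as pq

f = io.BytesIO()

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5, None],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0, None],
    "c": range(6),
})

# Write with pyarrow and declare columns 0 ("a") and 1 ("b") as sorted.
# Arguments are SortingColumn(column_index, descending, nulls_first).
pq.write_table(
    df.to_arrow(),
    f,
    sorting_columns=[
        pq.SortingColumn(0, False, False),
        pq.SortingColumn(1, False, False),
    ],
)

# Read the file back with Polars and inspect the per-column sorted flag.
f.seek(0)
df = pl.read_parquet(f)._to_metadata(stats='sorted_asc')
print(df)
```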

Before:

```console
shape: (3, 2)
┌─────────────┬────────────┐
│ column_name ┆ sorted_asc │
│ ---         ┆ ---        │
│ str         ┆ bool       │
╞═════════════╪════════════╡
│ a           ┆ false      │
│ b           ┆ false      │
│ c           ┆ false      │
└─────────────┴────────────┘
```

After:

```console
shape: (3, 2)
┌─────────────┬────────────┐
│ column_name ┆ sorted_asc │
│ ---         ┆ ---        │
│ str         ┆ bool       │
╞═════════════╪════════════╡
│ a           ┆ true       │
│ b           ┆ false      │
│ c           ┆ false      │
└─────────────┴────────────┘
```
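
The before/after tables can be read as a simple rule: a column keeps its ascending sorted flag only if the file lists it as an ascending sorting column and its type is an integer. The following is a minimal pure-Python sketch of that rule, not the Rust code from this PR. The `SortingColumn` dataclass, the `INTEGER_DTYPES` set, and the `sorted_asc_flags` helper are all hypothetical names introduced for illustration, and the sketch ignores any subtleties around multi-column sort order.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a Parquet sorting-column entry
# (this is not the pyarrow class of the same name).
@dataclass(frozen=True)
class SortingColumn:
    column_index: int
    descending: bool = False
    nulls_first: bool = False


# Assumed set of dtype names treated as integers in this sketch.
INTEGER_DTYPES = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
}


def sorted_asc_flags(schema, sorting_columns):
    """Return {column_name: bool}; True where an ascending flag would be kept.

    `schema` is a list of (name, dtype_name) pairs in file column order.
    """
    flags = {name: False for name, _ in schema}
    for sc in sorting_columns:
        # Skip entries that point outside the schema.
        if not 0 <= sc.column_index < len(schema):
            continue
        name, dtype = schema[sc.column_index]
        # Only ascending integer columns are flagged in this sketch.
        if dtype in INTEGER_DTYPES and not sc.descending:
            flags[name] = True
    return flags


schema = [("a", "int64"), ("b", "float64"), ("c", "int64")]
sorting = [SortingColumn(0), SortingColumn(1)]
flags = sorted_asc_flags(schema, sorting)
print(flags)  # {'a': True, 'b': False, 'c': False}
```

This mirrors the "After" table: `a` is an integer column declared as sorted, `b` is declared as sorted but is a float, and `c` is an integer column that was never declared as sorted.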

@github-actions bot added the enhancement, python, and rust labels Oct 15, 2024

codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 94.02985% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.08%. Comparing base (e29e9df) to head (3cd0c62).
Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/polars-io/src/parquet/read/read_impl.rs | 93.54% | 4 Missing ⚠️ |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main   #19251      +/-   ##
==========================================
+ Coverage   79.68%   80.08%   +0.39%
==========================================
  Files        1532     1528       -4
  Lines      209211   209614     +403
  Branches     2416     2415       -1
==========================================
+ Hits       166710   167866    +1156
+ Misses      41953    41195     -758
- Partials      548      553       +5
```


@ritchie46 (Member) commented

Nice feature! Still something wrong with the tests...

@ritchie46 ritchie46 merged commit 3720494 into pola-rs:main Oct 16, 2024
23 of 24 checks passed
@coastalwhite coastalwhite deleted the feat/pq-conserve-sortingcolumns branch October 16, 2024 08:04
@c-peters added the accepted label Oct 21, 2024