Polars 1.3.0 fails to collect scanned Parquet containing struct columns #17933

Closed
2 tasks done
danielgafni opened this issue Jul 29, 2024 · 1 comment · Fixed by #17941
Labels: A-io-parquet, accepted, bug, P-high, python

danielgafni commented Jul 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from hypothesis import given, settings
from polars.testing.parametric import dataframes


# this test fails
@given(df=dataframes(min_size=5))
@settings(max_examples=100, deadline=None)
def test_polars_parquet_write_read_with_structs(
    df: pl.DataFrame, tmp_path_factory
):
    path = tmp_path_factory.mktemp("test1") / "df.parquet"

    df.write_parquet(path)
    pl.read_parquet(path)  # this works

    ldf = pl.scan_parquet(path)
    ldf.collect()  # this fails!


# this test passes because it's not running over `pl.Struct` types
@given(df=dataframes(excluded_dtypes=[pl.Struct], min_size=5))
@settings(max_examples=100, deadline=None)
def test_polars_parquet_write_read_without_structs(
    df: pl.DataFrame, tmp_path_factory
):
    path = tmp_path_factory.mktemp("test2") / "df.parquet"

    df.write_parquet(path)
    pl.read_parquet(path)  # this works

    ldf = pl.scan_parquet(path)
    ldf.collect()  # this passes!

Log output

>       return wrap_df(ldf.collect(callback))
E       pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("validity mask length must match the number of values"))

Issue description

Might be related:

Expected behavior

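Writing a DataFrame to a Parquet file and reading it back should always work, whether it is read with read_parquet or with scan_parquet followed by collect.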

Installed versions

--------Version info---------
Polars:               1.3.0
Index type:           UInt32
Platform:             Linux-6.7.6-arch1-1-x86_64-with-glibc2.39
Python:               3.10.12 (main, Jul 26 2023, 13:14:21) [Clang 16.0.3 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                2.0.1
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
danielgafni added the bug, needs triage, and python labels Jul 29, 2024
danielgafni (author) commented

Oh, I found the cause: it doesn't work with pl.Struct anymore. Excluding this type from the dataframes strategy makes the test pass.

danielgafni changed the title from "Polars 1.3.0 sometimes fails to collect DataFrames from Parquet" to "Polars 1.3.0 fails to collect scanned Parquet containing struct columns" Jul 29, 2024
coastalwhite added the P-high and A-io-parquet labels and removed the needs triage label Jul 30, 2024
coastalwhite self-assigned this Jul 30, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue Jul 31, 2024
coastalwhite linked a pull request Jul 31, 2024 that will close this issue
c-peters added the accepted label Aug 5, 2024