Joining on a struct column of lazy dataframes and then collecting gives Panic Exception #10721

Open
2 tasks done
2spmohanty opened this issue Aug 24, 2023 · 2 comments · May be fixed by #21093
Labels
A-panic Area: code that results in panic exceptions bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@2spmohanty

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

src_0_lazy_df and src_1_lazy_df are two lazy dataframes (LazyFrames).

src_0_lazy_df.collect()
┌───────┬────────────┬────────────┬─────────┬───┬────────────┬────────────┬────────────┬───────────┐
│ rtype ┆ TINYINT_CO ┆ SMALLINT_C ┆ INT_COL ┆ … ┆ DECIMAL_CO ┆ STRING_COL ┆ DATE_COL   ┆ DATETIME_ │
│ ---   ┆ L          ┆ OL         ┆ ---     ┆   ┆ L          ┆ ---        ┆ ---        ┆ COL       │
│ str   ┆ ---        ┆ ---        ┆ i32     ┆   ┆ ---        ┆ str        ┆ datetime[m ┆ ---       │
│       ┆ bool       ┆ i16        ┆         ┆   ┆ f64        ┆            ┆ s]         ┆ datetime[ │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆            ┆ ms]       │
╞═══════╪════════════╪════════════╪═════════╪═══╪════════════╪════════════╪════════════╪═══════════╡
│ D     ┆ false      ┆ 19706      ┆ 123     ┆ … ┆ 164.98     ┆ Smruti     ┆ 1994-04-27 ┆ 1974-08-0 │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆ 00:00:00   ┆ 8         │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆            ┆ 16:28:51  │
│ D     ┆ true       ┆ 23757      ┆ 123     ┆ … ┆ 164.98     ┆ Chetan     ┆ 2019-01-25 ┆ 2001-03-2 │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆ 00:00:00   ┆ 4         │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆            ┆ 22:00:13  │
│ D     ┆ true       ┆ -29931     ┆ 345     ┆ … ┆ 173.88     ┆ Jagan      ┆ 1972-05-14 ┆ 2019-01-1 │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆ 00:00:00   ┆ 5         │
└───────┴────────────┴────────────┴─────────┴───┴────────────┴────────────┴────────────┴───────────┘
src_1_lazy_df.collect()
┌───────┬────────────┬────────────┬─────────┬───┬────────────┬────────────┬────────────┬───────────┐
│ rtype ┆ TINYINT_CO ┆ SMALLINT_C ┆ INT_COL ┆ … ┆ DECIMAL_CO ┆ STRING_COL ┆ DATE_COL   ┆ DATETIME_ │
│ ---   ┆ L          ┆ OL         ┆ ---     ┆   ┆ L          ┆ ---        ┆ ---        ┆ COL       │
│ str   ┆ ---        ┆ ---        ┆ i32     ┆   ┆ ---        ┆ str        ┆ datetime[m ┆ ---       │
│       ┆ bool       ┆ i16        ┆         ┆   ┆ f64        ┆            ┆ s]         ┆ datetime[ │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆            ┆ ms]       │
╞═══════╪════════════╪════════════╪═════════╪═══╪════════════╪════════════╪════════════╪═══════════╡
│ D     ┆ false      ┆ 19706      ┆ 123     ┆ … ┆ 164.98     ┆ Smruti     ┆ 1984-04-27 ┆ 1974-08-0 │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆ 00:00:00   ┆ 8         │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆            ┆ 16:28:51  │
│ D     ┆ true       ┆ 23757      ┆ 123     ┆ … ┆ 164.98     ┆ Chetan     ┆ 2019-01-25 ┆ 2001-03-2 │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆ 00:00:00   ┆ 4         │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆            ┆ 22:00:13  │
│ D     ┆ true       ┆ -29931     ┆ 345     ┆ … ┆ 173.88     ┆ Jagan      ┆ 1982-05-14 ┆ 2019-01-1 │
│       ┆            ┆            ┆         ┆   ┆            ┆            ┆ 00:00:00   ┆ 5         │
└───────┴────────────┴────────────┴─────────┴───┴────────────┴────────────┴────────────┴───────────┘
import polars as pl

join_on = ["INT_COL", "STRING_COL"]
col_list = ["INT_COL", "STRING_COL", "TINYINT_COL", "DECIMAL_COL" ...]

other_columns = [col for col in col_list if col not in join_on]

# Split each frame into two struct columns: one holding the join keys ("pks")
# and one holding all remaining columns.
src0_struct_df = src_0_lazy_df.select(
    pl.struct(join_on).alias("pks"),
    pl.struct(other_columns).alias("Day0"),
)

src1_struct_df = src_1_lazy_df.select(
    pl.struct(join_on).alias("pks"),
    pl.struct(other_columns).alias("Day1"),
)

etl_activity = src0_struct_df.join(src1_struct_df, on="pks", how="outer")
print(etl_activity)  # Works fine: only prints the lazy query plan
print(etl_activity.collect())  # Raises PanicException: not implemented

Issue description

File "C:\Users\smruti\AppData\Roaming\Python\Python310\site-packages\polars\utils\deprecation.py", line 93, in wrapper
    return function(*args, **kwargs)
File "C:\Users\smruti\AppData\Roaming\Python\Python310\site-packages\polars\lazyframe\frame.py", line 1561, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: not implemented

Expected behavior

I have datasets with hundreds of columns. I thought that dividing the columns into two partitions, one holding the join keys and the other holding the remaining columns, would enable a faster lookup of which data was updated, deleted, etc.

Installed versions

--------Version info---------
Polars:              0.18.11
Index type:          UInt32
Platform:            Windows-10-10.0.19044-SP0
Python:              3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          <not installed>
numpy:               1.22.3
pandas:              1.4.2
pyarrow:             <not installed>
pydantic:            1.10.2
sqlalchemy:          1.4.46
xlsx2csv:            0.8.1
xlsxwriter:          3.0.7
None

@2spmohanty 2spmohanty added bug Something isn't working python Related to Python Polars labels Aug 24, 2023
@ritchie46
Member

This is not supported. Maybe consider opening a feature request.

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@stinodego stinodego added the A-panic Area: code that results in panic exceptions label Jun 17, 2024
@lukemanley lukemanley linked a pull request Feb 5, 2025 that will close this issue