
to_pandas() breaks Python objects in object column #13021

Closed · 2 tasks done
unarist opened this issue Dec 12, 2023 · 6 comments · Fixed by #13910
Labels
bug (Something isn't working) · P-low (Priority: low) · python (Related to Python Polars)

Comments

unarist commented Dec 12, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pandas as pd
import polars as pl
from pathlib import Path

df = pl.DataFrame([Path("index.html")], schema={"path": pl.Object})
print(df.to_dicts())
print(df.to_pandas().to_dict(orient="records"))

Output:

[{'path': WindowsPath('index.html')}]
[{'path': b'\x80\x1bbJ\xdf\x01\x00\x00'}]

Log output

No response

Issue description

Both Polars and Pandas can hold Python objects such as pathlib.Path in an object-dtype column.

Converting such a DataFrame with to_pandas() doesn't throw an error, but the values come out broken.

Expected behavior

[{'path': WindowsPath('index.html')}]
[{'path': WindowsPath('index.html')}]

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32
Platform:             Windows-10-10.0.22621-SP0
Python:               3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           3.7.1
numpy:                1.24.2
openpyxl:             <not installed>
pandas:               1.5.3
pyarrow:              11.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
unarist added the bug (Something isn't working) and python (Related to Python Polars) labels on Dec 12, 2023
cmdlineluser (Contributor) commented

It seems to be happening during the pyarrow conversion:

arrow_interop::to_py::to_py_rb(&rb, &names, py, pyarrow)

Which does some low-level stuff:

pub(crate) fn to_py_array(array: ArrayRef, py: Python, pyarrow: &PyModule) -> PyResult<PyObject> {

The Path ends up as a fixed_size_binary type:

pl.DataFrame([Path("index.html")]).to_arrow()
# pyarrow.Table
# column_0: fixed_size_binary[8]
# ----
# column_0: [[60BA073201000000]]

Does anybody know what this binary value represents and if the original object can be reconstructed from it?

barak1412 (Contributor) commented

Same here. My workaround is:

import pandas as pd

df = ...  # a Polars DataFrame with an Object column
# Build the pandas DataFrame from Python dicts instead of calling to_pandas();
# this keeps the Python objects intact, at the cost of a row-wise conversion.
print(pd.DataFrame(df.to_dicts()))

stinodego added the needs triage (Awaiting prioritization by a maintainer) label on Jan 13, 2024
itamarst commented Jan 19, 2024

I had to change schema=["path"] to schema={"path": pl.Object}; with the former, to_pandas() panics with "not implemented". With the latter, I can reproduce the issue.

The 8-byte value is presumably the memory address of the Python object, interpreted as bytes instead of as a PyObject*.
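
A quick sanity check of that hypothesis (my own sketch, not from the issue): it assumes CPython, where id() returns the object's memory address, a little-endian platform such as x86-64, and an affected Polars version (e.g. 0.19.19) where to_pandas() returns the raw 8 bytes.

import struct
from pathlib import Path

import polars as pl

p = Path("index.html")
df = pl.DataFrame([p], schema={"path": pl.Object})

# On an affected version this is the broken 8-byte value from the report.
raw = df.to_pandas().to_dict(orient="records")[0]["path"]

# Interpret the bytes as a little-endian unsigned 64-bit integer and compare
# it with the object's address; True would confirm the "raw pointer" theory.
addr = struct.unpack("<Q", raw)[0]
print(hex(addr), hex(id(p)), addr == id(p))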

@stinodego I will try to fix this.

itamarst commented Jan 19, 2024

  • Fix with use_pyarrow_extension_array=True
  • Fix with use_pyarrow_extension_array=False (both call forms are sketched below)
  • Fix the same issue with Series.to_pandas()
  • Tests for both cases (Series/DataFrame)
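
For context, these are the two conversion paths the checklist refers to; the parameter exists on both DataFrame.to_pandas and Series.to_pandas. This is my own sketch using a plain Int64 column, since how an Object column behaves on each path depends on the Polars version.

import polars as pl

df = pl.DataFrame({"n": [1, 2, 3]})

# Default path: NumPy-backed pandas columns.
pdf_numpy = df.to_pandas(use_pyarrow_extension_array=False)
print(pdf_numpy.dtypes)  # n    int64

# PyArrow-backed path (generally requires a recent pandas and pyarrow).
pdf_arrow = df.to_pandas(use_pyarrow_extension_array=True)
print(pdf_arrow.dtypes)  # n    int64[pyarrow]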

itamarst commented Jan 19, 2024

What I've learned so far:

  • Polars stores Python objects as an Extension("POLARS_EXTENSION_TYPE", FixedSizeBinary(8), Some("1705683441296583929;124507012827808")) type.
  • FixedSizeBinary is what ends up in the PyArrow-controlled table.
  • This then gets turned into a binary string once it hits Pandas.
  • Pandas doesn't have an Arrow-based representation for an object dtype.

Option 1: Casting

  • For 64-bit platforms the Extension could use an int64 instead of 8 bytes.
  • When exported via PyArrow, the result would be a Pandas Series of int64s.
  • That memory could then be reinterpreted as PyObject* by changing the dtype. That can't be done from Python: the APIs don't expose this use case, so it would have to go through a Pandas low-level API (if that even exists; seemingly not) or, more likely, the NumPy C API. This is not a well-supported use case. (A CPython-only illustration of that reinterpretation is sketched right after this list.)
  • If I'm understanding correctly, the pointers are being copied from Arrow to a NumPy array. That effectively means an additional reference to all the Python objects, so they'd all have to be incref-d.
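
For illustration only (my own sketch, not from the issue and not what the fix does): on CPython an integer address can be reinterpreted as an object reference with ctypes, which is essentially the reinterpretation step above, and it also shows why the approach is fragile: no reference counting happens.

import ctypes
from pathlib import Path

p = Path("index.html")
addr = id(p)  # on CPython, id() is the object's memory address, i.e. the PyObject*

# Reinterpret the integer address as an object reference. This takes no new
# reference, so it is only safe while `p` is still alive; a real fix would
# have to incref each object, as noted in the last bullet.
recovered = ctypes.cast(addr, ctypes.py_object).value
print(recovered is p)  # True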

Option 2: Bypass PyArrow

pl.Series.to_numpy() does do the right thing for the pl.Object dtype, so we can just use that for those particular columns.
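
A user-level sketch of the same idea (my own illustration, not the actual patch): convert the non-Object columns through to_pandas() as usual and route the Object columns through Series.to_numpy().

from pathlib import Path

import polars as pl

df = pl.DataFrame(
    {"path": [Path("index.html")], "n": [1]},
    schema={"path": pl.Object, "n": pl.Int64},
)

# Convert everything except the Object columns through the normal path...
object_cols = [name for name, dtype in df.schema.items() if dtype == pl.Object]
pdf = df.drop(object_cols).to_pandas()

# ...and fill the Object columns in via to_numpy(), which keeps the Python
# objects intact.
for name in object_cols:
    pdf[name] = df.get_column(name).to_numpy()

pdf = pdf[df.columns]  # restore the original column order
print(pdf.to_dict(orient="records"))
# e.g. [{'path': WindowsPath('index.html'), 'n': 1}] (PosixPath on non-Windows)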

itamarst (Contributor) commented

I got Option 2 working pretty easily, so just need to write tests.

itamarst pushed commits to itamarst/polars referencing this issue on Jan 19 and Jan 22, 2024
stinodego added the P-low (Priority: low) label and removed the needs triage label on Jan 25, 2024
itamarst pushed two more commits to itamarst/polars referencing this issue on Jan 26, 2024