BUG: string[pyarrow] dtype doesn't roundtrip through pyarrow #50074

Open
2 of 3 tasks
jrbourbeau opened this issue Dec 5, 2022 · 9 comments
Labels
Arrow (pyarrow functionality), Bug, Strings (String extension data type and string data)

Comments

@jrbourbeau
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": ["foo", "bar", "baz"]}, dtype="string[pyarrow]")
# Round-trip `df` through pyarrow using `Table.from_pandas` and `Table.to_pandas` + `types_mapper=`
types_mapper = {pa.string(): pd.ArrowDtype(pa.string())}
df_pa = pa.Table.from_pandas(df).to_pandas(types_mapper=types_mapper.get)
pd.testing.assert_frame_equal(df, df_pa)
# The assertion above fails with:
# Attribute "dtype" are different
# [left]:  string[pyarrow]
# [right]: string[pyarrow]

Issue Description

While working on improved support for pyarrow-backed dtypes over in Dask, I came across a case where round-tripping a DataFrame with string[pyarrow] dtypes doesn't appear to fully work as expected.

Expected Behavior

When using pa.Table.from_pandas and pa.Table.to_pandas + types_mapper as described in the above example, I would expect to get an equivalent DataFrame back.

Additionally, the error message

Attribute "dtype" are different
[left]:  string[pyarrow]
[right]: string[pyarrow]

is somewhat confusing as, from the information in the error message, it actually looks like these dtypes should be the same.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.9.15.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Thu Sep 29 20:12:57 PDT 2022; root:xnu-8020.240.7~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.2
numpy            : 1.21.6
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 59.8.0
pip              : 22.3.1
Cython           : None
pytest           : 7.2.0
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.7.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : 2022.11.0
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : None
numba            : 0.56.4
numexpr          : 2.8.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 10.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : 2022.11.0
scipy            : 1.9.3
snappy           :
sqlalchemy       : 1.4.44
tables           : 3.7.0
tabulate         : None
xarray           : 2022.11.0
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

@jrbourbeau added the Bug and Needs Triage labels on Dec 5, 2022

@jrbourbeau
Contributor Author

Poking around a bit, I've noticed that the underlying .type attributes of df.x.dtype and df_pa.x.dtype are different:

In [2]: df.x.dtype.type
Out[2]: str

In [3]: df_pa.x.dtype.type
Out[3]: pyarrow.lib.DataType

@mroeschke
Member

Thanks for the report. Yeah, this is confusing because, as of 1.5, there are technically two separate pyarrow-backed string implementations. This could probably be clearer in our docs: https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#pyarrow

In [11]: pd.StringDtype("pyarrow")
Out[11]: string[pyarrow]

In [12]: pd.ArrowDtype(pa.string())
Out[12]: string[pyarrow]

pd.StringDtype + pd.arrays.ArrowStringArray is still more feature-complete than pd.ArrowDtype(pa.string()) + pd.arrays.ArrowExtensionArray, since the latter is not yet plugged into the pyarrow.compute methods for string operations. Once that's integrated, pd.ArrowDtype(pa.string()) + pd.arrays.ArrowExtensionArray could in theory be used for everything: #48469 (comment)


@mroeschke added the Strings and Arrow labels and removed the Needs Triage label on Dec 5, 2022

@jrbourbeau
Contributor Author

Ah, I see -- thanks for clarifying @mroeschke. So, for the time being, it sounds like folks should use pd.StringDtype("pyarrow") when converting pyarrow datatypes to pandas dtypes, correct?

FWIW if I switch to using pd.StringDtype("pyarrow") in the original example, things work as expected:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": ["foo", "bar", "baz"]}, dtype="string[pyarrow]")
# Round-trip `df` through pyarrow using `Table.from_pandas` and `Table.to_pandas` + `types_mapper=`
def types_mapper(pa_type):
    if pa_type == pa.string():
        return pd.StringDtype("pyarrow")
df_pa = pa.Table.from_pandas(df).to_pandas(types_mapper=types_mapper)
pd.testing.assert_frame_equal(df, df_pa)

@mroeschke
Member

> So, for the time being, it sounds like folks should use pd.StringDtype("pyarrow") when converting pyarrow datatypes to pandas dtypes, correct?

Correct (for strings only). For all other pyarrow data types pd.ArrowDtype should work.

@jrbourbeau
Contributor Author

Thanks -- that also fixes the corresponding issue I was running into over in Dask

https://github.com/dask/dask/pull/9719/files#diff-965eb1b3afb3ebaa80c6c3d896d50044911569d696e5e25807d5e22f8d22d668R1588-R1594

@rachtsingh

To check my understanding, df = df.astype(dtype={'col': 'string[pyarrow]'}) sets the column dtype to be ArrowStringArray, but I actually want ArrowExtensionArray? I've been trying to figure out why str.split was slow, and it looks like it's calling the object-based API with a Python for loop, which isn't what I wanted.

Does Pandas have an issue tracking when the above astype call will return an ArrowExtensionArray? That looks like the eventual goal, right?

@mroeschke
Member

mroeschke commented Oct 24, 2023

> but I actually want ArrowExtensionArray?

Yes, you can use pd.ArrowDtype(pyarrow.string()).

> Does Pandas have an issue tracking when the above astype call will return an ArrowExtensionArray? That looks like the eventual goal, right?

Hopefully in pandas 3.0 (coming out next year), where strings will be inferred and stored with pyarrow.

@randolf-scholz
Contributor

randolf-scholz commented Mar 19, 2024

Another roundtrip issue arises when simply serializing/deserializing a DataFrame to parquet (#42664):

import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="string[pyarrow]")
df.to_parquet("foo.parquet")

df2 = pd.read_parquet("foo.parquet")
pd.testing.assert_frame_equal(df, df2)  # left: string[pyarrow], right: string[python]

df3 = pd.read_parquet("foo.parquet", dtype_backend="pyarrow")
pd.testing.assert_frame_equal(df, df3)  # left: string[pyarrow], right: large_string[pyarrow]

@thesword53

Another issue is that string[pyarrow] does not behave like ArrowDtype(pa.string()), which can be selected with the alias str[pyarrow] instead of string[pyarrow]:

>>> s = pd.Series(["a,b,c", "c,d"], dtype="str[pyarrow]")
>>> s.str.split(",")
0    ['a' 'b' 'c']
1        ['c' 'd']
dtype: list<item: string>[pyarrow]
>>> s = pd.Series(["a,b,c", "c,d"], dtype="string[pyarrow]")
>>> s.str.split(",")
0    [a, b, c]
1       [c, d]
dtype: object
