BUG: string[pyarrow] dtype doesn't roundtrip through pyarrow #50074
Comments
Poking around a bit, I've noticed that the underlying
In [2]: df.x.dtype.type
Out[2]: str
In [3]: df_pa.x.dtype.type
Out[3]: pyarrow.lib.DataType
Thanks for the report. Yeah, this is confusing because as of 1.5 there are technically two separate pyarrow-backed string implementations. This could probably be clearer in our docs: https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#pyarrow
Ah, I see -- thanks for clarifying @mroeschke. So, for the time being, it sounds like folks should use
FWIW if I switch to using
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": ["foo", "bar", "baz"]}, dtype="string[pyarrow]")

# Round-trip `df` through pyarrow using `Table.from_pandas` and `Table.to_pandas` + `types_mapper=`
def types_mapper(pa_type):
    # Map pyarrow strings back to the pyarrow-backed pandas string dtype;
    # returning None for other types leaves the default conversion in place.
    if pa_type == pa.string():
        return pd.StringDtype("pyarrow")

df_pa = pa.Table.from_pandas(df).to_pandas(types_mapper=types_mapper)
pd.testing.assert_frame_equal(df, df_pa)
Correct (for strings only). For all other pyarrow data types
Thanks -- that also fixes the corresponding issue I was running into over in Dask
To check my understanding,
Does pandas have an issue tracking when the above astype call will return an ArrowExtensionArray? That looks like the eventual goal, right?
Yes you can use
Hopefully in pandas 3.0 (coming out next year), where strings will be inferred and stored with pyarrow.
Another roundtrip issue arises when simply serializing/deserializing a DataFrame to Parquet:
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="string[pyarrow]")
df.to_parquet("foo.parquet")

df2 = pd.read_parquet("foo.parquet")
pd.testing.assert_frame_equal(df, df2)  # left: string[pyarrow], right: string[python]

df3 = pd.read_parquet("foo.parquet", dtype_backend="pyarrow")
pd.testing.assert_frame_equal(df, df3)  # left: string[pyarrow], right: large_string[pyarrow]
Another issue is
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
While working on improved support for pyarrow-backed dtypes over in Dask, I came across a case where round-tripping a DataFrame with string[pyarrow] dtypes doesn't appear to fully work as expected.
Expected Behavior
When using pa.Table.from_pandas and pa.Table.to_pandas + types_mapper as described in the above example, I would expect to get an equivalent DataFrame back.
Additionally, the
is somewhat confusing as, from the information in the error message, it actually looks like these dtypes should be the same.
Installed Versions