Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.convert_dtypes() converting object to pd.ArrowDtype instead of pd.StringDtype #50971

Open
3 tasks done
jrbourbeau opened this issue Jan 25, 2023 · 9 comments
Open
3 tasks done
Labels
Arrow pyarrow functionality Closing Candidate May be closeable, needs more eyeballs Dtype Conversions Unexpected or buggy dtype conversions

Comments

@jrbourbeau
Copy link
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3], "y": ["foo", "bar", "baz"]})
with pd.option_context("mode.dtype_backend", "pyarrow"):
    df = df.convert_dtypes()

assert df.y.dtype == pd.ArrowDtype(pa.string()) # this passes
assert df.y.dtype == pd.StringDtype("pyarrow")  # this raises

Issue Description

The DataFrame.convert_dtypes() docstring describes the convert_string= keyword as

convert_string: bool, default True
    Whether object dtypes should be converted to `StringDtype()`.

However, when using the pyarrow dtype backend, it looks like convert_dtypes() is actually converting to pd.ArrowDtype(pa.string()) instead of pd.StringDtype("pyarrow"). This seems to differ from the docstring description. It's also my current understanding (which could totally be wrong), that we should generally prefer pd.StringDtype("pyarrow") today as it's more feature complete (xref #50074 (comment))

cc @mroeschke @phofl for visibility

Expected Behavior

DataFrame.convert_dtypes() should convert to pd.StringDtype("pyarrow") instead of pd.ArrowDtype(pa.string()) when using the pyarrow dtype backend

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 7f2aa8f46a4a36937b1be8ec2498c4ff2dd4cf34
python           : 3.10.4.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.2.0
Version          : Darwin Kernel Version 22.2.0: Fri Nov 11 02:08:47 PST 2022; root:xnu-8792.61.2~4/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.0.dev0+1309.g7f2aa8f46a
numpy            : 1.24.1
pytz             : 2022.1
dateutil         : 2.8.2
setuptools       : 59.8.0
pip              : 22.0.4
Cython           : None
pytest           : 7.1.3
hypothesis       : None
sphinx           : 4.5.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.2.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : 2022.12.1.dev6
fsspec           : 2023.1.0+5.g012816b
gcsfs            : None
matplotlib       : 3.5.1
numba            : None
numexpr          : 2.8.0
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 11.0.0.dev316
pyreadstat       : None
pyxlsb           : None
s3fs             : 2022.10.0
scipy            : 1.9.0
snappy           :
sqlalchemy       : 1.4.35
tables           : 3.7.0
tabulate         : None
xarray           : 2022.3.0
xlrd             : None
zstandard        : None
tzdata           : None
qtpy             : None
pyqt5            : None
@jrbourbeau jrbourbeau added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 25, 2023
@phofl phofl added Dtype Conversions Unexpected or buggy dtype conversions Arrow pyarrow functionality labels Feb 2, 2023
@phofl
Copy link
Member

phofl commented Feb 2, 2023

Using the ArrowExtensionArray is what we are doing for all the I/O methods, hence this is similar here (should probably adjust the docstring).

That said, #50325 is the only feature missing from pd.ArrowDtype(pa.string()) compared to StringDtype (with pyarrow) and we should get that in for 2.0. Afterwards using pd.ArrowDtype(pa.string()) should be fine for downstream libraries.

@jrbourbeau
Copy link
Contributor Author

Using the ArrowExtensionArray is what we are doing for all the I/O methods

That said, #50325 is the only feature missing from pd.ArrowDtype(pa.string()) compared to StringDtype (with pyarrow) and we should get that in for 2.0

Ah, I see. For some reason I thought strings were special cased to still use pd.StringDtype("pyarrow") in, for example, read_parquet. In that case, I agree that a docstring update is probably all that's needed here

@jrbourbeau
Copy link
Contributor Author

jrbourbeau commented Feb 16, 2023

That said, #50325 is the only feature missing from pd.ArrowDtype(pa.string()) compared to StringDtype (with pyarrow) and we should get that in for 2.0. Afterwards using pd.ArrowDtype(pa.string()) should be fine for downstream libraries.

Just following up on this comment. Now that #51207, does this mean there's no longer a need to use pd.StringDtype("pyarrow") as pd.ArrowDtype(pa.string()) is now just as performant/feature complete?

@mroeschke
Copy link
Member

Correct but there is a slight caveat.

In terms of the string operations, using pd.StringDtype("pyarrow") is more permissive and will fall back to numpy-based operations while using pd.ArrowDtype(pa.string()) is more strict and will raise NotImplementedError if the associated pyarrow compute function is not available. The fall back behavior of pd.StringDtype("pyarrow") is essentially equivalent to using pd.StringDtype("python") for those methods.

Would that significantly impact the Dask use case?

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Feb 17, 2023
@jorisvandenbossche jorisvandenbossche added Blocker Blocking issue or pull request for an upcoming release and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 17, 2023
@jorisvandenbossche
Copy link
Member

I think convert_dtypes should prefer StringDtype over ArrowDtype. We have already been exposing this for a longer time, and so people could already do (using pandas 1.5 here):

In [1]: pd.options.mode.string_storage = "pyarrow"

In [2]: df = pd.DataFrame({'a': ["foo", "bar"]}, dtype=object)

In [3]: df2 = df.convert_dtypes()

In [5]: df2.dtypes
Out[5]: 
a    string
dtype: object

In [6]: df2.dtypes[0]
Out[6]: string[pyarrow]

Changing that to ArrowDtype is a breaking change (since as mentioned the behaviour is not exactly the same for all operations). Of course this is still experimental, so we can do breaking changes. But personally, I think the StringDtype behaviour is actually better for most users, and is also the nicer API.

(and IMO we should do that everywhere, so not only in convert_dtypes but also in the IO methods. Eg also for read_parquet this is a change in behaviour)

@phofl phofl closed this as completed Feb 17, 2023
@phofl phofl reopened this Feb 17, 2023
@phofl
Copy link
Member

phofl commented Feb 17, 2023

Sorry missed the button.

this is still the behavior on 2.0, but we are using the ArrowExtensionArray, when the dtype_backend option is set on top of it (which is new for 2.0)

@mroeschke
Copy link
Member

Right in the above example this is still StringDtype("pyarrow")

In [2]: type(df2.dtypes[0])
Out[2]: pandas.core.arrays.string_.StringDtype

While I'm not opposed to the StringDtype("pyarrow") API style in theory (though for consistency there should be equivalent Type("storage") for all dtypes), we should move users away from the StringDtype("pyarrow") array implementation of sometime falling back or directly returning numpy backed objects. As we've been working toward removing silent casting to different types in other operations, I think the same should apply here; if an operation is not supported by the arrow-backed implementation it would be more explicit for the user to cast to the numpy based implementation if the operation is supported there.

@phofl
Copy link
Member

phofl commented Mar 25, 2023

I think we can close here? Since the addition of dtype_backend this can be configured explicitly

@phofl
Copy link
Member

phofl commented Mar 27, 2023

Removing milestone and blocker for now

@phofl phofl removed the Blocker Blocking issue or pull request for an upcoming release label Mar 27, 2023
@phofl phofl removed this from the 2.0 milestone Mar 27, 2023
@phofl phofl added Closing Candidate May be closeable, needs more eyeballs and removed Bug labels Mar 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Closing Candidate May be closeable, needs more eyeballs Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

No branches or pull requests

4 participants