BUG: DataFrame.convert_dtypes() converting object to pd.ArrowDtype instead of pd.StringDtype #50971
Comments
Using the ArrowExtensionArray is what we are doing for all the I/O methods, hence this is similar here (should probably adjust the docstring). That said, #50325 is the only feature missing from
Ah, I see. For some reason I thought strings were special cased to still use
Just following up on this comment. Now that #51207, does this mean there's no longer a need to use
Correct but there is a slight caveat. In terms of the string operations, using Would that significantly impact the Dask use case?
I think
Changing that to ArrowDtype is a breaking change (since as mentioned the behaviour is not exactly the same for all operations). Of course this is still experimental, so we can do breaking changes. But personally, I think the StringDtype behaviour is actually better for most users, and is also the nicer API. (and IMO we should do that everywhere, so not only in convert_dtypes but also in the IO methods. Eg also for
Sorry, missed the button. This is still the behavior on 2.0, but we are using the ArrowExtensionArray when the dtype_backend option is set on top of it (which is new for 2.0)
Right in the above example this is still
While I'm not opposed to the
I think we can close here? Since the addition of dtype_backend this can be configured explicitly
Removing milestone and blocker for now
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The DataFrame.convert_dtypes() docstring describes the convert_string= keyword as
However, when using the pyarrow dtype backend, it looks like convert_dtypes() is actually converting to pd.ArrowDtype(pa.string()) instead of pd.StringDtype("pyarrow"). This seems to differ from the docstring description. It's also my current understanding (which could totally be wrong) that we should generally prefer pd.StringDtype("pyarrow") today as it's more feature complete (xref #50074 (comment))
cc @mroeschke @phofl for visibility
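For readers who want to check which dtype they get, here is a minimal sketch. It is not the original reproducer from this issue. The helper names classify_string_dtype and converted_dtype are hypothetical, and the dtype_backend="pyarrow" keyword assumes pandas 2.0 or later with pyarrow installed. Since the two pyarrow-backed string dtypes may display similarly, the sketch checks the dtype class rather than its printed name.

```python
import pandas as pd


def classify_string_dtype(dtype) -> str:
    """Label a dtype so ArrowDtype and StringDtype can be told apart."""
    # Hypothetical helper: check the class, since the printed names can look alike.
    if isinstance(dtype, pd.ArrowDtype):
        return "ArrowDtype"
    if isinstance(dtype, pd.StringDtype):
        return f"StringDtype({dtype.storage!r})"
    return str(dtype)


def converted_dtype(values, **kwargs):
    """Run convert_dtypes() on one object column and return the resulting dtype."""
    # Hypothetical helper: kwargs are passed straight through to convert_dtypes().
    df = pd.DataFrame({"x": pd.Series(values, dtype=object)})
    return df.convert_dtypes(**kwargs)["x"].dtype


if __name__ == "__main__":
    # Default backend: object strings become a StringDtype.
    print(classify_string_dtype(converted_dtype(["a", "b", None])))

    try:
        import pyarrow  # noqa: F401  (only needed for the pyarrow backend)
    except ImportError:
        print("pyarrow not installed; skipping the pyarrow backend check")
    else:
        # Assumes pandas >= 2.0, where convert_dtypes() accepts dtype_backend.
        result = converted_dtype(["a", "b", None], dtype_backend="pyarrow")
        print(classify_string_dtype(result))
```

Comparing the two printed labels shows whether the pyarrow backend produced an ArrowDtype or a StringDtype on the installed pandas version.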
Expected Behavior
DataFrame.convert_dtypes() should convert to pd.StringDtype("pyarrow") instead of pd.ArrowDtype(pa.string()) when using the pyarrow dtype backend
Installed Versions