API/DEPR: dtype=(str|bytes) interpret as pyarrow #52429

Open
jbrockmendel opened this issue Apr 4, 2023 · 7 comments
Labels
Arrow pyarrow functionality Deprecate Functionality to remove in pandas Dtype Conversions Unexpected or buggy dtype conversions Strings String extension data type and string data

Comments

@jbrockmendel
Member

jbrockmendel commented Apr 4, 2023

In 2.0 we made a lot of progress in ensuring that passing dtype=foo or calling .astype(foo) actually returns the requested dtype rather than silently giving something else. bytes and str are the main remaining cases where we silently do something else (cast to object, but not as consistently as intended).

Instead, let's interpret dtype=str as string[pyarrow] and dtype=bytes as bytes[pyarrow] (with a deprecation cycle, and once we require pyarrow)
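
To make the behavior under discussion concrete, here is a minimal sketch that assumes only that pandas is installed. What dtype=str produces depends on the pandas version (object on the 2.0 release this issue refers to), so the snippet prints the resulting dtype instead of hard-coding one. The variable names are illustrative.

```python
import pandas as pd

# Request "str" and see what pandas actually hands back.
silent = pd.Series(["foo", "bar"], dtype=str)

# Request the extension string dtype explicitly via its alias.
explicit = pd.Series(["foo", "bar"], dtype="string")

# True on versions where dtype=str is silently cast to object.
cast_to_object = silent.dtype == object

print("pandas", pd.__version__)
print("dtype=str      ->", silent.dtype)
print('dtype="string" ->', explicit.dtype)
print("silently cast to object:", cast_to_object)
```

The explicit alias always yields a pd.StringDtype, which is the kind of "you get what you asked for" behavior the proposal wants for dtype=str as well.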

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 4, 2023
@mroeschke mroeschke added Dtype Conversions Unexpected or buggy dtype conversions Strings String extension data type and string data Deprecate Functionality to remove in pandas Arrow pyarrow functionality and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 5, 2023
@Dr-Irv
Contributor

Dr-Irv commented Apr 13, 2023

Another option is to interpret dtype=str as dtype=pd.StringDtype() . I don't know why one would pick string[pyarrow] versus the extension dtype we already created. I'm sure there are good reasons to prefer the pyarrow implementation, but can that be clarified?
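
For context on the alternative raised here, the sketch below shows that pd.StringDtype takes a storage argument, so "the extension dtype we already created" and string[pyarrow] are two storage choices of the same dtype class. It uses storage="python" so that it runs without pyarrow; the default storage printed at the end depends on the installed version and options, so no value is assumed for it.

```python
import pandas as pd

# Explicitly python-backed extension string dtype (no pyarrow needed).
python_backed = pd.StringDtype(storage="python")
ser = pd.Series(["foo", None], dtype=python_backed)

# Missing values are tracked by the dtype rather than stored as plain objects.
missing_mask = ser.isna().tolist()

print(ser.dtype, "storage:", python_backed.storage)
print("missing mask:", missing_mask)

# The storage picked by a bare pd.StringDtype() is configuration dependent.
print("default storage:", pd.StringDtype().storage)
```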

@simonjayhawkins
Member

I don't know why one would pick string[pyarrow] versus the extension dtype we already created. I'm sure there are good reasons to prefer the pyarrow implementation, but can that be clarified?

I'm also not clear (up-to-date) on what the thinking is here. (hence my comment in #52509 (comment))

@jbrockmendel
Member Author

A simple case I ran into today where string[pyarrow] outperforms the other two by roughly 10x:

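import pandas as pd

data = ["foo", "bar", "baz", "pow", "zap"]
ser = pd.Series(data * 10**6)  # inferred dtype (object on pandas 2.0)
ser2 = ser.astype("string")  # pd.StringDtype()
ser3 = ser.astype("string[pyarrow]")  # pyarrow-backed strings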

%timeit ser == "foo"
222 ms ± 8.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser2 == "foo"
249 ms ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser3 == "foo"
24.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
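
The %timeit lines above need IPython. A rough way to repeat the comparison from a plain script is sketched below, using the standard timeit module. The helper name time_equality and the smaller input size are assumptions made for illustration, and the pyarrow-backed series is only included when pyarrow can be imported. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd


def time_equality(ser, repeat=3):
    """Best wall time in seconds of `ser == "foo"` over `repeat` runs."""
    return min(timeit.repeat(lambda: ser == "foo", number=1, repeat=repeat))


data = ["foo", "bar", "baz", "pow", "zap"]
ser = pd.Series(data * 10**5, dtype=object)  # smaller than the original run

candidates = {"object": ser, "string": ser.astype("string")}
try:
    # Only available when a compatible pyarrow is installed.
    candidates["string[pyarrow]"] = ser.astype("string[pyarrow]")
except ImportError:
    pass

timings = {name: time_equality(s) for name, s in candidates.items()}
for name, seconds in timings.items():
    print(f"{name:>16}: {seconds * 1e3:.1f} ms")
```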

@jbrockmendel
Member Author

@datapythonista both here and in #52711 a request has been made to explain how great pyarrow string dtypes are. Want to sing their praises?

@jbrockmendel
Member Author

Looking at #35864, it looks like "zfill" isn't implemented in arrow yet, so it is slightly slower, but the other string methods mentioned later in that thread outperform by quite a bit:

import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
ser1 = non_padded.astype(str)
ser2 = non_padded.astype("string")
ser3 = non_padded.astype("string[pyarrow]")

%timeit ser.str.zfill(5)  # run with ser = ser1, ser2, ser3 in turn
1.67 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser1
1.78 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser2
2.16 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  #  <- ser3

%timeit ser.str.upper()  # run with ser = ser1, ser2, ser3 in turn
1.97 ms ± 97.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser1
2.02 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser2
163 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # <- ser3
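
Timing comparisons like these are only meaningful if every dtype returns the same values. The sketch below is a quick sanity check of that for zfill, under the assumption that comparing the results as Python lists is a fair equivalence test. The seeded generator replaces np.random.randint purely for reproducibility, and the pyarrow-backed series is checked only when pyarrow is importable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the check is repeatable
non_padded = pd.Series(rng.integers(100, 99999, size=10_000))

as_str = non_padded.astype(str)
as_string = non_padded.astype("string")

padded_str = as_str.str.zfill(5)
padded_string = as_string.str.zfill(5)

# Both backends should produce identical zero-padded strings.
same_values = padded_str.tolist() == padded_string.tolist()

try:
    as_arrow = non_padded.astype("string[pyarrow]")
    same_values = same_values and (
        as_arrow.str.zfill(5).tolist() == padded_str.tolist()
    )
except ImportError:
    pass  # pyarrow not installed; skip the third backend

print("results agree across dtypes:", same_values)
```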

@jbrockmendel
Member Author

xref #49398

@jorisvandenbossche
Member

For the str part of this issue:

Another option is to interpret dtype=str as dtype=pd.StringDtype()

With PDEP-14 accepted, the idea is that dtype=str will be an alias for the new future default string dtype (i.e. pd.StringDtype(na_value=np.nan))
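
As a hedged illustration of the dtype named above: the sketch below constructs pd.StringDtype(na_value=np.nan) and shows that missing values come back as NaN. The na_value keyword is not accepted by older pandas releases, so the snippet falls back to None there instead of failing; that fallback is an assumption of this example, not part of PDEP-14.

```python
import numpy as np
import pandas as pd

try:
    nan_string = pd.StringDtype(na_value=np.nan)
except TypeError:
    nan_string = None  # this pandas version has no na_value keyword

missing = None
if nan_string is not None:
    ser = pd.Series(["foo", None], dtype=nan_string)
    missing = ser.iloc[1]  # missing entries use NaN semantics
    print(ser.dtype, "-> missing value:", repr(missing))
else:
    print("pd.StringDtype(na_value=...) is not available on", pd.__version__)
```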

5 participants