API/DEPR: dtype=(str|bytes) interpret as pyarrow #52429

jbrockmendel · 2023-04-04T23:08:12Z

In 2.0 we made a lot of progress in ensuring that passing dtype=foo or calling .astype(foo) actually returns the requested dtype rather than silently giving something else. bytes and str are the main remaining cases where we silently do something else (cast to object, but not as consistently as intended).

Instead, let's interpret dtype=str as string[pyarrow] and dtype=bytes as bytes[pyarrow] (with a deprecation cycle, and once we require pyarrow).
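
To make the proposal concrete, here is a minimal sketch of how one might inspect what dtype=str returns today next to the explicit string[pyarrow] spelling. The variable names are illustrative, the dtype printed for dtype=str depends on the installed pandas version (object is the behavior described above for 2.0), and pyarrow is treated as optional.

```python
import pandas as pd

values = ["foo", "bar"]

# What the constructor and astype hand back for dtype=str.
# The post above describes this as object in 2.0; other versions may differ.
from_ctor = pd.Series(values, dtype=str)
from_astype = pd.Series(values).astype(str)
print(f"dtype=str       -> {from_ctor.dtype!r}")
print(f".astype(str)    -> {from_astype.dtype!r}")

# The explicit spelling of what the proposal would make dtype=str mean.
try:
    explicit = pd.Series(values, dtype="string[pyarrow]")
    print(f"string[pyarrow] -> {explicit.dtype!r}")
except ImportError:
    # pandas raises ImportError when pyarrow is missing or too old.
    explicit = None
    print("pyarrow not available; skipping string[pyarrow]")
```

repr is used for the dtypes because str() of a string extension dtype omits the storage backend.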

Dr-Irv · 2023-04-13T20:58:36Z

Another option is to interpret dtype=str as dtype=pd.StringDtype() . I don't know why one would pick string[pyarrow] versus the extension dtype we already created. I'm sure there are good reasons to prefer the pyarrow implementation, but can that be clarified?
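
For context on the two spellings being compared, a small sketch (not taken from the thread): "string[python]" and "string[pyarrow]" both resolve to pd.StringDtype and differ in storage backend, while pd.StringDtype() with no argument takes its storage from the "mode.string_storage" option, so its default can vary by installation and version.

```python
import pandas as pd

# pd.StringDtype() picks its storage from the "mode.string_storage" option,
# so the value printed here is not the same everywhere.
default_dtype = pd.StringDtype()
print("default storage:", default_dtype.storage)

# "string[python]" names the extension dtype with object-backed storage.
python_backed = pd.Series(["foo", None], dtype="string[python]")
print(repr(python_backed.dtype), type(python_backed.dtype).__name__)

try:
    arrow_backed = pd.Series(["foo", None], dtype="string[pyarrow]")
except ImportError:
    # pyarrow is optional; without it only the python-backed variant exists.
    arrow_backed = None
else:
    # Same dtype class, different storage backend.
    print(repr(arrow_backed.dtype), type(arrow_backed.dtype).__name__)
```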

simonjayhawkins · 2023-04-14T10:20:39Z

> I don't know why one would pick string[pyarrow] versus the extension dtype we already created. I'm sure there are good reasons to prefer the pyarrow implementation, but can that be clarified?

I'm also not clear (or up to date) on what the thinking is here, hence my comment in #52509.

jbrockmendel · 2023-04-14T23:41:45Z

A simple case I ran into today where string[pyarrow] outperforms the others by roughly 10x:

import pandas as pd

data = ["foo", "bar", "baz", "pow", "zap"]
ser = pd.Series(data * 10**6)             # object dtype
ser2 = ser.astype("string")               # StringDtype
ser3 = ser.astype("string[pyarrow]")      # pyarrow-backed StringDtype

%timeit ser == "foo"
222 ms ± 8.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser2 == "foo"
249 ms ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit ser3 == "foo"
24.2 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
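
For anyone who wants to rerun this outside IPython, a rough sketch using the standard-library timeit module follows. It is an adaptation rather than the exact benchmark above: the series is 10x smaller to keep the run short, the baseline is forced to object dtype, "string[python]" is spelled out explicitly, and the pyarrow variant is skipped when pyarrow is unavailable. Absolute numbers will differ by machine and version.

```python
import timeit

import pandas as pd

data = ["foo", "bar", "baz", "pow", "zap"]
# Smaller than the 10**6 repeats used above, to keep the run quick.
ser = pd.Series(data * 10**5, dtype=object)

candidates = {
    "object": ser,
    "string[python]": ser.astype("string[python]"),
}
try:
    candidates["string[pyarrow]"] = ser.astype("string[pyarrow]")
except ImportError:
    pass  # pyarrow is optional here

expected = (ser == "foo").to_numpy()
for name, s in candidates.items():
    result = s == "foo"
    # Every backend should give the same answer; only the speed differs.
    assert (result.to_numpy(dtype=bool) == expected).all()
    best = min(timeit.repeat(lambda: s == "foo", number=3, repeat=3)) / 3
    print(f"{name:<16} {best * 1e3:8.2f} ms per comparison")
```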

jbrockmendel · 2023-04-20T22:43:12Z

@datapythonista both here and in #52711 a request has been made to explain how great pyarrow string dtypes are. Want to sing their praises?

jbrockmendel · 2023-04-20T23:10:57Z

Looking at #35864, it looks like "zfill" isn't implemented in arrow yet, so it is slightly slower there, but other string methods mentioned later in that thread are considerably faster:

import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
ser1 = non_padded.astype(str)
ser2 = non_padded.astype("string")
ser3 = non_padded.astype("string[pyarrow]")

%timeit ser.str.zfill(5)  # run once each with ser = ser1, ser2, ser3
1.67 ms ± 40.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser1
1.78 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser2
2.16 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- ser3

%timeit ser.str.upper()  # run once each with ser = ser1, ser2, ser3
1.97 ms ± 97.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser1
2.02 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <- ser2
163 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # <- ser3
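
The same comparison can be wrapped in a small helper so that other .str methods are easy to try. This is only a sketch: bench and series_by_dtype are hypothetical names, the data uses a seeded generator instead of np.random.randint, the baseline is forced to object dtype, and the pyarrow entry is dropped when pyarrow is not installed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
non_padded = pd.Series(rng.integers(100, 99999, size=10_000))
# Force an object-dtype baseline regardless of what astype(str) returns.
as_object = non_padded.astype(str).astype(object)

series_by_dtype = {
    "object": as_object,
    "string[python]": as_object.astype("string[python]"),
}
try:
    series_by_dtype["string[pyarrow]"] = as_object.astype("string[pyarrow]")
except ImportError:
    pass  # pyarrow is optional here


def bench(method, *args):
    """Return the best per-call time in seconds of ser.str.<method>(*args) per dtype."""
    timings = {}
    for name, s in series_by_dtype.items():
        call = lambda s=s: getattr(s.str, method)(*args)
        timings[name] = min(timeit.repeat(call, number=10, repeat=3)) / 10
    return timings


for method, args in [("zfill", (5,)), ("upper", ())]:
    for name, seconds in bench(method, *args).items():
        print(f"{method:>6} {name:<16} {seconds * 1e3:.3f} ms")
```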

jbrockmendel · 2023-04-20T23:21:41Z

xref #49398

jorisvandenbossche · 2024-07-29T15:05:22Z

For the str part of this issue:

> Another option is to interpret dtype=str as dtype=pd.StringDtype()

With PDEP-14 accepted, the idea is that dtype=str will be an alias for the new future default string dtype (i.e. pd.StringDtype(na_value=np.nan)).
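
A small sketch of the distinction, under the assumption that the installed pandas accepts the na_value keyword mentioned above (older releases do not, and the code falls back in that case): the existing pd.StringDtype() uses pd.NA as its missing value, while the variant named here uses np.nan.

```python
import numpy as np
import pandas as pd

# Existing nullable string dtype: missing values are pd.NA.
na_variant = pd.StringDtype()
print("pd.StringDtype().na_value ->", na_variant.na_value)

try:
    # Assumption: this keyword only exists in newer pandas releases.
    nan_variant = pd.StringDtype(na_value=np.nan)
except (TypeError, ImportError):
    nan_variant = None
    print("this pandas version does not accept na_value; skipping")
else:
    ser = pd.Series(["foo", None], dtype=nan_variant)
    print(repr(ser.dtype), "missing value ->", ser.iloc[1])
```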

jbrockmendel added the Bug and Needs Triage labels Apr 4, 2023

mroeschke added the Dtype Conversions, Strings, Deprecate, and Arrow labels and removed the Bug and Needs Triage labels Apr 5, 2023

jbrockmendel mentioned this issue Apr 13, 2023: Make pyarrow a required dependency #52509 (Closed)

Dr-Irv mentioned this issue Apr 20, 2023: PDEP-10: Add pyarrow as a required dependency #52711 (Merged)

topper-123 mentioned this issue May 10, 2023: API: should dtype=str return array of dtype StringDtype for pandas 2.0? #49398 (Closed)
