PDEP-10: Add pyarrow as a required dependency #52711
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,160 @@ | ||
# PDEP-10: PyArrow as a required dependency for default string inference implementation | ||
|
||
- Created: 17 April 2023 | ||
- Status: Under discussion | ||
- Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711) | ||
[#52509](https://github.com/pandas-dev/pandas/issues/52509) | ||
- Author: [Matthew Roeschke](https://github.com/mroeschke) | ||
[Patrick Hoefler](https://github.com/phofl) | ||
- Revision: 1 | ||
|
||
## Abstract | ||
|
||
This PDEP proposes that: | ||
|
||
- PyArrow becomes a required runtime dependency starting with pandas 3.0 | ||
- The minimum version of PyArrow supported starting with pandas 3.0 is version 7 of PyArrow. | ||
- When the minimum version of PyArrow is bumped, PyArrow will be bumped to the highest version that has | ||
been released for at least 2 years. | ||
- The pandas 2.1 release notes will have a big warning that PyArrow will become a required dependency starting | ||
with pandas 3.0. | ||
- Starting in pandas 2.2, pandas raises a ``FutureWarning`` on import when PyArrow is not installed in the user's | ||
environment. This will ensure that only one warning is raised and users can | ||
easily silence it if necessary. | ||
- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string` | ||
instead of `object` | ||
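
The import-time check described in the list above could look roughly like the following. This is a minimal sketch, not the pandas implementation: the helper name `warn_if_missing` and the warning text are hypothetical, and only standard-library calls are used.

```python
import importlib.util
import warnings


def warn_if_missing(module_name: str = "pyarrow") -> bool:
    """Emit a single FutureWarning if ``module_name`` cannot be found.

    Hypothetical helper for illustration; returns True if a warning was raised.
    """
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module_name) is None:
        warnings.warn(
            f"{module_name} will become a required dependency in a future release.",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False


# Users who cannot install the dependency yet could silence the warning
# with a standard filter before the check runs (message text is an assumption):
# warnings.filterwarnings(
#     "ignore", category=FutureWarning, message=".*required dependency.*"
# )
```

Running the check once at import time is what keeps it to a single warning per session, rather than one per operation.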
|
||
|
||
## Background | ||
|
||
PyArrow is an optional dependency of pandas that provides a wide range of supplemental features to pandas: | ||
|
||
- Since pandas 0.21.0, PyArrow provided I/O reading functionality for Parquet | ||
- Since pandas 1.2.0, pandas integrated PyArrow into the `ExtensionArray` interface to provide an | ||
optional string data type backed by PyArrow | ||
- Since pandas 1.4.0, PyArrow provided I/O reading functionality for CSV | ||
- Since pandas 1.5.0, pandas provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow | ||
data types within the `ExtensionArray` interface | ||
- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods | ||
now utilize PyArrow compute functions to | ||
accelerate PyArrow-backed data in pandas, notably string and datetime types. | ||
|
||
As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as: | ||
|
||
1. Consistent `NA` support for all data types | ||
2. Broader support of data types such as `decimal`, `date` and nested types | ||
|
||
Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type | ||
is `object`. With pyarrow string support available since 1.2.0, requiring pyarrow for 3.0 will allow pandas to default | ||
the inferred type to the more efficient pyarrow string type. | ||
|
||
```python | ||
In [1]: import pandas as pd | ||
|
||
In [2]: pd.Series(["a"]).dtype | ||
# Current behavior | ||
Out[2]: dtype('O') | ||
|
||
# Future behavior in 3.0 | ||
Out[2]: string[pyarrow] | ||
``` | ||
|
||
|
||
## Motivation | ||
|
||
While all the functionality described in the previous paragraph is currently optional, PyArrow has significant | ||
integration into many areas of pandas. With our roadmap noting that pandas strives for better Apache Arrow | ||
interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with | ||
the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow | ||
ecosystem to pandas users. | ||
|
||
Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy | ||
functionality that would be better served by PyArrow, including: | ||
|
||
- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations | ||
Are there any small code samples we can add to drive this point home? I think we would still make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

not sure this comment by Will has been addressed (unless I missed it?) to make it easier to find: the link is here, and says:
|
||
|
||
- Removing redundant functionality: | ||
- fastparquet engine in `read_parquet` | ||
- potentially simplifying the `read_csv` logic (needs more investigation) | ||
|
||
- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as: | ||
- decimal | ||
- binary | ||
- nested types (list or dict data) | ||
- strings | ||
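
To make the first bullet above concrete, the sketch below contrasts inference with and without an availability check. It is a simplified illustration, not pandas' inference code: `infer_storage_optional` and `infer_storage_required` are hypothetical names, and the `pyarrow_installed` flag stands in for an import check. The data-dependent branch (are all values strings?) remains in both versions; only the availability branch disappears.

```python
def _all_strings(values: list) -> bool:
    """True only for a non-empty list whose elements are all ``str``."""
    return bool(values) and all(isinstance(v, str) for v in values)


def infer_storage_optional(values: list, pyarrow_installed: bool) -> str:
    """Optional dependency: two conditions decide the result."""
    if _all_strings(values) and pyarrow_installed:
        return "pyarrow"
    return "object"


def infer_storage_required(values: list) -> str:
    """Required dependency: only the data decides the result."""
    return "pyarrow" if _all_strings(values) else "object"


print(infer_storage_optional(["a", "b"], pyarrow_installed=False))
print(infer_storage_required(["a", "b"]))
```

With an optional dependency the same input can produce different results in different environments, which is the inconsistency the bullet is about.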
|
||
Out of this group, strings offer the most advantages for users. They use significantly less memory and are faster: | ||
Haven't kept up with this, but how are the plans to add the new numpy string dtype (xref #47884 ) going to affect the rationale here? I would assume performance of the numpy string dtype would be on par with the pyarrow one.

this is still years away #52711 (comment) I can't remember the perf comparison - @ngoldbaum do you want to comment here?

The linked comment said that numpy strings are available "within a year or so".

I interpreted that as "ready within numpy" - adding in extra time to make them available in pandas, plus accounting for Hofstadter's law, "years away" seems realistic
(Nathan - we discussed timelines before, but I didn't write them down so have forgotten them, apologies)

I hope it doesn't take that long!
The earliest pandas could officially support the dtype I'm working on is after the release of NumPy 2.0 - currently scheduled for January 2024. This assumes the new dtype API is available for downstream use in NumPy 2.0 without needing to set an environment variable. I'm hoping to start shipping experimental support in pandas behind the environment variable after NumPy 1.25 is released this summer, as that version of NumPy will hopefully have a version of the new dtype API that is usable for pandas' needs. The version in NumPy 1.24 is broken and is missing a lot of features we've added since that release, unfortunately.

The memory usage should be comparable with pyarrow strings. Both are storing UTF-8 bytestreams internally. I don't know offhand if Arrow uses the small string optimization (storing the string content in the space normally reserved for a pointer to the string). It's difficult to compare memory usage exactly since the operating system facilities for this only allow you to measure the peak memory usage of a process and not all allocations necessarily use Python's allocation tracking machinery. I'm hoping to do a more careful memory usage benchmark as part of the NEP I'm writing.
The main difference in the storage is that right now I'm using individual heap allocations for each string array entry. Arrow just does a single allocation for all the array entries and has a secondary array of offsets to find the data for each string array element. I've thought a bit about following that approach, but it would mean we would have to either disallow mutating string arrays or have pathological behavior where enlarging a single array element could cause the entire array to get reallocated. It would also be nice to be able to use the short string optimization; Arrow's approach with an array of offsets would make that more difficult.
For performance, do you mean for string manipulation operations like case folding or padding? In principle NumPy could add string ufuncs that would allow for fast implementations, but right now NumPy doesn't have a namespace for that. Currently, all the comparison operators are implemented as ufuncs, but no other string functionality is. There are string manipulation functions in the
I don't want to promise that string ufuncs definitely will happen in the future, but there are no real technical blockers, just social ones. NumPy doesn't currently have any ufuncs that only make sense with string data, so some thought needs to go into where in the namespace they should go.
It will also require a decent amount of implementation work to add the functions, although mostly just tedious C coding. Overall the goal is to facilitate a straightforward transition from workflows that used object string arrays while enabling possible performance improvements in the future that are currently impossible with object string arrays.

Generally, pandas is moving away from mutability in some sense (the CoW adoption), so that isn't very high on my priority list. While a storage-efficient string dtype is nice, this is kind of pointless if the operations aren't fast from a pandas PoV. One of the biggest advantages of Arrow is that we can reduce memory but also that most operations are significantly faster; depending on what you are doing it can be an order of magnitude. I am referring mostly to stuff like the
So even if NumPy strings are ready in around a year (or some other time period), that's not helpful for us as long as NumPy does not ship fast algorithms on top of it. Sorry if this sounds harsh, that wasn't my intention. But having the string dtype without algorithms gets us only half the way compared to what PyArrow does, so this isn't a compelling argument to avoid making Arrow strings the default.

As a minimum, a fast regex engine could potentially help as some of the str accessor functions were (maybe still are) implemented using regex for string[pyarrow] where the functions did not exist in PyArrow (or the minimum version supported at the time).

Are you suggesting to implement this in pandas? That's something I personally don't have any interest in doing and would also be at least 0- on adding for the time being. Having this stuff in Arrow is nice since it reduces maintenance burden and also gives better test coverage, since more libraries will depend on it

I suspect a regex engine would be implemented in NumPy and then any str accessor functions not implemented in NumPy could be implemented using either regex or object fallback in pandas (just like we did for PyArrow initially). |
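
The storage difference discussed in this thread can be sketched in pure Python. This is an illustration of the offsets-plus-single-buffer idea only, not Arrow's or NumPy's implementation; `pack_strings` and `get_string` are hypothetical helper names.

```python
def pack_strings(values: list[str]) -> tuple[bytes, list[int]]:
    """Pack strings into one UTF-8 buffer plus an offsets list.

    Element i occupies data[offsets[i]:offsets[i + 1]].
    """
    data = bytearray()
    offsets = [0]
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))
    return bytes(data), offsets


def get_string(data: bytes, offsets: list[int], i: int) -> str:
    """Decode element i using its start and end offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


data, offsets = pack_strings(["a", "bcd", ""])
print(data, offsets)
print(get_string(data, offsets, 1))
```

Because every element after position `i` is located through its offset, growing one element means shifting the data buffer and rewriting all later offsets, which is the mutation cost raised in the comment above.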
||
|
||
**Performance:** | ||
|
||
```python | ||
import string | ||
import random | ||
|
||
import pandas as pd | ||
|
||
|
||
def random_string() -> str: | ||
    return "".join(random.choices(string.printable, k=random.randint(10, 100))) | ||
|
||
|
||
ser_object = pd.Series([random_string() for _ in range(1_000_000)]) | ||
ser_string = ser_object.astype("string[pyarrow]") | ||
``` | ||
|
||
PyArrow backed strings are significantly faster than NumPy object strings: | ||
|
||
*str.len* | ||
|
||
```python | ||
In[1]: %timeit ser_object.str.len() | ||
118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) | ||
|
||
In[2]: %timeit ser_string.str.len() | ||
24.2 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) | ||
``` | ||
|
||
*str.startswith* | ||
|
||
```python | ||
In[3]: %timeit ser_object.str.startswith("a") | ||
136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) | ||
|
||
In[4]: %timeit ser_string.str.startswith("a") | ||
11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) | ||
``` | ||
|
||
Another advantage is I/O. PyArrow engines in pandas can provide a significant speedup. Currently, the data | ||
are cast to NumPy dtypes, so users who want PyArrow strings must explicitly convert back, and this round trip | ||
hinders performance. | ||
|
||
**Memory** | ||
|
||
PyArrow backed strings use significantly less memory. Dask developers investigated this [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes). | ||
|
||
Short summary: PyArrow strings required 1/3 of the original memory. | ||
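
A rough, standard-library-only estimate shows where such savings come from. This is a back-of-the-envelope sketch, not a measurement of pandas itself: it assumes CPython, 8-byte pointers in an object array, and 32-bit offsets for the Arrow-style layout, and it ignores the validity bitmap. `estimate_object_bytes` and `estimate_arrow_bytes` are hypothetical names, and the actual ratio depends on string lengths and the platform.

```python
import sys

POINTER_SIZE = 8  # assumption: 64-bit pointers in an object array
OFFSET_SIZE = 4   # assumption: 32-bit offsets in the Arrow-style layout


def estimate_object_bytes(values: list[str]) -> int:
    """One Python str object per element plus one pointer per element."""
    return sum(sys.getsizeof(v) for v in values) + POINTER_SIZE * len(values)


def estimate_arrow_bytes(values: list[str]) -> int:
    """One contiguous UTF-8 buffer plus (n + 1) offsets."""
    data_bytes = sum(len(v.encode("utf-8")) for v in values)
    return data_bytes + OFFSET_SIZE * (len(values) + 1)


values = ["pandas", "pyarrow", "numpy"] * 1000
print(estimate_object_bytes(values), estimate_arrow_bytes(values))
```

The per-object header that every Python `str` carries dominates for short strings, so the gap narrows as strings get longer.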
|
||
|
||
## Drawbacks | ||
|
||
Including PyArrow would naturally increase the installation size of pandas. For example, when installing with pip from wheels, | ||
NumPy and pandas require about `70MB`, and including PyArrow requires an additional `120MB`. An increase | ||
of installation size would have negative implications for using pandas in space-constrained development or deployment environments | ||
such as AWS Lambda. | ||
could you elaborate on how this was calculated please? my calculations don't match: I did
python3.10 -m venv newvenv
. newvenv/bin/activate
pip install numpy pytz python-dateutil
du -h newvenv/ --max-depth=1
pip install pandas
du -h newvenv/ --max-depth=1
pip install pyarrow
du -h newvenv/ --max-depth=1
and am seeing:
i.e. we'd be going from 65M to 191M

maybe I was just misinterpreting this. Could you please rephrase to "and including PyArrow requires an additional |
||
|
||
Additionally, if a user is installing pandas in an environment where wheels are not available through a `pip install` or `conda install`, | ||
the user will also need to build Arrow C++ and related dependencies when installing from source. These environments include: | ||
|
||
- Alpine Linux (commonly used as a base for Docker containers) | ||
- WASM (Pyodide and PyScript) | ||
- Python development versions | ||
|
||
Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadence. For example, when | ||
supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version | ||
before releasing a new pandas version. | ||
|
||
### PDEP-10 History | ||
|
||
- 17 April 2023: Initial version | ||
|
||
- 8 May 2023: Changed proposal to make pyarrow required in pandas 3.0 instead of 2.1 | ||
IMO the revision history only needs to include the updates for published PDEPs.

I don't have a real opinion on this, there were some requests that we should include it in this case

just an observation. I'll be downvoting this proposal anyway so no real need to change this.

I think it's worthwhile to indicate any major changes that came about as a result of the discussion |
||
|
||
[^1]: <https://pandas.pydata.org/docs/development/roadmap.html#apache-arrow-interoperability> | ||
[^2]: <https://arrow.apache.org/powered_by/> | ||
|
I recall some discussion we had on this PDEP about having the warning point to a GitHub issue where we could collect feedback on this requirement. If we agree on this concept, I think it should be mentioned here.
Added
Do you mind adding that the warning will point to the feedback issue?
Apologies for the naive question, but I thought the purpose of this PDEP was to collect feedback on pyarrow as a required dependency? I understand visibility may be greater if there's a link to a new issue in a warning in a future version of pandas, but to me, doing it that way (to collect feedback all over again) just seems like we're going to be having this discussion again in 6 months and will end up kicking the can down the road.
I think that issue would only get traction from people who strongly don't want pyarrow. There could be millions of users happy or neutral with the requirement, and we'd only see the 10 people unhappy enough to voice their concerns.