PDEP-10: Add pyarrow as a required dependency #52711

mroeschke · 2023-04-17T21:57:26Z

closes Make pyarrow a required dependency #52509

I've tried to summarize the motivations and drawbacks from #52509 in this PDEP. Please let me know if there are any reasons on either side that I am missing.

Feel free to continue the discussion in #52509, but as a reminder the voting will take place here.

cc @pandas-dev/pandas-core

twoertwein · 2023-04-17T23:14:14Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+by PyArrow including:
+
+- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations
+- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as decimal, binary, and nested types


I might be too optimistic, but having pyarrow as a required dependency has the potential to make the c/cython-code for read_csv and read_json obsolete (if they are on par and similarly fast).

jreback · 2023-04-18

That would be a compile-time dependency, which we are not contemplating at the current time; we could possibly propose it in the future.

simonjayhawkins · 2023-04-18T10:39:20Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+- The minimum version of PyArrow will be bumped every major pandas release to the highest
+  PyArrow version that has been released for at least 2 years.


Using the major.minor.patch terminology, a major release cycle could be 2-3 years (ignoring for now the proposal by some to make this more frequent) and a minor one 6-9 months.

It is not clear here, is the minimum supported version kept for all minor releases in this proposal?

Near the tail end of the major release cycle, the minimum supported version of pyarrow could be 5 years old?

mroeschke · 2023-04-18

> It is not clear here, is the minimum supported version kept for all minor releases in this proposal?

Correct
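
To make the bump rule quoted above concrete, here is a minimal sketch of how the minimum version could be derived at the time of a major pandas release. The function name `minimum_supported`, the 730-day reading of "2 years", and the release dates in the example are illustrative assumptions, not real PyArrow release data or pandas code.

```python
from datetime import date, timedelta
from typing import Dict, Optional, Tuple

Version = Tuple[int, int]


def minimum_supported(
    releases: Dict[Version, date],
    pandas_release: date,
    min_age_days: int = 730,  # assumption: "2 years" read as 730 days
) -> Optional[Version]:
    """Return the highest version released at least `min_age_days` before `pandas_release`."""
    cutoff = pandas_release - timedelta(days=min_age_days)
    eligible = [version for version, released in releases.items() if released <= cutoff]
    return max(eligible) if eligible else None


# Hypothetical release history, for illustration only.
example_releases = {
    (1, 0): date(2020, 1, 1),
    (2, 0): date(2021, 1, 1),
    (3, 0): date(2022, 6, 1),
}
print(minimum_supported(example_releases, date(2023, 6, 1)))  # (2, 0)
```

Under this reading, the floor is fixed at the major release and, per the reply above, held for the minor releases that follow, which is why it can age considerably by the end of a major cycle.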

bashtage · 2023-04-18T10:44:08Z

Will there also be a version cap? NumPy is extremely conservative with breaking changes. I can't recall a case where a cap would have helped avoid issues with NumPy changes, especially deprecations and removals. Is pyarrow similarly stable? Does pyarrow have an implicit or explicit deprecation policy? If not, would there need to be a cap on each release too?

Recently, in a number of projects, I've been downstream of SciPy, which has been doing a lot of long-needed but painful cleanup. This has resulted in fairly recent releases of downstream projects breaking against the most recently released SciPy.

WillAyd · 2023-04-18T18:53:38Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy functionality that would be better suited
+by PyArrow including:
+
+- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations


Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

MarcoGorelli · 2023-07-03

Not sure this comment by Will has been addressed (unless I missed it?)

To make it easier to find: the link is here, and says:

> Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?
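
As a rough illustration of the point under discussion, the sketch below contrasts the availability check an optional dependency needs with the unconditional path a required one allows. `has_module` and `default_string_storage` are hypothetical helpers, not pandas internals, and the sketch does not settle the separate question of choosing between pyarrow-backed and NumPy-backed results when both are installed.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Hypothetical helper: True if a top-level module `name` is importable."""
    return importlib.util.find_spec(name) is not None


def default_string_storage(pyarrow_required: bool) -> str:
    """Hypothetical inference step for string data."""
    if pyarrow_required:
        # Required dependency: no availability check, a single code path.
        return "pyarrow"
    # Optional dependency: the check runs on every inference call.
    return "pyarrow" if has_module("pyarrow") else "object"


print(default_string_storage(pyarrow_required=True))  # pyarrow
```

With the dependency optional, the result of the second branch depends on the environment, so both outcomes have to be handled and tested.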

lithomas1 · 2023-04-18T20:30:23Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+This PDEP proposes that:
+
+- PyArrow becomes a runtime dependency starting pandas 2.1
+- The minimum version of PyArrow supported starting pandas 2.1 is version 7.


Would this version be consistent across the entire pandas API?

e.g. If I wanted to bump the pyarrow version for just the CSV parser to something higher, would I be able to do it?

mroeschke · 2023-04-18

The minimum version would be consistent across the library, but IMO that shouldn't stop development of features that exist in newer versions of pyarrow (we already do this with version checking or try/except)
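
The pattern mentioned here, a single library-wide floor plus per-feature version checks, might look roughly like the following. `MIN_PYARROW`, `parse_version`, and `feature_available` are hypothetical names, and the version strings are passed in as plain text so the sketch runs without pyarrow installed; it is not how pandas implements its checks.

```python
from typing import Tuple

MIN_PYARROW = (7, 0)  # library-wide floor; the PDEP text proposes version 7


def parse_version(text: str) -> Tuple[int, int]:
    """Reduce a version string such as "12.0.1" to (major, minor)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def feature_available(installed: str, needed: str) -> bool:
    """True if `installed` meets a feature's own, possibly higher, requirement."""
    current = parse_version(installed)
    if current < MIN_PYARROW:
        raise ImportError(f"pyarrow>={MIN_PYARROW[0]} is required, found {installed}")
    return current >= parse_version(needed)


print(feature_available("12.0.1", "11.0"))  # True
print(feature_available("7.0.0", "11.0"))   # False
```

A caller would fall back to an older code path when `feature_available` returns False, while anything below the floor fails early.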

attack68 · 2023-04-20T16:31:46Z

In one of the linked issues there is a comment about pandas being backend agnostic. Could this PDEP broach that and consider whether making pyarrow a dependency renders that objective unachievable, or how it would fit into such a concept? This is not clear to me, nor is whether that is indeed a goal.

Primarily, I have used numpy for linear algebra and pandas as an extension for better indexing, data wrangling and output. The added bloat when incorporating it into web apps with limited build space is concerning.

phofl · 2023-04-20T16:35:02Z

This isn't a really useful discussion to have right now (and IMO shouldn't be in scope here). Even if we get to a point where NumPy is optional (this is years away), NumPy is still a required dependency of PyArrow

Dr-Irv

I made some comments, but I'd really like to understand the burden on development if we left things as they are in terms of it being an optional dependency. How much simpler does the code base become if PyArrow is required?

I'm also concerned about the lack of support on Alpine Linux for PyArrow, and maybe we should wait for that before accepting this PDEP.

rhshadrach · 2023-04-21T02:14:59Z

Is it a good comparison to say that trying to develop arrow-backed arrays without pyarrow as a dependency is like trying to develop NumPy-backed arrays/blocks without NumPy as a dependency?

phofl · 2023-07-04T21:30:56Z

Dates and times are 2 additional dtypes that we can infer properly. Added them to the list as well.

numpy.void is not a drop-in replacement for PyArrow structs, even if we would implement it.

Dr-Irv · 2023-07-05T02:17:22Z

I understand, but doing it this way, we don't know how many people are neutral or happy with it. Even if there's a large volume of criticism, there could be an even larger number of users who are neutral or happy with better perf and who are unlikely to go out of their way to make their voices heard. For example, if a million people voice criticism, it would seem like a lot, but maybe not if there were 5 million people happy with it or neutral towards it (there's no one size fits all here); we have no idea what that number would be. That's the point I am trying to make: this issue on the warning wouldn't gather any data on the flip side (which I think is equally important), which IMO would skew the results or make them seem worse than they really are.

IMHO, it's better that we get some feedback, rather than none. The wording as proposed doesn't commit us to saying we will not require pyarrow if we get negative feedback - it just says that we will get the feedback, which gives us the possibility of delaying the requirement based on that feedback.

Dr-Irv

One small comment about the word "horrendous"

phofl · 2023-07-13T14:22:56Z

Reworded the horrendous thing. Let's start the voting now

MarcoGorelli

Right, let's do this. After recent discussions and clarifications, I'm sold: mainly based on superseding object dtype in string, list, and struct dtypes, which would be a real and immediate benefit to users

To emphasise: if accepted, this would not change the default for dtypes which are currently non-object numpy dtypes (that would be a separate discussion). It would only change the default for dtypes which are currently object (which I don't think I'm alone in considering an embarrassment)

Dr-Irv · 2023-07-13T14:40:47Z

> Reworded the horrendous thing. Let's start the voting now

Agreed. I'd like to propose that we use the newly proposed voting procedure for PDEP-1 on this. I've created an issue at #54106 for the core team to place their votes.

phofl · 2023-07-13T14:49:30Z

Not sure whether I mentioned this here already: some reading material with benchmarks for Arrow support/strings:

https://medium.com/p/2891d3d96d2b

phofl · 2023-07-30T15:35:09Z

This was accepted in our vote, so I've updated the status and will merge.

Comments were addressed

mroeschke added 3 commits April 14, 2023 10:53

Start pdep 10

89a3a3b

Merge remote-tracking branch 'upstream/main' into pdep/pyarrow

cf88b43

finish drawbacks, fix other sections

dafa709

mroeschke added the Dependencies (required and optional dependencies), Arrow (pyarrow functionality), and PDEP (pandas enhancement proposal) labels Apr 17, 2023

mroeschke changed the title ~~PDEP: Add pyarrow as a required dependency~~ PDEP-10: Add pyarrow as a required dependency Apr 17, 2023

mroeschke added 2 commits April 17, 2023 15:03

Add number

5e1fbd1

our current version is 7 not 6

44a3321

twoertwein reviewed Apr 17, 2023

simonjayhawkins reviewed Apr 18, 2023

attack68 reviewed Apr 18, 2023

attack68 reviewed Apr 18, 2023

mroeschke added 2 commits April 18, 2023 11:06

Merge remote-tracking branch 'upstream/main' into pdep/pyarrow

ea9f5e3

Clarify and fix typo

fbd1aa0

WillAyd reviewed Apr 18, 2023

lithomas1 reviewed Apr 18, 2023

Dr-Irv reviewed Apr 20, 2023

jbrockmendel mentioned this pull request Apr 20, 2023

API/DEPR: dtype=(str|bytes) interpret as pyarrow #52429

Open

phofl and others added 7 commits April 21, 2023 09:51

Update web/pandas/pdeps/0010-required-pyarrow-dependency.md

6d667b4

Co-authored-by: Irv Lustig <irv@princeton.com>

Update web/pandas/pdeps/0010-required-pyarrow-dependency.md

bed5f0b

Co-authored-by: Irv Lustig <irv@princeton.com>

Update web/pandas/pdeps/0010-required-pyarrow-dependency.md

12622bb

Co-authored-by: Irv Lustig <irv@princeton.com>

Add string as a preferential pyarrow type

864b8d1

Add metric about number of pyarrow import checks

2d4f4fd

Clarify with actual call

bb332ca

Clarify with actual call

a8275fa

Update 0010-required-pyarrow-dependency.md

c3beeb3

MarcoGorelli and others added 4 commits July 5, 2023 07:53

improve structure, list user benefits more clearly, add faq

8347e83

restore little demo

d740403

remove masked part, note that pyarrow dtyeps will likely be ready by 3

959873e

Merge pull request #26 from MarcoGorelli/pdep10-amendments

f936280

Dr-Irv reviewed Jul 6, 2023

Update 0010-required-pyarrow-dependency.md

2db0037

phofl approved these changes Jul 13, 2023

MarcoGorelli approved these changes Jul 13, 2023

Dr-Irv mentioned this pull request Jul 13, 2023

VOTE: Voting Issue for PDEP-10: Add pyarrow as a required dependency #54106

Closed

mattijn mentioned this pull request Jul 17, 2023

Error calling chart.save when using __dataframe__ protocol vega/altair#3109

Closed

adrinjalali mentioned this pull request Jul 19, 2023

Support other dataframes like polars and pyarrow not just pandas scikit-learn/scikit-learn#25896

Open

mroeschke and others added 2 commits July 25, 2023 13:06

Merge branch 'main' into pdep/pyarrow

c2b8cfe

Update 0010-required-pyarrow-dependency.md

4e05151

phofl merged commit 829cf60 into pandas-dev:main Jul 30, 2023

phofl added this to the 2.1 milestone Jul 30, 2023

mroeschke deleted the pdep/pyarrow branch July 31, 2023 16:47

WillAyd mentioned this pull request Aug 2, 2023

ENH: Add Alpine to CI #50511

Closed

3 tasks

jorisvandenbossche mentioned this pull request Aug 28, 2023

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

41 tasks

h-vetinari mentioned this pull request Nov 1, 2023

[Python] usability improvements for a "minimal" pyarrow apache/arrow#38536

Open

MarcoGorelli mentioned this pull request Jan 25, 2024

DISC: Consider not requiring PyArrow in 3.0 #57073

Open

asishm-wk mentioned this pull request Feb 6, 2024

FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466

Open

agriyakhetarpal mentioned this pull request Mar 18, 2024

ENH: out-of-tree Pyodide builds in CI for pandas #57891

Closed

3 tasks

mroeschke commented Apr 17, 2023

twoertwein Apr 17, 2023

jreback Apr 18, 2023

simonjayhawkins Apr 18, 2023 (edited)

mroeschke Apr 18, 2023

bashtage commented Apr 18, 2023

WillAyd Apr 18, 2023

MarcoGorelli Jul 3, 2023 (edited)

lithomas1 Apr 18, 2023

mroeschke Apr 18, 2023

attack68 commented Apr 20, 2023

phofl commented Apr 20, 2023

Dr-Irv left a comment

rhshadrach commented Apr 21, 2023

phofl commented Jul 4, 2023

Dr-Irv commented Jul 5, 2023

Dr-Irv left a comment

phofl commented Jul 13, 2023

MarcoGorelli left a comment (edited)

Dr-Irv commented Jul 13, 2023

phofl commented Jul 13, 2023

phofl commented Jul 30, 2023
