Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add ArrowDype and .array.ArrowExtensionArray to top level #47818

Merged
merged 10 commits into from
Aug 16, 2022

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Jul 22, 2022

Moved ArrowIntervalType and ArrowPeriodType to a separate file arrow/extension_types.py to avoid a circular import.

Will add documentation in a separate PR #47854

@mroeschke mroeschke added the Arrow pyarrow functionality label Jul 22, 2022
@mroeschke mroeschke added this to the 1.5 milestone Jul 22, 2022
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally not fully convinced that we should already expose this publicly at the top-level, without first discussing how we think to integrate this longer term (or at least not without very explicitly marking this as an experimental feature that we don't necessarily intend to keep this way long term, while already allowing people to try it out and give feedback).

For example, I am -1 on deprecating the StringDtype(storage="arrow"). This model allows it to be more an implementation detail (where most users don't need to be aware whether their pandas columns / dtypes are backed by numpy or arrow). (although it is already not fully the case, as a string[python] dtype isn't equal to a string[pyarrow]

@mroeschke
Copy link
Member Author

  1. Definitely will mark ArrowDtype experimental for 1.5 (was planning on the subsequent doc PR)

  2. At least for 1.5, I would like users to be able to create other arrow-backed dtypes from the constructor e.g. Series(..., dtype=int64[pyarrow]) i.e. mimic what is possible now with the string[pyarrow] functionality.

  3. I'm fine with not deprecating things in 1.5. I would just like to highlight that "string[pyarrow]" can be understood by both StringDtype(storage="pyarrow") and ArrowDtype, and at a high level, the functionality of ArrowExtensionArray can be superseded by StringArrowArray in the future if the str methods are moved over.

@mroeschke
Copy link
Member Author

mroeschke commented Jul 25, 2022

ArrowDtype I think should be included in the top level because arrow-types-with-parameters cannot always be constructed from a string e.g.

In [5]: dtype = pd.core.arrays.arrow.dtype.ArrowDtype(pa.timestamp(unit="s", tz="UTC"))

In [6]: pd.Series([datetime.datetime.now()], dtype=dtype)
Out[6]:
0    2022-07-25 10:58:58+00:00
dtype: timestamp[s, tz=UTC][pyarrow]

ArrowExtensionArray I only really included since all the arrays are exposed in pandas.arrays. One use case for exposing this in pandas.arrays is if the user data has an arrow array object they want to bring into pandas e.g.

In [7]: pa_array = pa.array([1, 2])

In [8]: pd.Series(pd.core.arrays.arrow.ArrowExtensionArray(pa_array))
Out[8]:
0   1
1   2
dtype: int64[pyarrow]

#47854 would be a start to the documentation of the functionality I was hoping to demonstrate

@mroeschke mroeschke changed the title ENH: Add ArrowDype and .array.ArrowExtensionDtype to top level ENH: Add ArrowDype and .array.ArrowExtensionArray to top level Jul 25, 2022
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 26, 2022

I'm fine with adding these to the top level (reasoning: just having support for storage of these types in a DataFrame is helpful, even if many operations fail). Agreed with Joris that we shouldn't deprecate StringDtype(storage="arrow"), at least not yet.

@mroeschke
Copy link
Member Author

Okay agreed. Considering deprecation for ArrowStringArray/StringDtype(storage="pyarrow") is definitely out of scope here and the near term.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. i agree exposing these at the top level is tantamount to what we do with other types.

@mroeschke
Copy link
Member Author

@jorisvandenbossche as discussed in the dev meeting today, documentation noting that this is experimental was made in 4e3b3b9 in #47854

@mroeschke
Copy link
Member Author

Will probably merge this PR this week if no further comments.

@mroeschke mroeschke mentioned this pull request Aug 15, 2022
@mroeschke
Copy link
Member Author

Going to merge. Can address any followups if needed in #47854

@mroeschke mroeschke merged commit ebbee0b into pandas-dev:main Aug 16, 2022
@mroeschke mroeschke deleted the arrow/api branch August 16, 2022 18:28
YYYasin19 pushed a commit to YYYasin19/pandas that referenced this pull request Aug 23, 2022
…s-dev#47818)

* ENH: Add ArrowDype and .array.ArrowExtensionDtype to top level

* ensure string[pyarrow] dispatches to StringDtype for now

* type ignores

* Address availability of Pyarrow

* Address typing
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
…s-dev#47818)

* ENH: Add ArrowDype and .array.ArrowExtensionDtype to top level

* ensure string[pyarrow] dispatches to StringDtype for now

* type ignores

* Address availability of Pyarrow

* Address typing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants