ENH: Pandas StructDType #40652
Comments
An example of converting a PyArrow struct type to Pandas:
import pyarrow as pa
x = pa.array([(1, 'a'), (2, 'b'), (3, 'c')], type=pa.struct([("idx", pa.int32()), ("val", pa.string())]))
print(x)
# <pyarrow.lib.StructArray object at 0x7f103b132c20>
# -- is_valid: all not null
# -- child 0 type: int32
# [
# 1,
# 2,
# 3
# ]
# -- child 1 type: string
# [
# "a",
# "b",
# "c"
# ]
print(x.to_pandas())
# 0 {'idx': 1, 'val': 'a'}
# 1 {'idx': 2, 'val': 'b'}
# 2 {'idx': 3, 'val': 'c'}
# dtype: object
As you can see, it gets converted to a Series of Python dictionaries. |
pls link to existing issues |
For reference: cuDF (a GPU implementation of the Pandas API) now supports StructDtype (Link). |
Hi all, I tried to implement a StructDtype in #45745 and would be very happy if people could comment on it a bit :) |
@mroeschke can this go in the "use pd.ArrowDtype" pile? |
Yeah definitely |
Hi @mroeschke, I just had some free hours and wondered what the state of arbitrary Arrow types in Pandas is today. |
A "StructDtype" is able to be used via |
I think this can be closed. #54938 has added a |
I searched for a while for existing discussions of nested structures in Pandas, but I couldn't find anything relevant.
Is your feature request related to a problem?
Currently, there is no way to work with arbitrarily nested data types in Pandas.
In Spark and PyArrow, one can have StructTypes. In NumPy, we have compound (structured) data types.
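As a point of comparison, here is a small sketch of a NumPy compound (structured) dtype. The field names idx and val are illustrative and not part of the original request.

```python
import numpy as np

# A structured ("compound") dtype with two named, typed fields.
record_dtype = np.dtype([("idx", np.int32), ("val", "U1")])

# Each element is one record; tuples are mapped to fields by position.
records = np.array([(1, "a"), (2, "b"), (3, "c")], dtype=record_dtype)

# Every field can be pulled out as its own typed array.
idx_column = records["idx"]
val_column = records["val"]
print(records.dtype.names)
print(idx_column.dtype, val_column.dtype)
```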
However, when we try something like this:
Then we only get an error:
This would be useful for a number of use cases:
Describe the solution you'd like
My wish would be to have a generic Pandas StructDType that:
A perfect example is the IntervalDtype/IntervalArray that is already implemented in Pandas: pandas/pandas/core/arrays/interval.py, line 146 in 2198f51
In my opinion, its implementation is a special case of a Struct dtype.
It also supports conversion to and from PyArrow (see 2198f51).
Therefore, by generalizing the IntervalDtype to support any number of subtypes, we would have the StructDType implementation ready.
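To make the "special case of a struct" argument concrete, a small sketch: an IntervalArray behaves like a struct with exactly two fields, left and right, each exposed as its own typed array. The values are made up for illustration.

```python
import pandas as pd

# Build intervals from two parallel arrays: the "left" and "right" fields.
intervals = pd.arrays.IntervalArray.from_arrays([0, 1, 2], [1, 2, 3])

print(intervals.dtype)

# The two components can be read back independently,
# much like the child arrays of a two-field struct.
left_values = list(intervals.left)
right_values = list(intervals.right)
print(left_values, right_values)
```

A generic struct dtype would, under this reading, allow an arbitrary set of named fields instead of the fixed left/right pair.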
API breaking implications
to_csv(), etc. could have difficulties with storing nested data. That may be a follow-up problem to solve.
Describe alternatives you've considered
One can try to construct the Series as a list of tuples.
However, this has two drawbacks:
to_parquet() fails
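The tuple-based workaround can be sketched as follows. This is only an illustration with made-up values. It shows that such a Series ends up with the generic object dtype, so field access has to go through plain Python code.

```python
import pandas as pd

# A Series built from tuples is stored as generic Python objects.
tuples = pd.Series([(1, "a"), (2, "b"), (3, "c")])
print(tuples.dtype)

# Fields have no names or declared types; they are reached by position,
# one Python tuple at a time.
first_fields = tuples.map(lambda t: t[0])
print(first_fields.tolist())
```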