ENH: Pandas StructDType #40652

Hoeze · 2021-03-27T02:31:40Z

I searched for a while for existing discussions of nested structures in Pandas, but I couldn't find a matching issue.

Is your feature request related to a problem?

Currently, there is no way to work with arbitrarily nested data types in Pandas.
Spark and PyArrow have StructTypes, and NumPy has compound (structured) dtypes.
However, when we try something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({"pos": [1, 2, 3], "val": ["a", "b", "c"]})
struct = df.to_records(index=False).view(type=np.ndarray, dtype=list(df.dtypes.items()))
pd.Series(struct)

Then we only get an error:

ValueError: Cannot construct a Series from an ndarray with compound dtype.  Use DataFrame instead.

This would be useful for a number of use cases:

Direct mapping between PySpark and Pandas nested columns
Easy creation of custom data types; type checking of nested columns
Efficient serialization with Arrow

Describe the solution you'd like

My wish would be to have a generic Pandas StructDType that:

can be composed of any other Pandas DType
allows conversion to PyArrow StructType back and forth
can be written to Parquet

A perfect example is the IntervalDType/IntervalArray that is already implemented in Pandas:

pandas/pandas/core/arrays/interval.py

Line 146 in 2198f51

class IntervalArray(IntervalMixin, ExtensionArray):

In my opinion, its implementation is a special case of a Struct dtype.
It also supports conversion to and from PyArrow (see 2198f51).
Therefore, by generalizing the IntervalDType to use any number of subtypes, we would have the StructDType implementation ready.

API breaking implications

to_csv() and similar writers could have difficulty storing nested data.
That may be a follow-up problem to solve.

Describe alternatives you've considered

One can try to construct the Series as a list of tuples.
However, this has two drawbacks:

No type checking
to_parquet() fails
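A small sketch of the first drawback, assuming only that pandas is installed. A Series built from tuples falls back to the generic object dtype, so nothing enforces the field types or even their order. The names `ser` and `mixed` are illustrative.

```python
import pandas as pd

# Tuples are stored as opaque Python objects, so the dtype carries no schema.
ser = pd.Series([(1, "a"), (2, "b"), (3, "c")])
print(ser.dtype)

# Nothing stops rows from disagreeing about field types.
mixed = pd.Series([(1, "a"), ("b", 2.5), (3, None)])
print(mixed.dtype)
```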


Hoeze · 2021-03-27T02:50:19Z

An example of converting a PyArrow struct type to Pandas:

import pyarrow as pa

x = pa.array([(1, 'a'), (2, 'b'), (3, 'c')], type=pa.struct([("idx", pa.int32()), ("val", pa.string())]))
print(x)
# <pyarrow.lib.StructArray object at 0x7f103b132c20>
# -- is_valid: all not null
# -- child 0 type: int32
#   [
#     1,
#     2,
#     3
#   ]
# -- child 1 type: string
#   [
#     "a",
#     "b",
#     "c"
#   ]

print(x.to_pandas())
# 0    {'idx': 1, 'val': 'a'}
# 1    {'idx': 2, 'val': 'b'}
# 2    {'idx': 3, 'val': 'c'}
# dtype: object

As you can see, it gets converted to a Series of python dictionaries.
This is super inefficient and also difficult to work with.

jreback · 2021-03-27T12:19:21Z

pls link to existing issues
eg ListDtype and nested dtypes

Hoeze · 2021-03-27T12:56:32Z

@jreback here is the link to the ListDtype issue:
#35176

I cannot find any issue on a nested dtype.
That's why I opened this one.

JulianWgs · 2021-10-24T15:54:33Z

For reference: cuDF (a GPU implementation of Pandas) now supports StructDtype (Link).

Hoeze · 2022-01-31T22:00:36Z

Hi all, I tried to implement a StructDtype in #45745 and would be very happy if people would comment on it a bit :)

jbrockmendel · 2023-04-10T20:49:05Z

@mroeschke can this go in the "use pd.ArrowDtype" pile?

mroeschke · 2023-04-10T20:51:21Z

Yeah definitely

Hoeze · 2023-05-14T19:17:33Z

Hi @mroeschke, I have a few free hours and was wondering what the state of arbitrary Arrow types in Pandas is today.
Is it still worth updating my PR #45745?

mroeschke · 2023-05-15T16:57:56Z

A "StructDtype" is able to be used via pyarrow.map_ in ArrowDtype so I don't think we need a separate implementation anymore. Thanks for checking in though!

TomAugspurger · 2023-09-19T13:58:02Z

I think this can be closed.

#54938 has added a .struct accessor on top of the arrow struct dtype.

Hoeze added the Enhancement and Needs Triage labels Mar 27, 2021

mroeschke added the ExtensionArray and Nested Data labels and removed the Needs Triage label Mar 28, 2021

Hoeze mentioned this issue Jan 31, 2022

Implement nested types: Add StructDtype and StructArray #45745

Closed

4 tasks

NazyS mentioned this issue Feb 8, 2022

BUG: cannot read back columns of dtype interval[datetime64[ns]] from parquet file or pyarrow table #45881

Closed

3 tasks

jbrockmendel mentioned this issue Apr 20, 2023

Make pyarrow a required dependency #52509

Closed

jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jul 27, 2023

TomAugspurger closed this as completed Sep 19, 2023
