ENH: Pandas StructDType #40652

Closed
Hoeze opened this issue Mar 27, 2021 · 10 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).

Comments

@Hoeze

Hoeze commented Mar 27, 2021

I searched for a while for existing discussions of nested structures in pandas, but couldn't find a matching issue.

Is your feature request related to a problem?

Currently, there is no way to work with arbitrary nested data types in pandas.
Spark and PyArrow have struct types (StructType), and NumPy has structured (compound) dtypes.
However, when we try something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'pos':[1,2,3], "val": ["a","b", "c"]})
struct = df.to_records(index=False).view(type=np.ndarray, dtype=list(df.dtypes.items()))
pd.Series(struct)

Then we only get an error:

ValueError: Cannot construct a Series from an ndarray with compound dtype.  Use DataFrame instead.
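
A minimal, self-contained sketch of the same situation, assuming a hand-built NumPy structured array (the field names `pos` and `val` mirror the example above). It is only meant to illustrate that the Series constructor rejects the compound dtype, while the DataFrame constructor splits it into one column per field:

```python
import numpy as np
import pandas as pd

# A NumPy structured array with two named fields.
struct = np.array(
    [(1, "a"), (2, "b"), (3, "c")],
    dtype=[("pos", "i8"), ("val", "U1")],
)

# The Series constructor is expected to refuse the compound dtype.
try:
    pd.Series(struct)
    series_error = None
except ValueError as exc:
    series_error = str(exc)
print(series_error)

# The DataFrame constructor accepts the same array, one column per field.
frame = pd.DataFrame(struct)
print(frame)
```

The struct survives only as separate columns here, which is the gap this request is about.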

This would be useful for a number of use cases:

  • Direct mapping between PySpark and Pandas nested columns
  • Easy creation of custom data types; type checking of nested columns
  • Efficient serialization with Arrow

Describe the solution you'd like

My wish would be to have a generic Pandas StructDType that:

  • can be composed of any other Pandas DType
  • converts to and from a PyArrow StructType
  • can be written to Parquet

A perfect example is the IntervalDtype/IntervalArray pair that is already implemented in pandas:

class IntervalArray(IntervalMixin, ExtensionArray):

In my opinion, its implementation is a special case of a struct dtype: each interval stores a left and a right value of the same subtype.
It also supports conversion to and from PyArrow (see 2198f51).
Therefore, by generalizing IntervalDtype to use any number of subtypes, we would have the StructDType implementation ready.

API breaking implications

to_csv() and similar writers could have difficulty storing nested data.
That may be a follow-up problem to solve.

Describe alternatives you've considered

One can try to construct the Series as a list of tuples.
However, this has two drawbacks:

  • No type checking
  • to_parquet() fails
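
A short sketch of the first drawback, using made-up values: a Series of tuples falls back to object dtype, so nothing stops a row from having the wrong field types.

```python
import pandas as pd

# The third tuple has swapped/mismatched field types, but pandas accepts it
# because the column is just a container of arbitrary Python objects.
tuples = pd.Series([(1, "a"), (2, "b"), ("oops", 3.5)])

print(tuples.dtype)
```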
@Hoeze Hoeze added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 27, 2021
@Hoeze
Author

Hoeze commented Mar 27, 2021

An example of converting a PyArrow struct type to Pandas:

import pyarrow as pa

x = pa.array([(1, 'a'), (2, 'b'), (3, 'c')], type=pa.struct([("idx", pa.int32()), ("val", pa.string())]))
print(x)
# <pyarrow.lib.StructArray object at 0x7f103b132c20>
# -- is_valid: all not null
# -- child 0 type: int32
#   [
#     1,
#     2,
#     3
#   ]
# -- child 1 type: string
#   [
#     "a",
#     "b",
#     "c"
#   ]

print(x.to_pandas())
# 0    {'idx': 1, 'val': 'a'}
# 1    {'idx': 2, 'val': 'b'}
# 2    {'idx': 3, 'val': 'c'}
# dtype: object

As you can see, it gets converted to an object-dtype Series of Python dictionaries.
This is very inefficient and also difficult to work with.

@jreback
Contributor

jreback commented Mar 27, 2021

Please link to existing issues,
e.g. ListDtype and nested dtypes.

@Hoeze
Author

Hoeze commented Mar 27, 2021

@jreback here is the link to the ListDtype issue:
#35176

I cannot find any issue on a nested dtype.
That's why I opened this one.

@mroeschke mroeschke added ExtensionArray Extending pandas with custom dtypes or arrays. Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 28, 2021
@JulianWgs

For reference: cuDF (a GPU implementation of pandas) now supports StructDtype (Link).

@Hoeze
Author

Hoeze commented Jan 31, 2022

Hi all, I tried to implement a StructDtype in #45745 and would be very happy if people could comment on it :)

@jbrockmendel
Member

@mroeschke can this go in the "use pd.ArrowDtype" pile?

@mroeschke
Member

Yeah definitely

@Hoeze
Author

Hoeze commented May 14, 2023

Hi @mroeschke, I have a few free hours and wondered what the state of arbitrary Arrow types in pandas is today.
Is it still worth updating my PR #45745?

@mroeschke
Member

A "StructDtype" can be used via pyarrow.map_ in ArrowDtype, so I don't think we need a separate implementation anymore. Thanks for checking in though!

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jul 27, 2023
@TomAugspurger
Contributor

I think this can be closed.

#54938 has added a .struct accessor on top of the arrow struct dtype.

6 participants