NumPy interop to do list - to_numpy
#14334
Comments
On this subject, could you take another look at #7283 and decide whether it's potentially useful or a definite no-go? |
@stinodego I was looking at the Struct type to-dos and wondering if you guys have seen that NumPy has a similar structure for it: https://numpy.org/doc/stable/user/basics.rec.html Being able to cast Polars struct columns to NumPy structured arrays would be helpful in our current project :) |
We are aware! You can already do this from DataFrames by setting |
It would also be nice to be able to get np arrays with dtype
instead of this exception, in the same way as this works:
pl.DataFrame({"A": "as", "B":1}).to_numpy()
Out[21]: array([['as', 1]], dtype=object) |
Yes, that should be part of our design for converting nested data. |
Regarding the design for nested types, some of my thoughts: For converting Series to NumPy...
For converting DataFrames to NumPy...
Basically, I'm trying to figure out if it's worth going down the rabbit hole of multidimensional arrays, or whether we should keep it simple and have Series be 1D and DataFrames be 2D. That possibly involves changing the behavior for Array types. |
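To make the trade-off above concrete, here is a minimal NumPy-only sketch of the two shapes under discussion. The flat buffer and the width of 3 are made up for illustration, and this is not how Polars implements the conversion.

```python
import numpy as np

# Hypothetical flat buffer behind a fixed-width Array column of width 3,
# holding the rows [1, 2, 3] and [4, 5, 6].
flat = np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)
width = 3

# Option 1: multidimensional output, one row per Series element.
as_2d = flat.reshape(-1, width)

# Option 2: keep the output 1-D by storing each row as a Python object.
as_1d = np.empty(len(as_2d), dtype=object)
for i, row in enumerate(as_2d):
    as_1d[i] = row

print(as_2d.shape)  # (2, 3)
print(as_1d.shape)  # (2,)
print(np.shares_memory(flat, as_2d))  # True
```

The reshape is a view over the same memory, while the 1-D form needs an object array and loses the numeric dtype at the top level.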
Regarding nested types, I have decided that for now it will work as follows:
Everything on the TODO list here has been done, with the exception of masked array support. I will create a separate issue for that one. |
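For context on the remaining masked array item, a rough sketch of what such output could look like on the NumPy side. The values and the validity mask are invented for illustration, and nothing here reflects an existing Polars API.

```python
import numpy as np

values = np.array([1, 2, 3], dtype=np.int64)
# Hypothetical validity mask where True means "not null".
validity = np.array([True, False, True])

# NumPy masks mark the entries to hide, so the validity mask is inverted.
masked = np.ma.masked_array(values, mask=~validity)

print(masked.count())               # 2
print(masked.filled(-1).tolist())   # [1, -1, 3]
```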
@stinodego an approach that makes a lot of sense to me would be to maintain a 1-D array for all Series, use a multi-element
import numpy as np
a = np.array([65535, 256], dtype=np.uint16)
# construct dtype with two u8 elements
dtype = np.dtype([
("first", np.uint8),
("second", np.uint8),
])
b = a.view(dtype)
# array([(255, 255), (0, 1)], dtype=[('first', 'u1'), ('second', 'u1')])
In this case, This could also work for structs with mixed types:
|
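The mixed-type example from the comment above did not survive extraction. As a stand-in, here is a minimal sketch of a NumPy structured dtype with fields of different types; the field names `id` and `score` are made up for illustration.

```python
import numpy as np

# Hypothetical struct fields: an integer "id" and a float "score".
dtype = np.dtype([("id", np.int64), ("score", np.float64)])

# One tuple per row; the result is a 1-D array of records.
rows = np.array([(1, 0.5), (2, 1.5)], dtype=dtype)

print(rows.shape)     # (2,)
print(rows["id"])     # [1 2]
print(rows["score"])  # [0.5 1.5]
```

The array stays 1-D regardless of the number of fields, which matches the "1-D array for all Series" idea in the comment.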
If that is the behavior you want, you can use This type of array is not fit for representing Array types though. Makes sense for Structs. But I don't want it to be the default, e.g. we still need a solution for when |
We've made some improvements to our native to_numpy functionality recently. Making an issue to track what's still left:

- Array: Handle nulls: Series.to_numpy doesn't work for Array types with nulls #14268
- Struct: (?) Unnest and use DataFrame.to_numpy?
- List: (?) Explode and use the offsets as input for np.split?
- Series.to_numpy to handle Decimal/Time types in Rust #14296
- to_numpy #14353
- Series.to_numpy rechunks the series in-place #14340
- Series.to_numpy with timezones #14337
- CopyNotAllowedError #14350
- DataFrame.to_numpy to raise on copy (like Series.to_numpy(zero_copy_only=True)) and to return writeable array.
- Add option to Series.to_numpy to allow structured output - relevant for handling Struct types. I don't think this makes much sense. It is available on the DataFrame level, where it makes more sense. Users can just do .to_frame().to_numpy() to achieve this.
- SeriesView class in Python, handle views differently.
- use_pyarrow parameter and default to native implementation (after native functionality is done)

Now that our to_numpy can handle things properly and zero copy where possible, I'm not sure the NumPy array interface protocol (#14214) is still useful.
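
The List item in the to-do list floats the idea of exploding the column and passing the offsets to np.split. A minimal NumPy-only sketch of that idea, assuming a flat values buffer and Arrow-style offsets (both invented here) and ignoring nulls:

```python
import numpy as np

# Hypothetical exploded List column with rows [1, 2], [], [3, 4, 5].
values = np.array([1, 2, 3, 4, 5], dtype=np.int64)
offsets = np.array([0, 2, 2, 5])

# np.split takes the interior boundaries, so drop the first and last offset.
rows = np.split(values, offsets[1:-1])

print([r.tolist() for r in rows])  # [[1, 2], [], [3, 4, 5]]
```

The result is a Python list of sub-arrays, not a single ndarray, so a real implementation would still have to decide how to wrap it.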