Support converting to NumPy masked arrays #16398

Open
stinodego opened this issue May 22, 2024 · 6 comments · May be fixed by #17577
Labels
A-interop-numpy Area: interoperability with NumPy accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@stinodego
Contributor

NumPy has a masked array concept:
https://numpy.org/doc/stable/reference/maskedarray.html

This type of array consists of a values buffer and a validity buffer. This more closely matches how data is represented in Polars, so it would be good to support.
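As a minimal illustration of that two-buffer layout, using only NumPy (the sample values are made up): the masked array keeps its integer dtype even though one entry is missing. Note that NumPy's mask marks invalid entries, so it is the inverse of a validity buffer.

```python
import numpy as np

# Values buffer: the slot under a missing entry holds an arbitrary placeholder.
values = np.array([1, 2, 0, 4], dtype=np.int64)
# Validity buffer: True means the value is present.
validity = np.array([True, True, False, True])

# NumPy masks mark *invalid* entries, so the validity buffer is inverted.
masked = np.ma.masked_array(values, mask=~validity)

print(masked.dtype)  # the integer dtype is retained
print(masked.sum())  # masked entries are skipped in reductions
```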

There are two main benefits:

  • Some columns can retain a type that is closer to their original when they contain nulls.
    • Integer types with nulls can retain their original type
    • Boolean types can become uint8 instead of object types
    • (Possibly) String types can become the new variable length string type instead of object types
  • We can avoid copying the values buffer for Integers, Datetime, and Duration types with nulls

Note that we will still not be able to convert the validity buffer without a copy, since NumPy boolean arrays store one byte per value (byte-packed), while Polars stores them bit-packed.
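To make the packing difference concrete, here is a small NumPy-only sketch. The bitmap bytes are made up, and least-significant-bit-first ordering is assumed (the Arrow convention). Expanding one bit per value into one byte per value necessarily allocates a new array.

```python
import numpy as np

# Hypothetical bit-packed validity bitmap for 10 values, least-significant bit first:
# bit i set means value i is valid. Only value 2 is null here.
bitmap = np.array([0b11111011, 0b00000011], dtype=np.uint8)
n_values = 10

# Unpack 1 bit per value into 1 byte per value; this creates a new array.
validity = np.unpackbits(bitmap, count=n_values, bitorder="little").astype(bool)

# NumPy convention: True means masked (invalid).
mask = ~validity

print(bitmap.nbytes, mask.nbytes)
```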

API

The desired API would be to add a masked parameter to DataFrame/Series.to_numpy. It defaults to False.

Implementation

We would have to separately convert the values buffer and the validity buffer, and afterwards pass these to the masked array constructor. We should do this in Rust, as there we have direct access to the values buffer.
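The final step can be sketched on the Python side with plain NumPy. This is only an illustration under assumptions: build_masked is a hypothetical helper, and the two input arrays stand in for buffers that the Rust side would hand over. With the constructor's default copy=False, the values array is wrapped rather than copied.

```python
import numpy as np

def build_masked(values: np.ndarray, validity: np.ndarray) -> np.ma.MaskedArray:
    """Hypothetical helper: wrap a values buffer and attach the inverted validity as mask."""
    return np.ma.masked_array(values, mask=~validity)

values = np.arange(5, dtype=np.int32)
validity = np.array([True, False, True, True, False])
masked = build_masked(values, validity)

# The masked array's data is a view on the original values buffer.
print(np.shares_memory(masked.data, values))
print(masked.count())  # number of unmasked (valid) entries
```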

@stinodego stinodego added enhancement New feature or an improvement of an existing feature accepted Ready for implementation A-interop-numpy Area: interoperability with NumPy labels May 22, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog May 22, 2024
@saresend

saresend commented Jun 18, 2024

Hey, I'd be interested in taking a stab at this issue if it's available!

@stinodego
Contributor Author

Hey, I'd be interested in taking a stab at this issue if it's available!

Sure, go ahead!

@saresend

Hey, I'm beginning to look into this and just want to make sure I'm clear about what the source for the mask buffer is. Is this something that you envision being passed as part of the to_numpy function? I.e., I'd be able to write:

x = pl.Series([1, 2, -1, 4]).to_numpy(mask=[0, 0, 1, 0])

Or is there some other input where users should define the mask?

@stinodego
Contributor Author

Or is there some other input where users should define the mask?

The mask is the validity buffer of the Series. The user doesn't define it manually.

@saresend

saresend commented Jul 9, 2024

Hey, I've been very slow to get started on this but finally have some time. A quick question about the Series type: is there a way to access the validity buffer without having to know the underlying datatype of the ChunkedArray?

Also, I wanted to ask about the behavior for arrays that have a null bitmask. I assume this means that all entries are valid, and we should construct the Python array as such?

@dpinol
Contributor

dpinol commented Oct 7, 2024

Hi,
Are there any plans to implement this? Since Polars 1.x, interoperability of nulls with NumPy is impossible because nulls come through as None rather than NaN. Thanks
