Support converting to NumPy masked arrays #16398

stinodego · 2024-05-22T13:36:38Z

NumPy has a masked array concept:
https://numpy.org/doc/stable/reference/maskedarray.html

This type of array consists of a values buffer and a validity buffer. This more closely matches how data is represented in Polars, so it would be good to support.
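As a minimal illustration of the values-plus-mask layout, here is a sketch using only NumPy's `numpy.ma` module (the sample data is made up). Note that NumPy's mask marks invalid entries with True, which is the inverse of a validity buffer, where a set bit means the entry is valid.

```python
import numpy as np

# Values buffer: every slot holds some value, even where the entry is "missing".
values = np.array([1, 2, -1, 4], dtype=np.int64)

# NumPy mask: True marks an entry as masked (invalid).
mask = np.array([False, False, True, False])

arr = np.ma.masked_array(values, mask=mask)

total = arr.sum()        # reductions skip masked entries
filled = arr.filled(0)   # replace masked entries with a fill value
```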

There are two main benefits:

- Some columns can retain a type that is closer to their original when they contain nulls:
  - Integer types with nulls can retain their original type
  - Boolean types can become uint8 instead of object types
  - (Possibly) String types can become the new variable length string type instead of object types
- We can avoid copying the values buffer for Integer, Datetime, and Duration types with nulls
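Both benefits can be seen with plain NumPy. The sketch below assumes a made-up integer column with one null, represented here as a values array plus a separate `is_null` array (both names are illustrative, not Polars API).

```python
import numpy as np

values = np.array([1, 2, 3], dtype=np.int64)
is_null = np.array([False, True, False])

# Without a mask, a missing integer has to be encoded as NaN,
# which forces a cast (and therefore a copy) to float64.
as_float = np.where(is_null, np.nan, values)

# With a mask, the original integer dtype is kept ...
masked = np.ma.masked_array(values, mask=is_null)

# ... and the masked array's data is a view on the original values buffer.
shares = np.shares_memory(masked.data, values)
```

Here `as_float` has dtype float64, while `masked` stays int64 and shares memory with `values`.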

Note that we will still not be able to convert the validity buffer without a copy, since the mask is a boolean array, and boolean arrays are byte-packed in NumPy (one byte per value), while they are bit-packed in Polars.
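To make the bit-packed versus byte-packed difference concrete, here is a rough sketch of expanding a validity bitmap into a NumPy mask. It assumes Arrow-style packing (least-significant bit first, 1 means valid); `validity_to_mask` is a hypothetical helper name, not an existing Polars or NumPy function.

```python
import numpy as np

def validity_to_mask(bitmap: bytes, length: int) -> np.ndarray:
    """Expand a bit-packed validity bitmap into a NumPy boolean mask.

    Assumes Arrow-style packing: least-significant bit first, 1 = valid.
    """
    packed = np.frombuffer(bitmap, dtype=np.uint8)
    # One output byte per input bit: this expansion is the unavoidable copy.
    valid = np.unpackbits(packed, count=length, bitorder="little").astype(bool)
    # NumPy masks use True for invalid entries, so invert the validity bits.
    return ~valid

# Four values, the third one null: bits (LSB first) are 1, 1, 0, 1.
mask = validity_to_mask(bytes([0b00001011]), length=4)
```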

API

The desired API would be to add a masked parameter to DataFrame/Series.to_numpy. It defaults to False.

Implementation

We would have to convert the values buffer and the validity buffer separately, and afterwards pass both to the masked array constructor. We should do this in Rust, as there we have direct access to the values buffer.

saresend · 2024-06-18T00:23:59Z

Hey, I'd be interested in taking a stab at this issue if it's available!

stinodego · 2024-06-18T07:10:36Z

> Hey, I'd be interested in taking a stab at this issue if it's available!

Sure, go ahead!

saresend · 2024-06-20T03:36:12Z

Hey, I'm beginning to look into this and just want to make sure I'm clear about what the source for the masked buffer is. Is this something that you envision being passed as part of the to_numpy function? I.e., I'd be able to write:

x = pl.Series([1,2,-1,4]).to_numpy(mask = [0, 0, 1, 0])

Or is there some other input where users should define the mask?

stinodego · 2024-06-20T08:22:41Z

> Or is there some other input where users should define the mask?

The mask is the validity buffer of the Series. The user doesn't define it manually.

saresend · 2024-07-09T16:42:40Z

Hey, I've been very slow to get started on this but finally have some time. A quick question about the Series type: is there a way to access the validity buffer without having to know the underlying data type of the ChunkedArray?

Also, I wanted to ask about the behavior for arrays whose validity bitmask is null. I assume this means that all entries are valid, and we should construct the Python array as such?

dpinol · 2024-10-07T12:44:13Z

Hi,
Are there any plans to implement this? Since Polars 1.x, interoperability of nulls with NumPy is impossible: they come through as None, not NaN. Thanks.

stinodego added the enhancement, accepted, and A-interop-numpy labels May 22, 2024

stinodego added this to Backlog May 22, 2024

github-project-automation bot moved this to Ready in Backlog May 22, 2024

stinodego mentioned this issue May 22, 2024

NumPy interop to do list - to_numpy #14334

Closed

14 tasks

stinodego mentioned this issue Jun 21, 2024

DataFrame.to_numpy() converts None to nan #17113

Closed

2 tasks

saresend linked a pull request Jul 11, 2024 that will close this issue

feat[python]: Add masked array support to numpy interop API #17577

Open
