Access memory address of start of chunk #6320

Closed
Tracked by #5662
stinodego opened this issue Jan 19, 2023 · 8 comments · Fixed by #6385 or #6522
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@stinodego
Member

stinodego commented Jan 19, 2023

Required for #5662

In order to finish the DataFrame Interchange Protocol, we need to be able to specify the memory address of where a chunk starts.

Definition stated in the protocol: "Pointer to start of the buffer as an integer."

This should be available as a method on the PySeries Rust object. It does not need to be a method of the Series class - this is strictly for use in the interchange protocol.
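To illustrate what "pointer to start of the buffer as an integer" means in practice, here is a minimal sketch using only the Python standard library. The `array.array` object is a stand-in for one contiguous chunk and is not a Polars buffer; the variable names are made up for the example.

```python
import array
import ctypes

# Stand-in for one contiguous chunk of int64 values (not a Polars buffer).
chunk = array.array("q", [10, 20, 30])

# buffer_info() returns (address, number_of_items); the address is a plain int.
address, n_items = chunk.buffer_info()
bufsize = n_items * chunk.itemsize  # buffer size in bytes

# Reading back through the raw address shows that the integer really
# points at the first element of the buffer.
first = ctypes.c_int64.from_address(address).value
print(hex(address), bufsize, first)
```

A consumer of the protocol would receive such an integer together with the buffer size and dtype, and must keep the producing object alive for as long as it reads from that address.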

@stinodego
Member Author

@ritchie46 Looks like this only works for Series with an integer dtype. I tried with boolean and string types and I get a PanicException: not implemented.

Is there a chance we can get this to work for all dtypes that are supported by the interchange protocol? So no nested types.

@stinodego stinodego reopened this Jan 28, 2023
@ritchie46
Member

Boolean we can still add, but for other data types it doesn't make sense. A string column, for instance, is not really useful without its offsets: it is represented as a list of bytes, i.e. it is nested.

@ritchie46
Member

On second thought, I also think the boolean buffer is useless without an offset into that array.

@stinodego
Member Author

stinodego commented Jan 28, 2023

For string columns, I create an offsets buffer like this (maybe this is completely off the mark, but it made sense to me):

offsets = (
    self._col.str.n_chars()
    .fill_null(0)
    .cumsum()
    .extend_constant(None, 1)
    .shift_and_fill(1, 0)
    .rechunk()
)

Of course, it would be more efficient if I could just access the underlying offsets buffer.

Not sure why boolean buffers would need offsets, since they are not variable length?

@ritchie46
Member

But shouldn't it be zero copy?

The booleans are represented as a bitmask in a byte slice. Given an array of bytes, you need to know at which bit of the first byte the data starts and how many bits are valid (i.e. the length).
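A small sketch of why a bit-packed boolean buffer needs a bit offset and a length, assuming Arrow-style least-significant-bit-first ordering. `pack_bits` and `read_bits` are hypothetical helpers written for this example, not Polars or pyarrow functions.

```python
def pack_bits(values):
    """Pack booleans into bytes, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def read_bits(buf, offset, length):
    """Decode `length` booleans starting at bit `offset` of `buf`."""
    return [
        bool((buf[(offset + i) // 8] >> ((offset + i) % 8)) & 1)
        for i in range(length)
    ]


values = [True, False, True, True, False, False, True, False, True, True]
buf = pack_bits(values)  # 10 values occupy 2 bytes

# A slice of 5 values starting at index 3 shares the same bytes; only the
# bit offset and the length say which bits belong to it.
sliced = read_bits(buf, offset=3, length=5)
print(sliced == values[3:8])
```

Because a slice can start in the middle of a byte, the address of the first byte alone cannot describe it.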

@stinodego
Member Author

But shouldn't it be zero copy?

I admit I got a bit lazy when I saw all the corners that were cut in the pandas implementation; indeed, that code is not zero copy. But I think we're not going to finish this without cutting a few corners ourselves.

I think I'm going to go a different route for the interchange for now: I'll write something that uses the pyarrow implementation of the protocol, and raise an error when the user requires zero copy and the data contains categoricals.

And then we can work our way up from there.

@ritchie46
Member

For string columns, I create an offsets buffer like this (maybe this is completely off the mark, but it made sense to me):

offsets = (
    self._col.str.n_chars()
    .fill_null(0)
    .cumsum()
    .extend_constant(None, 1)
    .shift_and_fill(1, 0)
    .rechunk()
)

Of course, it would be more efficient if I could just access the underlying offsets buffer.

Not sure why boolean buffers would need offsets, since they are not variable length?

The offsets are measured in bytes, not in chars. I can give access to that array. Pyarrow could also give it.

Maybe it is good to use pyarrow and polars as a hybrid for this.
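The byte-versus-character distinction can be shown with a short standard-library sketch. It assumes UTF-8 encoded string data and nulls that take up zero bytes; the sample strings and variable names are invented for the example.

```python
from itertools import accumulate

# "h\u00e9llo" is 5 characters but 6 UTF-8 bytes; "\u65e5\u672c" is 2 characters but 6 bytes.
strings = ["h\u00e9llo", None, "\u65e5\u672c", ""]

# Nulls contribute zero bytes to the data buffer.
encoded = [(s or "").encode("utf-8") for s in strings]

# n + 1 offsets: value i occupies data[offsets[i]:offsets[i + 1]].
byte_offsets = [0, *accumulate(len(b) for b in encoded)]
char_offsets = [0, *accumulate(len(s or "") for s in strings)]

data = b"".join(encoded)
print(byte_offsets, char_offsets)

# Only the byte offsets slice the data buffer correctly.
third = data[byte_offsets[2]:byte_offsets[3]].decode("utf-8")
```

Offsets built from character counts agree with byte offsets only for pure-ASCII data, which is why a character-count approach can look correct in simple tests.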

@stinodego
Member Author

stinodego commented Jan 28, 2023

Yeah I should've used str.lengths.

I can give access to that array.

I'll get back to you on this; let me wrestle with the pyarrow thing for a bit! pyarrow 11.0.0 was released recently, and it includes the protocol.
