Access memory address of start of chunk #6320

stinodego · 2023-01-19T13:29:41Z

Required for #5662

In order to finish the DataFrame Interchange Protocol, we need to be able to specify the memory address of where a chunk starts.

Definition stated in the protocol: "Pointer to start of the buffer as an integer."

This should be available as a method on the PySeries Rust object. It does not need to be a method of the Series class - this is strictly for use in the interchange protocol.
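To illustrate what "pointer to start of the buffer as an integer" means, here is a minimal sketch using only the Python standard library. It does not use Polars or the `PySeries` object; the `array.array` buffer is a stand-in for a single chunk, and the names `values`, `ptr` and `roundtrip` are purely illustrative.

```python
import array
import ctypes

# A contiguous buffer of 64-bit signed integers, standing in for one chunk.
values = array.array("q", [10, 20, 30])

# buffer_info() returns (address, item_count); the address is a plain int,
# which is the form the interchange protocol asks for.
ptr, length = values.buffer_info()

# Reading the memory back through the raw address shows that the integer
# really points at the start of the data. `values` must stay alive while
# the view is in use.
view = (ctypes.c_int64 * length).from_address(ptr)
roundtrip = list(view)

print(ptr, roundtrip)
```

A consumer of the protocol would combine such an address with the buffer size and dtype to reinterpret the memory without copying it.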

stinodego · 2023-01-27T22:29:08Z

@ritchie46 Looks like this only works for Series with an integer dtype. I tried with boolean and string types and I get a PanicException: not implemented.

Is there a chance we can get this to work for all dtypes that are supported by the interchange protocol? So no nested types.

ritchie46 · 2023-01-28T10:08:54Z

Boolean we can still add, but for other data types it doesn't make sense. A string buffer, for instance, is not really useful without its offsets. It is represented as a list of bytes, i.e. nested.

ritchie46 · 2023-01-28T10:12:34Z

On second thought I also think the boolean buffer is useless without an offset into that array.

stinodego · 2023-01-28T10:14:48Z

For string columns, I create an offsets buffer like this (maybe this is completely off the mark, but it made sense to me):

offsets = (
    self._col.str.n_chars()
    .fill_null(0)
    .cumsum()
    .extend_constant(None, 1)
    .shift_and_fill(1, 0)
    .rechunk()
)

Of course, it would be more efficient if I could just access the underlying offsets buffer.

Not sure why boolean buffers would need offsets, since they are not variable length?

ritchie46 · 2023-01-28T10:18:33Z

But shouldn't it be zero copy?

The booleans are represented as a bitmask in a byte slice. Given an array of bytes, you need to know at which bit within the first byte the data starts and how many bits are valid (i.e. the length).
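The point about bit offsets can be sketched in plain Python. This is an illustration only, assuming the least-significant-bit-first ordering that Arrow uses for bitmaps; `unpack_bits` is a hypothetical helper, not a Polars or pyarrow function.

```python
def unpack_bits(data: bytes, offset: int, length: int) -> list[bool]:
    """Decode `length` booleans starting `offset` bits into `data` (LSB first)."""
    out = []
    for i in range(offset, offset + length):
        byte = data[i // 8]                      # byte that holds bit i
        out.append(bool((byte >> (i % 8)) & 1))  # extract bit i within that byte
    return out


# One byte holds eight booleans. A sliced boolean array may start mid-byte,
# so the pointer to the byte alone does not identify the values.
packed = bytes([0b10110100])

whole = unpack_bits(packed, offset=0, length=8)
sliced = unpack_bits(packed, offset=2, length=4)

print(whole)
print(sliced)
```

The same byte pointer yields different values depending on `offset` and `length`, which is why the address on its own is not enough for boolean data.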

stinodego · 2023-01-28T10:26:39Z

> But shouldn't it be zero copy?

I admit I got a bit lazy when I saw all the corners that were cut in the Pandas implementation; indeed that code is not zero copy. But I think we're not going to finish this without cutting a few corners ourselves.

I think I'll take a different route for the interchange for now: I'll write something that uses the pyarrow implementation of the protocol, and raise an error when the user requires zero copy and the data contains categoricals.

And then we can work our way up from there.

ritchie46 · 2023-01-28T10:29:22Z

> For string columns, I create an offsets buffer like this (maybe this is completely off the mark, but it made sense to me):
>
> offsets = (
>     self._col.str.n_chars()
>     .fill_null(0)
>     .cumsum()
>     .extend_constant(None, 1)
>     .shift_and_fill(1, 0)
>     .rechunk()
> )
>
> Of course, it would be more efficient if I could just access the underlying offsets buffer.
>
> Not sure why boolean buffers would need offsets, since they are not variable length?

The offsets are measured in bytes, not in chars. I can give access to that array. Pyarrow could also give it.
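The difference between byte offsets and character counts can be shown with a small standard-library sketch. It assumes UTF-8 encoded string data and treats nulls as zero-length entries; `utf8_offsets` is a hypothetical helper for illustration, not part of Polars or pyarrow.

```python
from itertools import accumulate


def utf8_offsets(values):
    """Return len(values) + 1 byte offsets into the concatenated UTF-8 data."""
    # Nulls occupy no bytes in the data buffer, so they contribute length 0.
    byte_lengths = [len(v.encode("utf-8")) if v is not None else 0 for v in values]
    return [0, *accumulate(byte_lengths)]


strings = ["h\u00e9llo", None, "ab"]  # "h\u00e9llo" is 5 characters but 6 bytes

# Cumulative character counts, for comparison with the byte offsets.
char_lengths = [len(v) if v is not None else 0 for v in strings]
char_offsets = [0, *accumulate(char_lengths)]
offsets = utf8_offsets(strings)

print(char_offsets)
print(offsets)
```

As soon as a value contains a multi-byte character, the two sequences diverge, so offsets built from character counts would point at the wrong positions in the data buffer.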

Maybe it is good to use pyarrow and polars as a hybrid for this.

stinodego · 2023-01-28T10:30:42Z

Yeah, I should've used str.lengths.

> I can give access to that array.

I'll get back to you on this, let me wrestle with the pyarrow thing for a bit! pyarrow 11.0.0 was released recently, and it includes the protocol.

stinodego added the python (Related to Python Polars) and enhancement (New feature or an improvement of an existing feature) labels Jan 19, 2023

stinodego assigned ritchie46 Jan 19, 2023

stinodego mentioned this issue Jan 19, 2023

feat(python): DataFrame interchange protocol implementation #5662

Closed

28 tasks

ritchie46 mentioned this issue Jan 23, 2023

feat(python): allow internal api to get pointer to values buffer #6385

Merged

ritchie46 closed this as completed in #6385 Jan 23, 2023

stinodego reopened this Jan 28, 2023

ritchie46 mentioned this issue Jan 28, 2023

feat(python): better errors in get_ptr and a probability on a boolean… #6522

Merged

ritchie46 closed this as completed in #6522 Jan 28, 2023
