[Data] Re-implement APIs like select_columns with PyArrow batch format #48090

Closed
bveeramani opened this issue Oct 17, 2024 · 0 comments · Fixed by #48140

@bveeramani
Member

select_columns, drop_columns, and add_column are each implemented as a map_batches call whose UDF uses the pandas batch format.

def select_columns(batch):
    return BlockAccessor.for_block(batch).select(columns=cols)

return self.map_batches(
    select_columns,
    batch_format="pandas",
    zero_copy_batch=True,
    compute=compute,
    concurrency=concurrency,
    **ray_remote_args,
)

As a consequence, this implementation converts Arrow blocks to pandas blocks. Because pandas blocks are more issue-prone, we should re-implement these methods with the "pyarrow" batch format.

(Historical context: we needed to use the pandas format because the "pyarrow" batch format didn't work with arbitrary Python objects. With #45272, we no longer have this restriction.)

@bveeramani bveeramani added the data Ray Data-related issues label Oct 17, 2024
MortalHappiness pushed a commit to MortalHappiness/ray that referenced this issue Nov 22, 2024
ray-project#48140)

## Related issue number

Closes ray-project#48090 

Prerequisite: ray-project#48575

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
jecsand838 pushed a commit to jecsand838/ray that referenced this issue Dec 4, 2024
ray-project#48140)

Signed-off-by: Connor Sanders <connor@elastiflow.com>
dentiny pushed a commit to dentiny/ray that referenced this issue Dec 7, 2024
ray-project#48140)

Signed-off-by: hjiang <dentinyhao@gmail.com>