[Data] Re-implement APIs like select_columns
with PyArrow batch format
#48090
Labels
data
Ray Data-related issues
select_columns
with PyArrow batch format
#48090
select_columns
,drop_columns
, andadd_column
are implemented as amap_batches
with a UDF that uses the pandas batch format.ray/python/ray/data/dataset.py
Lines 865 to 875 in d5fa9a0
This implementation has the consequence of converting Arrow blocks to pandas blocks. Because pandas blocks are more issue-prone, we should re-implement these methods with the "pyarrow" batch format.
(Historical context: we needed to use the pandas format because the "pyarrow" batch format didn't work with arbitrary Python objects. With #45272, we no longer have this restriction)
The text was updated successfully, but these errors were encountered: