[Datasets] Allow read_binary_files(output_arrow_format=True) to return Arrow format #33780

c21 · 2023-03-28T00:39:55Z

Why are these changes needed?

This is a reproposal of #32809. It lets read_binary_files return Arrow format by adding a parameter, output_arrow_format. The default is false to keep backward compatibility, and a warning is printed when output_arrow_format is false. A future release will flip the default to true.
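
For readers unfamiliar with this kind of opt-in migration, the pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function name read_binary_files_sketch, the "bytes" column name, the warning text, and the use of a plain dict as a stand-in for a single-column Arrow table are all assumptions made for the example.

```python
import warnings


def read_binary_files_sketch(paths, output_arrow_format=False):
    """Hypothetical stand-in for a reader with an opt-in output format flag."""
    # Read the raw bytes of each file.
    contents = []
    for path in paths:
        with open(path, "rb") as f:
            contents.append(f.read())

    if not output_arrow_format:
        # The old behavior stays the default, but callers are warned
        # that the default is expected to change.
        warnings.warn(
            "The default output format will change in a future release; "
            "pass output_arrow_format=True to opt in now.",
            DeprecationWarning,
            stacklevel=2,
        )
        return contents  # simple format: a plain list of bytes objects

    # Stand-in for a single-column table (assumed column name "bytes").
    return {"bytes": contents}
```

Callers who pass output_arrow_format=True get the new layout and no warning, which is what makes it possible to flip the default later without surprising anyone who has already opted in.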

Related issue number

Closes #32373.

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Cheng Su <scnju13@gmail.com>

ericl

Should we also do this for other non-Arrow datasources such as from_items, range, read_text?

(Actually, remind me why this couldn't be a 100% backwards compatible change? Is the breakage specific to read_binary_files, or would it apply to from_items/range/read_text as well?)

c21 · 2023-03-28T01:12:05Z

> Should we also do this for other non-Arrow datasources such as from_items, range, read_text?

from_items returns simple format.
range returns simple format, but range_table returns Arrow format.
read_text returns Arrow format.

So if we want to do this for other datasources, we could also include from_items. But since from_items only handles in-memory Python objects, there is little motivation, from a performance perspective, to output Arrow format for it.

> (Actually, remind me why this couldn't be a 100% backwards compatible change? Is the breakage specific to read_binary_files, or would it apply to from_items/range/read_text as well?)

#32809 (comment) is the reason why we cannot keep 100% backwards compatibility. Dataset.iter_rows() has different row structures between Arrow and simple blocks:

Arrow:
for row in ds.iter_rows():
  col1 = row["col1"]
  col2 = row["col2"]

Simple:
for row in ds.iter_rows():
  col1 = row[0]
  col2 = row[1]

So this breakage should apply to all datasources.
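
To make the incompatibility concrete, here is a small self-contained sketch. It uses plain Python tuples and dicts as stand-ins for the two row structures shown above; these are not Ray's actual row types, and the helper function names are hypothetical.

```python
def first_two_columns_positional(rows):
    # User code written against simple blocks: columns accessed by position.
    return [(row[0], row[1]) for row in rows]


def first_two_columns_by_name(rows):
    # User code written against Arrow blocks: columns accessed by name.
    return [(row["col1"], row["col2"]) for row in rows]


# Stand-ins for what iterating over rows yields in each format.
simple_rows = [(1, "a"), (2, "b")]
arrow_like_rows = [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}]

# Each access style works only with the row structure it was written for.
first_two_columns_positional(simple_rows)
first_two_columns_by_name(arrow_like_rows)

try:
    # Positional access on name-keyed rows fails: 0 is not a column name.
    first_two_columns_positional(arrow_like_rows)
except KeyError as exc:
    print("positional access broke after the format change:", repr(exc))
```

Existing user loops that index rows by position would fail in this way once the block format changes, which is why the new behavior is gated behind a flag instead of being switched on directly.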

ericl · 2023-03-28T03:20:46Z

I see, so any multi-column output would be a breaking change. It seems we could probably support from_items and read_text though, since those are single-column.

I think the ideal outcome would be to deprecate SimpleBlock entirely in favor of single column Arrow tables, so any change in this direction sounds good to me.

c21 · 2023-03-28T03:32:58Z

> I think the ideal outcome would be to deprecate SimpleBlock entirely in favor of single column Arrow tables, so any change in this direction sounds good to me.

Yes, it's a TODO on the team's list.

c21 · 2023-03-28T03:36:06Z

Btw, read_binary_files is used for reading audio and video, so I feel it's important to make it work well for our users in the short term. It's kind of awkward to ask users to always apply this trick:

ds = ray.data.read_binary_files(...)
ds = ds.map_batches(lambda x: x, batch_size=None, batch_format="pyarrow")

More importantly, it may cause users to churn before we even know about it.


c21 added 2 commits March 27, 2023 17:35

Allow read_binary_files(output_arrow_format=True) to return Arrow format

29543eb

Signed-off-by: Cheng Su <scnju13@gmail.com>

Only print warning if output_arrow_format is false

247ddad

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 requested review from ericl, scv119, clarkzinzow, jjyao and jianoaix as code owners March 28, 2023 00:39

c21 assigned ericl, clarkzinzow, jianoaix and amogkam Mar 28, 2023

ericl reviewed Mar 28, 2023


ericl approved these changes Mar 28, 2023


c21 added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) Mar 28, 2023

ericl merged commit 5240597 into ray-project:master Mar 28, 2023

c21 deleted the binary-arrow branch March 28, 2023 04:35

elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023

[Datasets] Allow read_binary_files(output_arrow_format=True) to retur…

dbdbea4

…n Arrow format (ray-project#33780) Signed-off-by: elliottower <elliot@elliottower.com>

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023

[Datasets] Allow read_binary_files(output_arrow_format=True) to retur…

05b146d

…n Arrow format (ray-project#33780) Signed-off-by: Jack He <jackhe2345@gmail.com>
