Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Allow read_binary_files(output_arrow_format=True) to return Arrow format #33780

Merged
merged 2 commits into from
Mar 28, 2023

Conversation

c21
Copy link
Contributor

@c21 c21 commented Mar 28, 2023

Why are these changes needed?

This is a reproposal of #32809, to allow read_binary_files to return Arrow format, by adding a parameter output_arrow_format. Default is false, to keep backward compatiblity. Print a warning if output_arrow_format is false. A future release will flip the bit to set this parameter to true default.

Related issue number

Closes #32373 .

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

c21 added 2 commits March 27, 2023 17:35
Signed-off-by: Cheng Su <scnju13@gmail.com>
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also do this for other non-Arrow datasources such as from_items, range, read_text?

(Actually, remind me why this couldn't be a 100% backwards compatible change? Is the breakage specific to read_binary_files, or would it apply to from_items/range/read_text as well?)

@c21
Copy link
Contributor Author

c21 commented Mar 28, 2023

Should we also do this for other non-Arrow datasources such as from_items, range, read_text?

  • from_items returns simple format.
  • range returns simple format, but we have range_table returns Arrow format.
  • read_text returns Arrow format.

So it looks to me if we want to do for other datasources, we can also include from_items. But given the scale of from_items is for in-memory Python objects, it looks like low motivation to output Arrow format from performance perspective.

(Actually, remind me why this couldn't be a 100% backwards compatible change? Is the breakage specific to read_binary_files, or would it apply to from_items/range/read_text as well?)

#32809 (comment) is the reason why we cannot keep 100% backwards compatibility. Dataset.iter_rows() has different row structures between Arrow and simple blocks:

Arrow:
for row in ds.iter_rows():
  col1 = row["col1"]
  col2 = row["col2"]

Simple:
for row in ds.iter_rows():
  col1 = row[0]
  col2 = row[1]

So this breakage should apply to all datasources.

@ericl
Copy link
Contributor

ericl commented Mar 28, 2023

I see, so any multi column outputs would be a breaking change. It seems we could probably support from_items and read_text though since those are single column.

I think the ideal outcome would be to deprecate SimpleBlock entirely in favor of single column Arrow tables, so any change in this direction sounds good to me.

@c21
Copy link
Contributor Author

c21 commented Mar 28, 2023

I think the ideal outcome would be to deprecate SimpleBlock entirely in favor of single column Arrow tables, so any change in this direction sounds good to me.

Yes, it's a TODO on team's list.

@c21
Copy link
Contributor Author

c21 commented Mar 28, 2023

btw read_binary is used for reading audio and video, so I feel it's important to make it work well for our users in short-term. It's kind of awkward to ask users always apply this trick:

ds = ray.data.read_binary_files(...)
ds = ds.map_batches(lambda x:x, batch_size=None, batch_format="pyarrow")

And more importantly it may cause users churning before we even know about it.

@c21 c21 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 28, 2023
@ericl ericl merged commit 5240597 into ray-project:master Mar 28, 2023
@c21 c21 deleted the binary-arrow branch March 28, 2023 04:35
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…n Arrow format (ray-project#33780)

Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…n Arrow format (ray-project#33780)

Signed-off-by: Jack He <jackhe2345@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] Change BinaryDatasource to output Arrow block
5 participants