[Data] Throw exception for non-streaming HF datasets with "override_num_blocks" argument (ray-project#47559)

## Why are these changes needed?

As reported in ray-project#47507, from_huggingface() does not support
override_num_blocks for non-streaming HF Datasets, so it should raise an
exception when that argument is passed. For streaming datasets,
from_huggingface() should also forward the remaining read arguments
(parallelism, concurrency, override_num_blocks) to the datasource read.

## Related issue number
Closes ray-project#47507

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under
the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

I manually tested the exception path. Let me know if more tests are
needed.

---------

Signed-off-by: Xingyu Long <xingyulong97@gmail.com>
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
2 people authored and ujjawal-khare committed Oct 15, 2024
1 parent 62f9f3e commit e6f3e45
Showing 1 changed file with 14 additions and 1 deletion.
15 changes: 14 additions & 1 deletion python/ray/data/read_api.py
@@ -2838,8 +2838,21 @@ def from_huggingface(
 
     if isinstance(dataset, datasets.IterableDataset):
         # For an IterableDataset, we can use a streaming implementation to read data.
-        return read_datasource(HuggingFaceDatasource(dataset=dataset))
+        return read_datasource(
+            HuggingFaceDatasource(dataset=dataset),
+            parallelism=parallelism,
+            concurrency=concurrency,
+            override_num_blocks=override_num_blocks,
+        )
     if isinstance(dataset, datasets.Dataset):
+        # For non-streaming Hugging Face Dataset, we don't support override_num_blocks
+        if override_num_blocks is not None:
+            raise ValueError(
+                "`override_num_blocks` parameter is not supported for "
+                "non-streaming Hugging Face Datasets. Please omit the parameter or "
+                "use streaming mode to read the dataset."
+            )
+
         # To get the resulting Arrow table from a Hugging Face Dataset after
         # applying transformations (e.g., train_test_split(), shard(), select()),
         # we create a copy of the Arrow table, which applies the indices
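
The dispatch in this hunk can be tried without Ray or `datasets` installed. This is a minimal sketch, not the library's implementation. `StreamingDataset`, `InMemoryDataset`, and `plan_read` are hypothetical stand-ins for `datasets.IterableDataset`, `datasets.Dataset`, and `from_huggingface()`, and the default argument values are assumptions. The error text is worded to match the non-streaming branch in which the guard sits.

```python
from typing import Any, Dict, Optional


class StreamingDataset:
    """Hypothetical stand-in for datasets.IterableDataset."""


class InMemoryDataset:
    """Hypothetical stand-in for datasets.Dataset."""


def plan_read(
    dataset: object,
    parallelism: int = -1,  # assumed default, for illustration only
    concurrency: Optional[int] = None,
    override_num_blocks: Optional[int] = None,
) -> Dict[str, Any]:
    """Return the read options that would be forwarded, or raise on misuse."""
    if isinstance(dataset, StreamingDataset):
        # Streaming path: forward every read option instead of dropping them.
        return {
            "mode": "streaming",
            "parallelism": parallelism,
            "concurrency": concurrency,
            "override_num_blocks": override_num_blocks,
        }
    if isinstance(dataset, InMemoryDataset):
        # Non-streaming path: the block count cannot be overridden, so fail early.
        if override_num_blocks is not None:
            raise ValueError(
                "`override_num_blocks` parameter is not supported for "
                "non-streaming Hugging Face Datasets. Please omit the parameter "
                "or use streaming mode to read the dataset."
            )
        return {"mode": "in_memory"}
    raise TypeError(f"Unsupported dataset type: {type(dataset).__name__}")
```

Calling `plan_read(InMemoryDataset(), override_num_blocks=8)` raises `ValueError`, while the same argument on a `StreamingDataset` is passed through unchanged.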