Add batching to IterableDataset
#7054
Cool ! Thanks for diving into it :) Your implementation is great and indeed supports shuffling and batching; you just need to additionally account for state_dict (for dataset checkpointing + resuming).

That being said, I believe the implementation can be made simpler by relying on .map():

    def batch(self, batch_size: int, drop_last_batch: bool = False) -> "IterableDataset":
        def batch(unbatched: dict[str, list]) -> dict[str, list]:
            # Wrap each column's list in an outer list so the whole chunk becomes one row
            return {k: [v] for k, v in unbatched.items()}
        return self.map(batch, batched=True, batch_size=batch_size, drop_last_batch=drop_last_batch)

And this way there's no need to reimplement everything ! (My only small concern is that it's not an Arrow-optimized function, so it requires the examples to be manipulated as Python objects even if the original data is in Arrow format, e.g. when streaming Parquet files. But it's not a big deal and we can see later if we need to optimize this.)
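To make the suggestion above concrete, here is a minimal sketch of what the inner function does to a single chunk. It assumes that a batched map call hands the function a dict mapping column names to lists of values; the name `batch_fn` and the sample columns `id` and `text` are illustrative only, and the sketch runs on plain Python dicts without the library.

```python
def batch_fn(unbatched):
    # Input: one chunk as {column name: list of values}, which is the shape
    # assumed for functions used with map(batched=True).
    # Output: the same columns, each wrapped in an outer list, so the result
    # describes a single row whose values are the whole chunk.
    return {k: [v] for k, v in unbatched.items()}


# Hypothetical chunk of three examples with two columns.
chunk = {"id": [0, 1, 2], "text": ["a", "b", "c"]}
batched = batch_fn(chunk)

print(batched)
# {'id': [[0, 1, 2]], 'text': [['a', 'b', 'c']]}
```

Because the output has one row per input chunk, the chunk size passed to the map call is what determines the resulting batch size.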
Thanks a lot for the feedback @lhoestq! I definitely could have saved some time looking into it properly first. 😅 Implemented the Let me know what you think and if this needs some update.
…between Dataset and IterableDataset"
Nice thanks ! I added a few suggestions before we merge.
Thanks for the feedback @lhoestq! Applied it and referenced the
lgtm !
* feat: add `.batch()` to `IterableDataset` and introduce new `BatchedExamplesIterable`
* style: formatting...
* refactor: implement feedback to use .map()
* test: add tests for new `batch()` method
* style: formatting...
* fix: remove type hints in `batch_fn()` to fix failing CI
* docs: add section "Batching data in IterableDataset" to "Differences between Dataset and IterableDataset"
* refactor: apply feedback
* docs nit
---------
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
I've taken a try at implementing a batched `IterableDataset` as requested in issue #6279. This PR adds a new `BatchedExamplesIterable` class and a `.batch()` method to the `IterableDataset` class.

The main changes are:
* `BatchedExamplesIterable` that groups examples into batches.
* `.batch()` method for `IterableDataset` to easily create batched versions.

I'm not sure if this is exactly what you had in mind and also have not fully tested it atm, so I'd really appreciate your feedback. Does this seem like it's heading in the right direction? I'm happy to make any changes or explore different approaches if needed.

Pinging @lhoestq
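
For readers following along, the idea of grouping a stream of examples into batches can be sketched in plain Python. This is not the `BatchedExamplesIterable` code from the PR: the names `iter_batches` and `_collate` are hypothetical, the sketch assumes every example is a dict with the same keys, and it ignores shuffling and `state_dict` checkpointing entirely.

```python
from typing import Any, Dict, Iterable, Iterator, List


def _collate(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    # Turn a list of row dicts into one dict of column lists.
    # Assumes all rows share the keys of the first row.
    return {key: [row[key] for row in rows] for key in rows[0]}


def iter_batches(
    examples: Iterable[Dict[str, Any]],
    batch_size: int,
    drop_last_batch: bool = False,
) -> Iterator[Dict[str, List[Any]]]:
    # Hypothetical helper: lazily group a stream of examples into batches.
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    buffer: List[Dict[str, Any]] = []
    for example in examples:
        buffer.append(example)
        if len(buffer) == batch_size:
            yield _collate(buffer)
            buffer = []
    # Emit the final, smaller batch unless the caller asked to drop it.
    if buffer and not drop_last_batch:
        yield _collate(buffer)


stream = ({"id": i} for i in range(5))
for batch in iter_batches(stream, batch_size=2):
    print(batch)
# {'id': [0, 1]}
# {'id': [2, 3]}
# {'id': [4]}
```

With `drop_last_batch=True`, the trailing batch of one example would be skipped, leaving two batches.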