Add `batch` method to `Dataset` class #7064
Looks good to me! :) you might want to add the
Thanks for the feedback @lhoestq! The last commits include:
WDYT?
You can put the documentation in process.mdx :)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
af3d739 to 7b02d5f
I reset the head to the commit before I added the
LGTM, thanks! The CI failures are unrelated to your PR.
Benchmarks (PyArrow==8.0.0): benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json
* feat: add `batch` method to `Dataset` class
* feat: add `num_proc` arg from `map` to `batch`
* test: add test for `Dataset.batch()`
* style: formatting...
* docs: move `Dataset.batch()` documentation to `process.mdx`
* docs: add `num_proc` to docs
* Apply suggestions from code review

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
This PR introduces a new `batch` method to the `Dataset` class, aligning its functionality with the `IterableDataset.batch()` method (implemented in #7054). The implementation also reuses the existing `map` method for efficient batching of examples.

Key changes:
- Add a `batch` method to the `Dataset` class in `arrow_dataset.py`
- Use the existing `map` method for batching

Closes #7063

Once the approach is approved, I will create the tests and update the documentation.
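
For reviewers, here is a rough plain-Python sketch of the idea of building `batch` on top of a batched `map`. It is not the code in `arrow_dataset.py`: the `batched_map` helper, the `Columns` alias, and the `drop_last_batch` flag are assumptions made for illustration, and the data is modeled as a plain dict of lists rather than a real `Dataset`.

```python
from typing import Any, Callable, Dict, List

# Illustrative alias: a table stored column-wise as {column_name: [values...]}.
Columns = Dict[str, List[Any]]


def batched_map(
    columns: Columns,
    fn: Callable[[Columns], Columns],
    batch_size: int,
    drop_last_batch: bool = False,
) -> Columns:
    """Hypothetical stand-in for a batched map over columnar data.

    Calls `fn` on consecutive slices of at most `batch_size` rows and
    concatenates the columns that `fn` returns.
    """
    num_rows = len(next(iter(columns.values()))) if columns else 0
    out: Columns = {}
    for start in range(0, num_rows, batch_size):
        chunk = {name: values[start:start + batch_size] for name, values in columns.items()}
        chunk_rows = len(next(iter(chunk.values())))
        if drop_last_batch and chunk_rows < batch_size:
            break  # skip the trailing, incomplete batch
        for name, values in fn(chunk).items():
            out.setdefault(name, []).extend(values)
    return out


def batch_fn(chunk: Columns) -> Columns:
    # Wrap each column slice in a list so the whole slice becomes ONE output row.
    return {name: [values] for name, values in chunk.items()}


def batch(columns: Columns, batch_size: int, drop_last_batch: bool = False) -> Columns:
    """Group rows into batches by reusing the batched map helper."""
    return batched_map(columns, batch_fn, batch_size, drop_last_batch)


data = {"id": [0, 1, 2, 3, 4], "text": ["a", "b", "c", "d", "e"]}
batched = batch(data, batch_size=2)
print(batched)
```

The point of the sketch is that the batching function only has to wrap each column slice in a list; the chunking itself comes from the map step, which is why reusing `map` keeps the method small.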