
[AIR/Data] Add collate_fn to iter_torch_batches #32412

Merged
merged 29 commits into ray-project:master on Feb 21, 2023

Conversation

amogkam
Contributor

@amogkam amogkam commented Feb 10, 2023

Signed-off-by: amogkam amogkamsetty@yahoo.com

Adds a collate_fn argument to iter_batches and iter_torch_batches. This is useful for any last-mile preprocessing that is applied directly to the batch before it is used for training.

Closes #32224.
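For illustration, a collate_fn receives each materialized batch (for tabular data, a dict mapping column names to column values) and returns the transformed batch. The sketch below mimics that hook with plain Python lists rather than real Ray Data batches; the toy `iter_batches` helper and the column names are hypothetical, not the actual Ray API.

```python
# Hypothetical sketch of collate_fn semantics: the function receives the
# materialized batch and returns the (possibly restructured) batch.
def collate_fn(batch):
    # Last-mile preprocessing: scale a feature column, pair it with labels.
    features = [x / 255.0 for x in batch["image"]]
    labels = batch["label"]
    return list(zip(features, labels))

def iter_batches(rows, batch_size, collate_fn=None):
    """Toy stand-in for an iter_batches-style API with a collate_fn hook."""
    batch = {"image": [], "label": []}
    for image, label in rows:
        batch["image"].append(image)
        batch["label"].append(label)
        if len(batch["image"]) == batch_size:
            yield collate_fn(batch) if collate_fn else batch
            batch = {"image": [], "label": []}

rows = [(255, 1), (0, 0), (127.5, 1), (51, 0)]
batches = list(iter_batches(rows, batch_size=2, collate_fn=collate_fn))
print(batches)  # [[(1.0, 1), (0.0, 0)], [(0.5, 1), (0.2, 0)]]
```

The point of the hook is that the collation runs inside the iterator, where it can be overlapped with batch fetching, rather than as a separate loop in user code.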

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
@amogkam amogkam marked this pull request as ready for review February 11, 2023 01:35
python/ray/data/dataset.py
python/ray/data/dataset.py
python/ray/air/_internal/torch_utils.py
python/ray/data/dataset.py
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Member

@bveeramani bveeramani left a comment

LGTM

python/ray/air/_internal/torch_utils.py
python/ray/data/_internal/block_batching.py
python/ray/data/_internal/bulk_dataset_iterator.py
python/ray/data/_internal/pipelined_dataset_iterator.py
python/ray/data/dataset.py
python/ray/data/dataset.py
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
python/ray/air/_internal/torch_utils.py
python/ray/data/dataset.py
@@ -2858,6 +2859,7 @@ def iter_batches(
to select ``numpy.ndarray`` for tensor datasets and
``Dict[str, numpy.ndarray]`` for tabular datasets. Default is "default".
drop_last: Whether to drop the last batch if it's incomplete.
collate_fn: A function to apply to each data batch before returning it.
Contributor

@clarkzinzow clarkzinzow Feb 14, 2023

Hmm I don't know if it makes sense to have this as part of the ds.iter_batches() API, since the primary motivation behind this feature is to (1) make the Torch DataLoader port easier, and (2) provide a hook to override how we convert NumPy ndarrays to Torch Tensors. collate_fn doesn't provide much critical value for the core ds.iter_batches() API: it's functionally exactly equivalent to

for batch in ds.iter_batches():
    batch = collate_fn(batch)

Is the primary motivation for this to be able to include it in the local pipelining operation chain, to improve collation performance? If that's the case, I'm wondering if ds.iter_torch_batches() shouldn't reuse ds.iter_batches(), and we should instead have ds.iter_torch_batches() use the internal batch_block_refs API directly, which can expose collate_fn. Most of the batching logic is captured in batch_block_refs anyway.
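The functional-equivalence claim above can be checked with a toy iterator (plain Python, no Ray; the names are hypothetical stand-ins): applying collate_fn inside the iterator yields exactly what you would get by collating each yielded batch yourself.

```python
def collate_fn(batch):
    # Arbitrary per-batch transform for the demonstration.
    return [x * 2 for x in batch]

def iter_batches(data, batch_size, collate_fn=None):
    # Toy stand-in: yields fixed-size batches, optionally collated inside.
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        yield collate_fn(batch) if collate_fn else batch

data = list(range(6))

# collate_fn applied inside the iterator...
inside = list(iter_batches(data, 2, collate_fn))

# ...matches collating each yielded batch in user code.
outside = [collate_fn(batch) for batch in iter_batches(data, 2)]

assert inside == outside == [[0, 2], [4, 6], [8, 10]]
```

The difference the PR cares about is not the result but where the work happens: hooking collation into the iterator lets it participate in the internal pipelining.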

Contributor Author

Yeah I was also thinking the same...will make the change.

The easiest thing is just to add a private arg to iter_batches. IMO that would be preferable for future maintenance to having both APIs use batch_block_refs directly.

Contributor Author

Updated to make _collate_fn private for iter_batches

Contributor

Hmm I'd rather not muddy the user-facing .iter_batches() API even with a private _collate_fn arg, if possible, since it's already a pretty wide interface. And since batch_block_refs contains most of the batching logic, I don't think that there's much maintenance or drift risk here.

What are your primary concerns for maintenance here?

Contributor Author

It's mostly about the timing logic, but it's not a big deal. Updated to use batch_block_refs directly.

Contributor

@clarkzinzow clarkzinzow Feb 14, 2023

If stuff like execution triggering and stats collection becomes involved enough around batch_block_refs, we can always introduce a private Dataset._iter_batches() method that can support all of the other batch-based iterator APIs without making the user-facing Dataset.iter_batches() API any wider.

But for the sake of this PR, duplicating that thin logic around batch_block_refs seems fine to me!

Contributor Author

This actually doesn't work for the pipelined case...for some reason that path calls into Dataset.iter_torch_batches.

Contributor Author

@amogkam amogkam Feb 15, 2023

DatasetPipeline makes an implicit assumption that Dataset.iter_torch_batches calls into self.iter_batches, which no longer holds true if we change this implementation.

Contributor Author

I think a private arg is the way to go, because the abstractions here are very messy. We can remove it later once we deprecate DatasetPipeline.
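A minimal sketch of the private-argument approach settled on here (hypothetical names; not the actual Ray implementation): the public iter_batches keeps its documented signature, while iter_torch_batches forwards its collation hook through a leading-underscore keyword, so pipelined wrappers that delegate through iter_batches keep working.

```python
from typing import Callable, Iterator, List, Optional

class Dataset:
    """Toy dataset; only sketches how a private _collate_fn arg is plumbed."""

    def __init__(self, rows: List[int]):
        self._rows = rows

    def iter_batches(
        self,
        batch_size: int = 2,
        _collate_fn: Optional[Callable] = None,  # private: kept out of the docs
    ) -> Iterator[list]:
        for i in range(0, len(self._rows), batch_size):
            batch = self._rows[i:i + batch_size]
            yield _collate_fn(batch) if _collate_fn else batch

    def iter_torch_batches(
        self,
        batch_size: int = 2,
        collate_fn: Optional[Callable] = None,
    ) -> Iterator[list]:
        # The torch-facing API forwards its public collate_fn via the private
        # arg, defaulting to the built-in conversion (floats stand in for
        # the real NumPy-to-Tensor conversion here).
        def default_collate(batch):
            return [float(x) for x in batch]

        return self.iter_batches(batch_size, _collate_fn=collate_fn or default_collate)

ds = Dataset([1, 2, 3, 4])
print(list(ds.iter_torch_batches()))  # [[1.0, 2.0], [3.0, 4.0]]
```

Because iter_torch_batches still calls into iter_batches, a pipeline wrapper that overrides iter_batches is automatically picked up, which is the assumption DatasetPipeline relies on.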

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
amogkam and others added 12 commits February 14, 2023 16:45
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Contributor

@clarkzinzow clarkzinzow left a comment

LGTM, nice work on this one!

python/ray/data/_internal/block_batching.py
Signed-off-by: amogkam <amogkamsetty@yahoo.com>
@amogkam amogkam merged commit 8c054fc into ray-project:master Feb 21, 2023
@amogkam amogkam deleted the collate_fn branch February 21, 2023 18:19
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Adds a collate_fn argument to iter_batches and iter_torch_batches. This is useful for any last-mile preprocessing that is applied directly to the batch before it is used for training.

Closes ray-project#32224.

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
Adds a collate_fn argument to iter_batches and iter_torch_batches. This is useful for any last-mile preprocessing that is applied directly to the batch before it is used for training.

Closes ray-project#32224.

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Adds a collate_fn argument to iter_batches and iter_torch_batches. This is useful for any last-mile preprocessing that is applied directly to the batch before it is used for training.

Closes ray-project#32224.

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: elliottower <elliot@elliottower.com>
Successfully merging this pull request may close these issues.

[Ray Data] Introduce collate_fn argument in Dataset.iter_torch_batches
9 participants