
🚨🚨🚨 Replace DataLoader logic for Accelerate in Trainer, remove unneeded tests 🚨🚨🚨 #24028

Merged: 4 commits into main on Jun 12, 2023

Conversation

@muellerzr (Contributor) commented on Jun 5, 2023:

What does this PR do?

This PR:

  • Guts the DataLoader internals in all the basic distributed setups (replacing pl.Loader for TPU will come in a follow-up PR) and replaces them with accelerator.prepare (a minimal sketch follows this list)
  • Removes two tests that were deemed unnecessary
    • Test 1 removed: tests/trainer/test_trainer.py::TrainerIntegrationTest::test_sampler_seed. Resetting the seed is no longer necessary: Accelerate's dataloader setup needs no extra help when iterating or loading the seed back in, regardless of the torch version
    • Test 2 removed: tests/trainer/test_trainer.py::TrainerIntegrationTest::test_training_finite_iterable_dataset. With Accelerate's new sampler for IterableDataset, we now catch when it is None and raise an error. A new test and a clear error message will be added on the Accelerate side, with a test added to Trainer afterwards
  • Modifies two tests to use the proper attribute: Accelerate's prepared DataLoaders all expose total_batch_size rather than batch_size
    • test_train_and_eval_dataloaders and test_data_is_not_parallelized_when_model_is_parallel
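
To make the change concrete, here is a minimal sketch of the pattern this PR moves to (this is not the Trainer's actual code; the toy dataset and batch size are made up):

```python
# Minimal sketch: hand a plain DataLoader to accelerator.prepare instead of
# building DistributedSampler-backed loaders by hand. Toy data, made-up sizes.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
dataset = TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

# prepare() shards the loader across processes and returns a wrapper that
# exposes total_batch_size (per-device batch size times the number of
# processes) rather than batch_size, hence the attribute change in the tests.
dataloader = accelerator.prepare(dataloader)
print(dataloader.total_batch_size)
```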

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger @pacman100

@muellerzr added labels on Jun 5, 2023: External (using the library with external tools: onnx, tflite, ...), Tests (related to tests), Distributed Training / Models, trainer
@muellerzr muellerzr requested review from pacman100 and sgugger June 5, 2023 18:07
@sgugger (Collaborator) left a comment:

Thanks a ton, LGTM!

Comment on lines +1745 to +1746:

```python
for _ in train_dataloader:
    break
```

@sgugger (Collaborator):

Is this still necessary with the RNG reloading?
Edit: ah looks like we are not doing that yet.

@muellerzr (Contributor, Author):

Yep not yet, so for now it's needed :)
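
For context, a minimal sketch of why consuming a single batch per skipped epoch is enough (toy dataset; epochs_trained is a hypothetical value, and this assumes a seeded generator drives the shuffling):

```python
# Minimal sketch of the resume trick discussed above. With a seeded
# torch.Generator driving shuffle=True, RandomSampler draws the entire
# epoch's permutation as soon as iteration starts, so pulling a single
# batch advances the generator state just as a completed epoch would.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
generator = torch.Generator().manual_seed(42)
train_dataloader = DataLoader(dataset, batch_size=2, shuffle=True, generator=generator)

epochs_trained = 3  # hypothetical: epochs already consumed before resuming
for _ in range(epochs_trained):
    for _ in train_dataloader:
        break  # one batch is enough to trigger the permutation draw
```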

@HuggingFaceDocBuilderDev commented on Jun 5, 2023:

The documentation is not available anymore as the PR was closed or merged.

@pacman100 (Contributor) left a comment:

Thank you @muellerzr for all the cool work on using Accelerate for preparing dataloaders, using accelerator.gather_for_metrics and accelerator.pad_across_processes for further simplification. 🚀

Left a couple of comments regarding a few training arguments being ignored.

Referenced diff:

```python
seed = self.args.data_seed
generator.manual_seed(seed)

seed = self.args.data_seed if self.args.data_seed is not None else self.args.seed
```
@pacman100 (Contributor):
if the user specifies self.args.data_seed or self.args.seed, it is ignored. Would this hinder reproducible experimentation?

@muellerzr (Contributor, Author):

The tests haven't shown that it does, because (from what Sylvain and I were able to find) our dataloaders don't interact with the same random-seed mechanism these generators use. Keeping this generator in also causes tests to fail, showing that it's now detrimental with our new DataLoader setup.

Also, the seed is used when iterating (__iter__) through the dataset here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#L665-L667. Because we don't use a torch DistributedSampler, I don't believe it applies to us here.

If someone later points to a reproducibility issue caused by this, I'll review it again, but in my deep dive last week I didn't find one.
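
For reference, a minimal sketch of the generator-seeding pattern under discussion (Args is a hypothetical stand-in for TrainingArguments, not the real class):

```python
# Minimal sketch of seeding a sampler generator from data_seed, falling
# back to the global seed. Args is a made-up stand-in for TrainingArguments.
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

class Args:
    seed = 42         # global training seed
    data_seed = None  # optional separate seed for data sampling

args = Args()
generator = torch.Generator()
generator.manual_seed(args.data_seed if args.data_seed is not None else args.seed)

dataset = TensorDataset(torch.randn(16, 3))
sampler = RandomSampler(dataset, generator=generator)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=4)
```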

```diff
@@ -3205,72 +3083,42 @@ def evaluation_loop(
             # Update containers on host
             if loss is not None:
-                losses = self._nested_gather(loss.repeat(batch_size))
-                losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0)
+                losses = self.accelerator.gather_for_metrics((loss.repeat(batch_size)))
```
@pacman100 (Contributor):

Nice!
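
For illustration, a minimal runnable sketch of what gather_for_metrics does (toy single-process setup; the dataset and sizes are made up):

```python
# Minimal sketch of gather_for_metrics. In multi-process runs Accelerate
# pads the ragged last batch with duplicated samples; gather_for_metrics
# gathers across processes and strips those duplicates, so metrics see
# each example exactly once.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
dataset = TensorDataset(torch.arange(10, dtype=torch.float32))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=4))

gathered = []
for (batch,) in dataloader:
    gathered.append(accelerator.gather_for_metrics(batch))
print(torch.cat(gathered))  # the 10 original values, each exactly once
```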

An additional review comment on src/transformers/trainer.py was marked as resolved (outdated).
@muellerzr muellerzr requested a review from pacman100 June 6, 2023 15:04
@pacman100 (Contributor) left a comment:

Thank you for your responses to the comments and for iterating on the PR. Super cool!

@muellerzr muellerzr merged commit ebd94b0 into main Jun 12, 2023
@muellerzr muellerzr deleted the muellerzr-accelerate-dataloaders-v2 branch June 12, 2023 15:23
@muellerzr mentioned this pull request on Jun 12, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request on Jun 23, 2023:

🚨🚨🚨 Replace DataLoader logic for Accelerate in Trainer, remove unneeded tests 🚨🚨🚨 (huggingface#24028)

* Working integration

* Fix failing test

* Revert label host logic

* Bring it back!
@franz101 commented:

Great PR! This is currently breaking my custom collate_fn in the dataloader, and I'm still trying to understand why. My first assumption is that it might be due to multiprocessing?

@muellerzr (Contributor, Author):
@franz101 please open an issue with a reproducer of what you are trying to do so we can help :)
