
🚨🚨🚨 Replace DataLoader logic for Accelerate in Trainer, remove unneeded tests 🚨🚨🚨 #24028

Merged: 4 commits into main on Jun 12, 2023

Conversation

@muellerzr (Contributor) commented on Jun 5, 2023:

What does this PR do?

This PR:

  • Guts the DataLoader internals in all the basic distributed setups (replacing pl.Loader for TPU will come in a follow-up PR) and replaces them with accelerator.prepare (a minimal sketch follows this list)
  • Removes two tests that were deemed unnecessary
    • Test 1 removed: tests/trainer/test_trainer.py::TrainerIntegrationTest::test_sampler_seed. Resetting the seed is no longer necessary: Accelerate's dataloader setup needs no extra help when iterating or loading the seed back in, regardless of the torch version
    • Test 2 removed: tests/trainer/test_trainer.py::TrainerIntegrationTest::test_training_finite_iterable_dataset. With Accelerate's new sampler for IterableDataset, we now catch when it is None and raise an error. A new test and a clear error message will be added on the Accelerate side, with a test added to Trainer afterwards
  • Modifies two tests to use the proper attribute: Accelerate's prepared DataLoaders all expose total_batch_size rather than batch_size
    • test_train_and_eval_dataloaders and test_data_is_not_parallelized_when_model_is_parallel
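
To make the change concrete, here is a minimal sketch of the pattern this PR moves to (this is not the Trainer's actual code; the toy dataset and batch size are made up):

```python
# Minimal sketch: hand a plain DataLoader to accelerator.prepare instead of
# building DistributedSampler-backed loaders by hand. Toy data, made-up sizes.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
dataset = TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

# prepare() shards the loader across processes and returns a wrapper that
# exposes total_batch_size (per-device batch size times the number of
# processes) rather than batch_size, hence the attribute change in the tests.
dataloader = accelerator.prepare(dataloader)
print(dataloader.total_batch_size)
```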

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger @pacman100

@muellerzr added labels on Jun 5, 2023: External (using the library with external tools: onnx, tflite, ...), Tests (related to tests), Distributed Training / Models, trainer
@muellerzr muellerzr requested review from pacman100 and sgugger June 5, 2023 18:07
@sgugger (Collaborator) left a comment:

Thanks a ton, LGTM!

Comment on lines +1745 to +1746:

```python
for _ in train_dataloader:
    break
```

@sgugger (Collaborator):

Is this still necessary with the RNG reloading?
Edit: ah looks like we are not doing that yet.

@muellerzr (Contributor, Author):

Yep not yet, so for now it's needed :)
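
For context, a minimal sketch of why consuming a single batch per skipped epoch is enough (toy dataset; epochs_trained is a hypothetical value, and this assumes a seeded generator drives the shuffling):

```python
# Minimal sketch of the resume trick discussed above. With a seeded
# torch.Generator driving shuffle=True, RandomSampler draws the entire
# epoch's permutation as soon as iteration starts, so pulling a single
# batch advances the generator state just as a completed epoch would.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
generator = torch.Generator().manual_seed(42)
train_dataloader = DataLoader(dataset, batch_size=2, shuffle=True, generator=generator)

epochs_trained = 3  # hypothetical: epochs already consumed before resuming
for _ in range(epochs_trained):
    for _ in train_dataloader:
        break  # one batch is enough to trigger the permutation draw
```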

@HuggingFaceDocBuilderDev commented on Jun 5, 2023:

The documentation is not available anymore as the PR was closed or merged.

@pacman100 (Contributor) left a comment:

Thank you @muellerzr for all the cool work on using Accelerate for preparing dataloaders, using accelerator.gather_for_metrics and accelerator.pad_across_processes for further simplification. 🚀

Left a couple of comments regarding a few training arguments being ignored.

Referenced diff:

```python
seed = self.args.data_seed
generator.manual_seed(seed)

seed = self.args.data_seed if self.args.data_seed is not None else self.args.seed
```
@pacman100 (Contributor):
if the user specifies self.args.data_seed or self.args.seed, it is ignored. Would this hinder reproducible experimentation?

@muellerzr (Contributor, Author):

The tests haven't shown that it does, because (from what Sylvain and I were able to find) our dataloaders don't interact with the same random-seed mechanism these generators use. Keeping this generator in also causes tests to fail, showing that it's now detrimental with our new DataLoader setup.

Also, the seed is used when iterating (__iter__) through the dataset here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#L665-L667. Because we don't use a torch DistributedSampler, I don't believe it applies to us here.

If someone later points to a reproducibility issue caused by this, I'll review it again, but in my deep dive last week I didn't find one.
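
For reference, a minimal sketch of the generator-seeding pattern under discussion (Args is a hypothetical stand-in for TrainingArguments, not the real class):

```python
# Minimal sketch of seeding a sampler generator from data_seed, falling
# back to the global seed. Args is a made-up stand-in for TrainingArguments.
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

class Args:
    seed = 42         # global training seed
    data_seed = None  # optional separate seed for data sampling

args = Args()
generator = torch.Generator()
generator.manual_seed(args.data_seed if args.data_seed is not None else args.seed)

dataset = TensorDataset(torch.randn(16, 3))
sampler = RandomSampler(dataset, generator=generator)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=4)
```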

```diff
@@ -3205,72 +3083,42 @@ def evaluation_loop(
             # Update containers on host
             if loss is not None:
-                losses = self._nested_gather(loss.repeat(batch_size))
-                losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0)
+                losses = self.accelerator.gather_for_metrics((loss.repeat(batch_size)))
```
@pacman100 (Contributor):

Nice!
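
For illustration, a minimal runnable sketch of what gather_for_metrics does (toy single-process setup; the dataset and sizes are made up):

```python
# Minimal sketch of gather_for_metrics. In multi-process runs Accelerate
# pads the ragged last batch with duplicated samples; gather_for_metrics
# gathers across processes and strips those duplicates, so metrics see
# each example exactly once.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
dataset = TensorDataset(torch.arange(10, dtype=torch.float32))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=4))

gathered = []
for (batch,) in dataloader:
    gathered.append(accelerator.gather_for_metrics(batch))
print(torch.cat(gathered))  # the 10 original values, each exactly once
```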

An additional review comment on src/transformers/trainer.py was marked as resolved (outdated).
@muellerzr muellerzr requested a review from pacman100 June 6, 2023 15:04
@pacman100 (Contributor) left a comment:

Thank you for your responses to the comments and for iterating on the PR. Super cool!

@muellerzr muellerzr merged commit ebd94b0 into main Jun 12, 2023
@muellerzr muellerzr deleted the muellerzr-accelerate-dataloaders-v2 branch June 12, 2023 15:23
@muellerzr mentioned this pull request on Jun 12, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request on Jun 23, 2023:

🚨🚨🚨 Replace DataLoader logic for Accelerate in Trainer, remove unneeded tests 🚨🚨🚨 (huggingface#24028)

* Working integration

* Fix failing test

* Revert label host logic

* Bring it back!
@franz101 commented:

Great PR! This is currently breaking my custom collate_fn in the dataloader, and I'm still trying to understand why. My first assumption is that it might be due to multiprocessing?

@muellerzr (Contributor, Author):
@franz101 please open an issue with a reproducer of what you are trying to do so we can help :)
