Update no_trainer scripts to include gradient accumulation #18436
Comments
Hi @muellerzr I took a go at this (accelerate seems awesome!), and implemented the changes quickly. However, I noticed some performance degradation when using the gradient accumulation wrapper. After some debugging, I think it stems from the lr_scheduler implementation in accelerate updating the learning rate at every step in the training loop, whereas the example script updates the learning rate every optimizer step. So I think either accelerate needs to add something like

```python
# Otherwise, first make sure the optimizer was stepped.
for opt in self.optimizers:
    if opt.step_was_skipped or not opt.gradient_state.sync_gradients:
        return
```

to the scheduler.py implementation at line 59, or the script should have

```python
if accelerator.sync_gradients:
    lr_scheduler.step()
```

I think this should be changed in accelerate. Let me know what you think or if I'm totally off! I'll be happy to do issue + PR to fix in accelerate, and I'll definitely fix the example scripts in transformers. :)
No, we can't do this, as then the user would have to know in advance the number of optimization steps when they create their scheduler (which they don't, since Accelerate handles gradient accumulation behind the scenes). That's why the learning rate scheduler should be created with the full number of training batches prior to gradient accumulation, then stepped at each batch (which is roughly equivalent to creating it with the right number of optimization steps and stepping it at every optimization step).
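For context, a minimal sketch (not code from the thread; the model, optimizer, and step counts below are dummies) contrasting the two setups being described: sizing the schedule in batches and stepping it every batch, versus sizing it in optimizer steps and stepping it only on synced steps.

```python
# Minimal sketch with a dummy model/optimizer and made-up step counts; it only
# illustrates the two (roughly equivalent) ways of sizing a schedule when
# gradient accumulation is in play.
import torch
from transformers import get_scheduler

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

num_epochs, batches_per_epoch, grad_accum = 3, 1000, 4
total_batches = num_epochs * batches_per_epoch

# Option A (what the no_trainer scripts aim for): size the schedule in batches
# and call scheduler.step() on every batch; Accelerate's accumulation stays
# transparent and grad_accum isn't needed when the schedule is built.
scheduler_per_batch = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=100 * grad_accum,    # warmup counted in batches
    num_training_steps=total_batches,
)

# Option B: size the schedule in optimizer steps and call scheduler.step()
# only when accelerator.sync_gradients is True (requires knowing grad_accum).
scheduler_per_update = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=100,                 # warmup counted in optimizer steps
    num_training_steps=total_batches // grad_accum,
)
```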
@sgugger Cool! So if I understand your comment, I've made a WIP pull request for the image classification script examples/pytorch/image-classification/run_image_classification_no_trainer.py (I'll update the rest of the scripts once I'm certain it's the correct approach). The script currently has:

```python
if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
    progress_bar.update(1)
    completed_steps += 1
```

So to keep the functionality, we need to know whether an optimization step occurred here, which I think we can do with:

```python
if accelerator.sync_gradients:
    progress_bar.update(1)
    completed_steps += 1
```

But is this also something that should be abstracted away, i.e. should the logic change a bit so that completed_steps counts completed batches instead of optimization steps?
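For reference, here is a rough sketch of what the discussed integration looks like in a no_trainer-style loop. It is not the exact merged code: `args`, the model, optimizer, dataloader, and scheduler are assumed to exist already, and the variable names simply mirror the scripts.

```python
# Rough sketch, not the exact script code. accumulate() skips the real
# optimizer step until enough batches have been seen, and sync_gradients
# tells us when an optimization step actually happened.
from accelerate import Accelerator
from tqdm.auto import tqdm

accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
completed_steps = 0

for epoch in range(args.num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        with accelerator.accumulate(model):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # Only count an optimization step when gradients were actually synced.
        if accelerator.sync_gradients:
            progress_bar.update(1)
            completed_steps += 1

        if completed_steps >= args.max_train_steps:
            break
```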
It's going to be easier to just have the progress bar display the number of completed steps. Also, we should multiply
I think either option would work fine as well. The reason behind My $0.02 is to either explain in a comment what
Hi @muellerzr, I opened PR #18601 for the second example in the list.
Hi @muellerzr, I opened a PR for the 8th example on the list. Please let me know if something is wrong. (This is my first contribution ever.)
Hi @muellerzr!
Hi, I believe there is an issue with this PR (Rasmusafj:issue_18436), particularly for run_mlm_no_trainer.py. I am running BERT pretraining with this script with the following arguments on 8 GPUs: When tracking the learning rate, it peaks at step 2500. I saw it suggested above by @Rasmusafj that we only step the learning rate when sync_gradients is true, which I believe would solve this issue for me and bring about the expected behavior. I saw @sgugger recommended against this, however. I am tracking the learning rate by printing
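One possible explanation for an early peak (the numbers below are purely hypothetical, since the actual training arguments are not reproduced above): if warmup is specified in optimizer steps but the scheduler is stepped once per batch, warmup finishes gradient_accumulation_steps times too early.

```python
# Hypothetical numbers only; the arguments from this report are not shown in
# the thread. Illustrates how per-batch scheduler stepping shortens a warmup
# that was specified in optimizer steps.
num_warmup_steps = 10_000            # intended as optimizer steps
gradient_accumulation_steps = 4

batches_until_peak = num_warmup_steps                  # scheduler stepped per batch
optimizer_steps_until_peak = batches_until_peak // gradient_accumulation_steps
print(optimizer_steps_until_peak)                      # -> 2500, not 10000
```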
cc @muellerzr so it's on your radar. It's true that when we use a number of steps instead of a number of epochs for a given training, the logic we have for the scheduler fails.
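Roughly, the steps-vs-epochs bookkeeping being referred to looks like the sketch below (dummy values; the exact lines vary per script). The failure mode discussed in this thread is, roughly, that max_train_steps counts optimizer updates while the schedule ends up being created and stepped per batch.

```python
# Sketch with dummy values; the real scripts read these from argparse.
# max_train_steps and num_train_epochs are derived from each other in terms
# of optimizer updates, not batches.
import math

batches_per_epoch = 1000        # len(train_dataloader)
gradient_accumulation_steps = 4
num_train_epochs = 3
max_train_steps = None          # or a user-supplied number of optimizer steps

num_update_steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
if max_train_steps is None:
    max_train_steps = num_train_epochs * num_update_steps_per_epoch
else:
    num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)

print(max_train_steps, num_train_epochs)   # 750 optimizer steps over 3 epochs
```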
I'm running into the same problem as @sameerreddy13.
Maybe we should make it clear what does
It should always be one gradient update step, because that is the common assumption in the literature, as it is tied to the learning rate scheduler. In practice, if we have batch size K and grad accum A, we report the effective batch size as K * A. To fully fix this issue I did the following:
It's been a while since I made this change, but I manually used
@sameerreddy13, I agree with you. I also wrote a snippet about this at huggingface/accelerate#1382 (comment) with two different points:
Sorry, any update or final answer here?
Is this issue still open, @muellerzr? I'm looking for a good first issue to tackle as a first-time contributor.
Feature request
🤗 Accelerate has a gradient accumulation wrapper, and the `no_trainer` scripts should be updated to include it! An example can be seen here; below is an example diff of what the integration would look like:
As well as:
The list of available scripts to update includes:
Motivation
This is a great first issue for someone who wants to learn how to use some of the latest bits in Accelerate and get an easy beginner contribution to the library 🤗
Your contribution
If you decide to pick up this issue, feel free to ping myself (@muellerzr), @sgugger, or @pacman100 to review 🤗