Update no_trainer scripts to include gradient accumulation #18436
Comments
Hi @muellerzr I took a go at this (accelerate seems awesome!), and implemented the changes quickly. However, I noticed some performance degradation when using the gradient accumulation wrapper. After some debugging, I think it stems from the lr_scheduler implementation in accelerate updating the learning rate at every step in the training loop, whereas the example script updates the learning rate every optimizer step. So I think either accelerate needs to add something like

```python
# Otherwise, first make sure the optimizer was stepped.
for opt in self.optimizers:
    if opt.step_was_skipped or not opt.gradient_state.sync_gradients:
        return
```

to the scheduler.py implementation at line 59, or the script should have

```python
if accelerator.sync_gradients:
    lr_scheduler.step()
```

I think this should be changed in accelerate. Let me know what you think or if I'm totally off! I'll be happy to do issue + PR to fix in accelerate, and I'll definitely fix the example scripts in transformers. :)
No, we can't do this, as then the user would have to know in advance the number of optimization steps when they create their scheduler (which they don't, since Accelerate handles gradient accumulation behind the scenes). That's why the learning rate scheduler should be created with the full number of training batches prior to gradient accumulation, then stepped at each batch (which is roughly equivalent to creating it with the right number of optimization steps and stepping it at every optimization step).
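For context, a minimal sketch (not code from the thread; the model, optimizer, and step counts below are dummies) contrasting the two setups being described: sizing the schedule in batches and stepping it every batch, versus sizing it in optimizer steps and stepping it only on synced steps.

```python
# Minimal sketch with a dummy model/optimizer and made-up step counts; it only
# illustrates the two (roughly equivalent) ways of sizing a schedule when
# gradient accumulation is in play.
import torch
from transformers import get_scheduler

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

num_epochs, batches_per_epoch, grad_accum = 3, 1000, 4
total_batches = num_epochs * batches_per_epoch

# Option A (what the no_trainer scripts aim for): size the schedule in batches
# and call scheduler.step() on every batch; Accelerate's accumulation stays
# transparent and grad_accum isn't needed when the schedule is built.
scheduler_per_batch = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=100 * grad_accum,    # warmup counted in batches
    num_training_steps=total_batches,
)

# Option B: size the schedule in optimizer steps and call scheduler.step()
# only when accelerator.sync_gradients is True (requires knowing grad_accum).
scheduler_per_update = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=100,                 # warmup counted in optimizer steps
    num_training_steps=total_batches // grad_accum,
)
```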
@sgugger Cool! So if I understand your comment, I've made a WIP pull request for the image classification script examples/pytorch/image-classification/run_image_classification_no_trainer.py (I'll update the rest of the scripts once I'm certain it's the correct approach). The script currently has:

```python
if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
    progress_bar.update(1)
    completed_steps += 1
```

So to keep the functionality, we need to know whether an optimization step occurred here, which I think we can do with:

```python
if accelerator.sync_gradients:
    progress_bar.update(1)
    completed_steps += 1
```

But is this also something that should be abstracted away, i.e. should the logic change a bit so that completed_steps counts completed batches instead of optimization steps?
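For reference, here is a rough sketch of what the discussed integration looks like in a no_trainer-style loop. It is not the exact merged code: `args`, the model, optimizer, dataloader, and scheduler are assumed to exist already, and the variable names simply mirror the scripts.

```python
# Rough sketch, not the exact script code. accumulate() skips the real
# optimizer step until enough batches have been seen, and sync_gradients
# tells us when an optimization step actually happened.
from accelerate import Accelerator
from tqdm.auto import tqdm

accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
completed_steps = 0

for epoch in range(args.num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        with accelerator.accumulate(model):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # Only count an optimization step when gradients were actually synced.
        if accelerator.sync_gradients:
            progress_bar.update(1)
            completed_steps += 1

        if completed_steps >= args.max_train_steps:
            break
```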
It's going to be easier to just have the progress bar display the number of completed steps. Also, we should multiply
I think either option would work fine as well. The reason behind My $0.02 is to either explain in a comment what
Hi @muellerzr, I opened PR #18601 for the second example in the list.
Hi @muellerzr, I opened a PR for the 8th example on the list. Please let me know if something is wrong. (This is my first contribution ever.)
Hi @muellerzr!
Hi, I believe there is an issue with this PR (Rasmusafj:issue_18436), particularly for run_mlm_no_trainer.py. I am running BERT pretraining with this script with the following arguments on 8 GPUs: When tracking the learning rate, it peaks at step 2500. I saw it suggested above by @Rasmusafj that we only step the learning rate when sync_gradients is true, which I believe would solve this issue for me and bring about the expected behavior. I saw @sgugger recommended against this, however. I am tracking the learning rate by printing
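One possible explanation for an early peak (the numbers below are purely hypothetical, since the actual training arguments are not reproduced above): if warmup is specified in optimizer steps but the scheduler is stepped once per batch, warmup finishes gradient_accumulation_steps times too early.

```python
# Hypothetical numbers only; the arguments from this report are not shown in
# the thread. Illustrates how per-batch scheduler stepping shortens a warmup
# that was specified in optimizer steps.
num_warmup_steps = 10_000            # intended as optimizer steps
gradient_accumulation_steps = 4

batches_until_peak = num_warmup_steps                  # scheduler stepped per batch
optimizer_steps_until_peak = batches_until_peak // gradient_accumulation_steps
print(optimizer_steps_until_peak)                      # -> 2500, not 10000
```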
cc @muellerzr so it's on your radar. It's true that when we use a number of steps instead of a number of epochs for a given training, the logic we have for the scheduler fails.
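Roughly, the steps-vs-epochs bookkeeping being referred to looks like the sketch below (dummy values; the exact lines vary per script). The failure mode discussed in this thread is, roughly, that max_train_steps counts optimizer updates while the schedule ends up being created and stepped per batch.

```python
# Sketch with dummy values; the real scripts read these from argparse.
# max_train_steps and num_train_epochs are derived from each other in terms
# of optimizer updates, not batches.
import math

batches_per_epoch = 1000        # len(train_dataloader)
gradient_accumulation_steps = 4
num_train_epochs = 3
max_train_steps = None          # or a user-supplied number of optimizer steps

num_update_steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
if max_train_steps is None:
    max_train_steps = num_train_epochs * num_update_steps_per_epoch
else:
    num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)

print(max_train_steps, num_train_epochs)   # 750 optimizer steps over 3 epochs
```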
I'm running into the same problem as @sameerreddy13.
Maybe we should make it clear what does
It should always be one gradient update step, because that is the common assumption in the literature, as it is tied to the learning rate scheduler. In practice, if we have batch size K and grad accum A, we report the effective batch size as K * A. To fully fix this issue I did the following:
It's been a while since I made this change, but I manually used
@sameerreddy13, I agree with you. I also wrote a snippet about this at huggingface/accelerate#1382 (comment) with two different points:
Sorry, any update or final answer here?
Is this issue still open, @muellerzr? I'm looking for a good first issue to tackle as a first-time contributor.
Feature request
🤗 Accelerate has a gradient accumulation wrapper, and the `no_trainer` scripts should be updated to include it! An example can be seen here; below is an example diff of what the integration would look like:
As well as:
The list of available scripts to update includes:
Motivation
This is a great first issue for someone who wants to learn how to use some of the latest bits in Accelerate and get an easy beginner contribution to the library 🤗
Your contribution
If you decide to pick up this issue, feel free to ping myself (@muellerzr), @sgugger, or @pacman100 to review 🤗