Add DDP token averaging for equivalent non-parallel training similar to #34191 #34242

Closed
sbwww opened this issue Oct 18, 2024 · 11 comments · Fixed by #34283 or #34373
Labels
Discussion · Feature request

Comments

@sbwww

sbwww commented Oct 18, 2024

Feature request

Token averaging in gradient accumulation was fixed in #34191, but token averaging in DDP seems to have the same issue.


Expected behavior

Counting all the tokens that contribute to the loss in one optimizer step (across every GPU, gradient accumulation step, and microbatch), the total is:

$$ntokens=\sum\limits_{GPUs} \sum\limits_{gas} \sum\limits_{microb} (label\neq-100)$$

I believe the loss should be averaged over this global token count in a single step so that DDP training matches non-parallel training.
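
In other words, writing $\ell_t$ for the unreduced per-token loss (my notation), the target objective would be:

$$loss=\frac{1}{ntokens}\sum\limits_{GPUs} \sum\limits_{gas} \sum\limits_{microb} \sum\limits_{label\neq-100} \ell_t$$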


Current issue

Prior to #34191, the loss/gradients were averaged over $\sum\limits_{GPUs}$, $\sum\limits_{gas}$, and $\sum\limits_{microb}$ separately. The num_items_in_batch introduced in #34191 corresponds to:

$$ntokens=\sum\limits_{gas} \sum\limits_{microb} (label\neq-100)$$

So the loss/gradients are now averaged over $\sum\limits_{GPUs}$ and $\left(\sum\limits_{gas}\sum\limits_{microb}\right)$ separately. However, this is still not equivalent to non-parallel training.
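
For a concrete (hypothetical) example: with 2 GPUs, where GPU 0 sees 100 valid tokens with summed loss $S_0$ and GPU 1 sees 300 valid tokens with summed loss $S_1$, the current scheme effectively optimizes $\frac{1}{2}\left(\frac{S_0}{100}+\frac{S_1}{300}\right)$, whereas non-parallel training over the same 400 tokens optimizes $\frac{S_0+S_1}{400}$. The two only coincide when every GPU happens to see the same number of valid tokens.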

Can we also incorporate $\sum\limits_{GPUs}$ when determining num_items_in_batch? Something like all_reduce(num_items_in_batch)?

Motivation

DDP training still does not seem fully equivalent to non-parallel training.

Related comment: #34191 (comment)

Your contribution

I found a fairseq implementation of this feature:

https://github.com/facebookresearch/fairseq/blob/018621f3cca02ca9de945dc082c3fb1a7f9f2deb/fairseq/trainer.py#L932-L949
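
Roughly, the gist as I read it (a paraphrased sketch with illustrative names, not fairseq's actual API): each rank backpropagates an unnormalized summed loss, the per-rank token count is all-reduced, and the gradients are rescaled by world_size / global token count, which cancels DDP's gradient averaging and leaves a plain average over all tokens.

import torch
import torch.distributed as dist

def rescale_grads(model: torch.nn.Module, local_num_tokens: int) -> None:
    # Sketch: sum the per-rank valid-token counts to get the global count.
    num_tokens = torch.tensor(float(local_num_tokens), device=next(model.parameters()).device)
    dist.all_reduce(num_tokens, op=dist.ReduceOp.SUM)
    # DDP already divided the gradients by world_size; multiplying by
    # world_size / num_tokens leaves an overall division by the global token count.
    scale = dist.get_world_size() / num_tokens
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)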

sbwww added the Feature request label on Oct 18, 2024
@muellerzr
Contributor

muellerzr commented Oct 18, 2024

I observed this as well when I was running some experiments (things were close post-fix, but not exact). Would you like to take a stab at a PR? :)

LysandreJik added the Discussion label on Oct 21, 2024
@techkang
Contributor

A simple implementation might be (a rough sketch follows the list):

  1. add all_reduce(num_items_in_batch, op=SUM) after: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2416
  2. add loss *= get_world_size() after: https://github.com/huggingface/transformers/blob/main/src/transformers/loss/loss_utils.py#L26
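
In code, the two steps might look roughly like this (a standalone sketch, not the actual trainer.py / loss_utils.py changes; function names are just for illustration):

import torch
import torch.distributed as dist

def sync_num_items_in_batch(num_items_in_batch: int) -> int:
    # Step 1: sum the per-rank token counts so every rank uses the global count.
    count = torch.tensor(num_items_in_batch, device="cuda")
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return int(count.item())

def normalize_loss(summed_loss: torch.Tensor, global_num_items: int) -> torch.Tensor:
    # Step 2: divide by the global token count, then multiply by the world size,
    # because DDP's gradient averaging will divide by the world size again.
    return summed_loss / global_num_items * dist.get_world_size()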

@TechxGenus
Contributor

TechxGenus commented Oct 22, 2024

Although this issue has little impact on training results, it significantly affects the reproducibility of experiments across different hardware configurations. I hope it can be resolved alongside the gradient accumulation fix.

I attempted to use all-reduce during training, but it slowed down the process. Is it possible to calculate the total number of tokens per batch across devices when initializing the DataLoader with accelerate (without compromising compatibility with the existing code)?

@muellerzr
Contributor

muellerzr commented Oct 22, 2024

That is the issue with it, and why I'm not the biggest fan of that particular solution.

We can't, because there are situations like IterableDatasets where that just isn't possible.

The fairseq solution may be the way to go.

@muellerzr
Contributor

[W&B loss chart, 10/22/2024]

Can confirm the fairseq solution works great; it'll be part of #34283.

@muellerzr
Contributor

muellerzr commented Oct 22, 2024

However, this does not make much of an impact as we scale (with either the current fix or these changes):
[chart]

This might be problem-specific; however, I did find the fix helped a little.

@muellerzr
Contributor

I'll leave this open for now. I didn't see significant discrepancies between DDP and non-parallel training, but if users have examples or can show where it goes wrong, please post them here for us to dig into.

@techkang
Contributor

I tested multi-GPU performance with and without all_reduce.
First, with num_items_in_batch = all_reduce(num_items_in_batch), all loss curves matched exactly:
[loss curve chart]
Then, compared against the version without all_reduce, the loss curves did not match:
[loss curve chart]
However, the loss value was only slightly larger at the beginning, so introducing the DDP all_reduce sync may mainly improve reproducibility.

Here is my code if anyone needs it:

import torch
import torch.distributed as dist
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        is_ddp = dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1
        if is_ddp and num_items_in_batch is not None:
            # Sum the valid-token counts over all ranks so every rank divides by the same global denominator.
            num_items = torch.tensor([num_items_in_batch], device=self.args.device)
            dist.all_reduce(num_items, op=dist.ReduceOp.SUM)
            num_items_in_batch = int(num_items.item())

        result = super().compute_loss(model, inputs, return_outputs=return_outputs, num_items_in_batch=num_items_in_batch)
        if is_ddp and num_items_in_batch is not None:
            # DDP averages gradients across ranks (divides by world size), so scale the loss
            # back up to recover a plain average over the global token count.
            if return_outputs:
                loss, outputs = result
                return loss * dist.get_world_size(), outputs
            return result * dist.get_world_size()
        return result

@muellerzr
Contributor

What we can do then is add it to TrainingArguments under a flag that is disabled by default (average_tokens_across_devices). @techkang, want to take a stab at a PR?
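
If it lands that way, usage would presumably look something like this (assuming the flag keeps the name average_tokens_across_devices and defaults to False):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    average_tokens_across_devices=True,  # all-reduce num_items_in_batch across DDP ranks
)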

@techkang
Contributor

Thanks, I have already created the pull request and tested it on my machine:
[chart]

@hedes1992

A simple implementation might be:

  1. add all_reduce(num_items_in_batch, op=SUM) after: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2416
  2. add loss *= get_world_size() after: https://github.com/huggingface/transformers/blob/main/src/transformers/loss/loss_utils.py#L26

Awesome, thanks!
