Does per_device_train_batch_size have a loss error similar to that of GA? #34579

Closed
glowwormX opened this issue Nov 2, 2024 · 3 comments

@glowwormX

System Info

transformers 4.46.1

Who can help?

@muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The GA loss fix corrects how the loss is averaged across multiple accumulation steps. If per_device_train_batch_size is set to 2 on 1 GPU, versus set to 1 on 2 GPUs, will the loss be different?

I looked at the code for ForCausalLMLoss:

from torch import nn


def fixed_cross_entropy(source, target, num_items_in_batch: int = None, ignore_index: int = -100, **kwargs):
    reduction = "sum" if num_items_in_batch is not None else "mean"
    loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
    if reduction == "sum":
        loss = loss / num_items_in_batch
    return loss


def ForCausalLMLoss(
    logits, labels, vocab_size: int, num_items_in_batch: int = None, ignore_index: int = -100, **kwargs
):
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    # Flatten the tokens
    shift_logits = shift_logits.view(-1, vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
    return loss
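
For illustration, a minimal sketch (toy tensors, values made up) of what the two reduction modes in fixed_cross_entropy compute:

import torch
from torch import nn

# Toy example: 6 flattened token positions, vocab size 5; -100 marks ignored positions
logits = torch.randn(6, 5)
labels = torch.tensor([1, 2, -100, 3, -100, -100])

# reduction="mean": average over the 3 valid tokens of this micro-batch only
local_mean = nn.functional.cross_entropy(logits, labels, ignore_index=-100, reduction="mean")

# reduction="sum" divided by num_items_in_batch (what fixed_cross_entropy does when the
# count is passed); normally that count is summed over all accumulated micro-batches
num_items_in_batch = 3
global_mean = nn.functional.cross_entropy(logits, labels, ignore_index=-100, reduction="sum") / num_items_in_batch

print(local_mean.item(), global_mean.item())  # identical here; they diverge once the count
# spans more than one micro-batch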

With per_device_train_batch_size=2, because of shift_logits.view(-1, vocab_size) the tokens of both sequences are flattened together and the loss is averaged over all of them at once. With per_device_train_batch_size=1 on 2 GPUs, each device averages the loss over its own batch, and those per-device averages are then averaged again. When the number of valid loss tokens per sequence differs greatly, the two results can differ greatly, as in the toy calculation below.
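
A toy calculation with made-up per-token losses shows how large the gap can get:

# Hypothetical per-token losses for two sequences of very different lengths
seq_a = [2.0] * 10   # 10 valid (non-ignored) tokens
seq_b = [4.0] * 2    # 2 valid tokens

# 1 GPU, per_device_train_batch_size=2: a single mean over all 12 tokens
token_mean = (sum(seq_a) + sum(seq_b)) / (len(seq_a) + len(seq_b))   # ~2.33

# 2 GPUs, per_device_train_batch_size=1 (no token averaging across devices):
# each device averages its own sequence, then the two scalars are averaged
per_device = [sum(seq_a) / len(seq_a), sum(seq_b) / len(seq_b)]      # [2.0, 4.0]
device_mean = sum(per_device) / len(per_device)                      # 3.0

print(token_mean, device_mean)  # ~2.33 vs 3.0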

Expected behavior

Under a given global batch size, is the loss averaged over all tokens, or averaged within each sequence first and then across all batches?
Looking forward to your reply.

glowwormX added the bug label Nov 2, 2024
glowwormX changed the title from "Does per_device_train_batch_size have similar GA loss fixed?" to "Does per_device_train_batch_size have a loss error similar to that of GA?" Nov 2, 2024
@techkang (Contributor) commented Nov 4, 2024

You can refer to #34242. The bug has already been fixed in #34373.

@glowwormX (Author) commented:

> You can refer to #34242. The bug has already been fixed in #34373.

@techkang When should users turn on average_tokens_across_devices? I think this change is not user friendly. And under a given global batch size, is the loss averaged over all tokens, or averaged within each sequence first and then across batches? If I want to average per sequence, that still doesn't seem possible.
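
For reference, my understanding is that this flag is set through TrainingArguments; a minimal sketch (model and train_dataset are placeholders, not from this issue):

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    # average the loss over the token count summed across all devices rather than
    # letting each device average over only its own tokens
    average_tokens_across_devices=True,
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()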

github-actions bot commented Dec 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
