System Info
transformers 4.46.1

Who can help?
@muellerzr

Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
The GA loss fix corrects the averaging of the loss across multiple gradient accumulation steps. Will the result differ between per_device_train_batch_size=2 on 1 GPU and per_device_train_batch_size=1 on 2 GPUs?
I looked at the code for ForCausalLMLoss:
import torch.nn as nn

def fixed_cross_entropy(source, target, num_items_in_batch: int = None, ignore_index: int = -100, **kwargs):
    reduction = "sum" if num_items_in_batch is not None else "mean"
    loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
    if reduction == "sum":
        loss = loss / num_items_in_batch
    return loss

def ForCausalLMLoss(
    logits, labels, vocab_size: int, num_items_in_batch: int = None, ignore_index: int = -100, **kwargs
):
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    # Flatten the tokens
    shift_logits = shift_logits.view(-1, vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
    return loss
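For context, here is a minimal usage sketch (the shapes and the num_items_in_batch value are made up for illustration) of how the num_items_in_batch argument changes the reduction from a plain per-device mean to a sum divided by a globally counted number of tokens:

import torch

# Hypothetical toy inputs: batch of 2 sequences, 5 positions, vocab of 10.
logits = torch.randn(2, 5, 10)
labels = torch.randint(0, 10, (2, 5))

# Without num_items_in_batch: reduction="mean" over the non-ignored tokens
# that this device sees in this micro-batch.
loss_local_mean = ForCausalLMLoss(logits, labels, vocab_size=10)

# With num_items_in_batch (the GA fix): per-token losses are summed and then
# divided by the token count accumulated over all micro-batches/devices.
# Here 8 happens to equal the local shifted-token count, so the two values
# coincide; with real gradient accumulation the count spans all micro-batches.
loss_global_mean = ForCausalLMLoss(logits, labels, vocab_size=10, num_items_in_batch=8)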
With per_device_train_batch_size=2 on 1 GPU, shift_logits.view(-1, vocab_size) flattens the tokens of both sequences together, so the cross-entropy is averaged over all tokens of both sequences. With per_device_train_batch_size=1 on 2 GPUs, each device averages the loss over its own sequence, and the per-device means are then averaged. When the number of loss-contributing tokens per sequence differs greatly, the two settings can give very different results.
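A small numeric sketch (the per-token losses are made up for illustration) of why the two settings diverge:

import torch

# Hypothetical per-token losses for two sequences of very different lengths:
# sequence A has 2 valid label tokens, sequence B has 8.
seq_a = torch.tensor([4.0, 4.0])                  # per-sequence mean 4.0
seq_b = torch.tensor([1.0] * 8)                   # per-sequence mean 1.0

# per_device_train_batch_size=2 on 1 GPU: all tokens flattened together,
# so the loss is a token-level average over both sequences.
loss_one_gpu = torch.cat([seq_a, seq_b]).mean()   # (8 + 8) / 10 = 1.6

# per_device_train_batch_size=1 on 2 GPUs: each device averages its own
# sequence, then the per-device losses are averaged.
loss_two_gpus = (seq_a.mean() + seq_b.mean()) / 2  # (4.0 + 1.0) / 2 = 2.5

print(loss_one_gpu.item(), loss_two_gpus.item())   # 1.6 vs 2.5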
Expected behavior
For a given global batch size, is the loss averaged over all tokens, or averaged per sequence and then over all batches?
Looking forward to your reply.
glowwormX changed the title from "Does per_device_train_batch_size have similar GA loss fixed?" to "Does per_device_train_batch_size have a loss error similar to that of GA?" on Nov 2, 2024
You can refer to #34242. The bug has already been fixed in #34373.
@techkang When should users turn on average_tokens_across_devices? I think the modification is not user friendly. And for a given global batch size, is the loss averaged over all tokens, or averaged per sequence and then over all batches? If I want to average per sequence, that still doesn't seem to be possible.
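For what it's worth, the "average per sequence, then per batch" behavior asked about here could be sketched as a custom loss along these lines (a hypothetical illustration, not an existing transformers option; per_sequence_causal_lm_loss is a made-up name):

import torch
import torch.nn as nn

def per_sequence_causal_lm_loss(logits, labels, ignore_index: int = -100):
    # Shift so that tokens < n predict n, as in ForCausalLMLoss above.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    # Per-token losses, kept in (batch, seq_len) shape.
    token_loss = nn.functional.cross_entropy(
        shift_logits.transpose(1, 2),  # cross_entropy expects (N, C, ...)
        shift_labels,
        ignore_index=ignore_index,
        reduction="none",
    )

    # Average over the valid tokens of each sequence, then over sequences.
    valid = (shift_labels != ignore_index).float()
    per_seq = (token_loss * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1)
    return per_seq.mean()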
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.