Fix LigerCrossEntropyLoss Reduction Behavior for "None" Mode #435
Summary
Closes #421
This pull request addresses an issue in the `cross_entropy_forward` function where the `reduction="none"` mode did not behave as expected. Previously, the function always returned a single scalar value, even when `reduction="none"` was specified. This update ensures that when `reduction="none"` is used, the function directly outputs the unreduced loss array (`loss_1d`) instead of summing it.
Changes Made:
reduction="none"
, ensuring the function outputs loss_1d directly.reduction="none"
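As a rough sketch of the intended post-fix behavior (assuming `LigerCrossEntropyLoss` is importable from `liger_kernel.transformers` and accepts a `reduction` argument; adjust to the actual API):

```python
import torch
from liger_kernel.transformers import LigerCrossEntropyLoss  # assumed import path

# Liger kernels run on GPU; these are toy shapes, not taken from the PR.
logits = torch.randn(8, 32000, device="cuda", requires_grad=True)
targets = torch.randint(0, 32000, (8,), device="cuda")

loss_fn = LigerCrossEntropyLoss(reduction="none")
loss = loss_fn(logits, targets)

# Before this fix the result was a single scalar; with reduction="none"
# it should now be the unreduced per-token loss vector (loss_1d).
print(loss.shape)  # expected: torch.Size([8])
```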
Why do we pass `gradient` to `output.backward()`?

Background on Gradients in PyTorch

In PyTorch, calling `.backward()` on a scalar tensor needs no extra input: the implicit upstream gradient is simply 1. Calling it on a non-scalar tensor requires an explicit `gradient` argument that supplies the upstream gradient for each element.

Why `reduction="none"` Needs Explicit Gradients
When `reduction="none"` is used, the loss function does not reduce the per-example loss values into a single scalar. Instead, it outputs a vector of losses, one value per example in the batch. Because the loss tensor holds multiple values, PyTorch cannot infer the gradient for each of them unless it is provided explicitly, as the sketch below demonstrates.
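For background, a minimal self-contained illustration with plain `torch.nn.functional.cross_entropy` (not the Liger kernel) of why a non-scalar loss cannot call `backward()` without a gradient:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 10, (4,))

# reduction="none" yields one loss per example: shape (4,), not a scalar.
loss = F.cross_entropy(logits, targets, reduction="none")

try:
    loss.backward()  # no gradient argument on a non-scalar tensor
except RuntimeError as err:
    # PyTorch raises: "grad can be implicitly created only for scalar outputs"
    print(err)
```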
The Fix

By passing `gradient=torch.ones_like(loss)` to `backward()`, `torch.ones_like(loss)` serves as the upstream gradient tensor: it specifies that each element in the loss tensor contributes equally to the gradients during backpropagation.
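A minimal sketch of this pattern, again with plain PyTorch ops rather than the Liger kernel:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 10, (4,))
loss = F.cross_entropy(logits, targets, reduction="none")  # shape (4,)

# An all-ones upstream gradient weights every per-example loss equally,
# so the resulting logits.grad matches what reduction="sum" would give.
loss.backward(gradient=torch.ones_like(loss))
print(logits.grad.shape)  # torch.Size([4, 10])
```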
Testing Done

`make test`
`pytest /home/jobuser/Liger-Kernel/test/transformers/test_cross_entropy.py` shows:

- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence