Using `self.all_gather` in `training_step` to gather a tensor with a gradient function and compute the loss throws `RuntimeError: function AllGatherGradBackward returned an incorrect number of gradients (expected 2, got 1)`
🐛 Bug
When using `self.all_gather` in `training_step` to gather a tensor that carries a gradient function, and then computing and returning the loss from it, the call throws `RuntimeError: function AllGatherGradBackward returned an incorrect number of gradients (expected 2, got 1)`.
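For reference, here is a minimal sketch of the kind of `training_step` that hits this path (not from the original report; the module, layer sizes, and loss are hypothetical, and I am assuming `sync_grads=True` is what routes the gather through `AllGatherGrad`):

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 8)

    def training_step(self, batch, batch_idx):
        out = self.layer(batch)  # tensor with a grad_fn
        # Gathering with gradients goes through AllGatherGrad, so
        # backward() later invokes AllGatherGradBackward.
        gathered = self.all_gather(out, sync_grads=True)
        return gathered.pow(2).mean()  # loss computed from the gathered tensor

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```

Running this under a DDP strategy on more than one process raises the error above as soon as backward runs on the returned loss.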
I think the bug is that `forward(ctx, tensor, group=group.WORLD)` in the `distributed.AllGatherGrad` function takes two arguments, but `backward(ctx, *grad_output)` returns only one output. The error can be fixed by changing `return grad_output[torch.distributed.get_rank()]` to `return grad_output[torch.distributed.get_rank()], None`.
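This matches the general `torch.autograd.Function` contract: `backward` must return one value per argument of `forward`, with `None` for non-tensor arguments such as the process group. A toy example (my own illustration, not Lightning code; `ScaleBy` and `factor` are made up) showing the same contract:

```python
import torch

class ScaleBy(torch.autograd.Function):
    """Toy function whose forward, like AllGatherGrad's, takes a tensor
    plus one non-tensor argument."""

    @staticmethod
    def forward(ctx, tensor, factor):
        ctx.factor = factor
        return tensor * factor

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward argument: a gradient for `tensor`
        # and None for the non-tensor `factor`. Dropping the trailing None
        # reproduces the same class of error:
        # "returned an incorrect number of gradients (expected 2, got 1)".
        return grad_output * ctx.factor, None

x = torch.ones(3, requires_grad=True)
ScaleBy.apply(x, 2.0).sum().backward()
print(x.grad)  # tensor([2., 2., 2.])
```

The proposed fix applies the same rule: `AllGatherGrad.forward` takes `tensor` and `group`, so its `backward` must return a gradient for `tensor` and `None` for `group`.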
Environment