```python
losses = [lossLCCb, lossRCCb, lossLCCca, lossRCCca,
          lossLMLOb, lossRMLOb, lossLMLOca, lossRMLOca]
for loss in losses:
    dist.all_reduce(loss)           # sums across ranks in place; returns None, so don't reassign
    loss /= dist.get_world_size()   # average in place on the tensor
```
Hi,
I'm summing multiple losses using DDP on a single machine with 2 GPUs. As a sanity check I've been trying to drive the loss to zero on a subset of my images, but I haven't been able to. Is there something I should be calling to synchronise the loss across GPUs? I've done this with MNIST without any problems.

My model output is a dictionary with 8 components, and I call F.nll_loss on each of them before summing. (One training example consists of 4 images, and each example can have zero, one, or two classes.)
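In other words, the objective is a sum of eight `F.nll_loss` terms over the components of the output dictionary. A rough sketch of that setup follows; all tensor names, shapes, and the key scheme below are assumptions, not the original code:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for the model's 8-component dictionary output:
# log-probabilities for each of 4 views, for two heads ("b" and "ca").
views = ["LCC", "RCC", "LMLO", "RMLO"]
outputs = {f"{v}{h}": torch.log_softmax(torch.randn(8, 3), dim=1)
           for v in views for h in ("b", "ca")}
targets = {k: torch.randint(0, 3, (8,)) for k in outputs}

# Sum F.nll_loss over all 8 components, as described above.
total_loss = sum(F.nll_loss(outputs[k], targets[k]) for k in outputs)
```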
Code
Both my training and validation steps are like:
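The original step isn't reproduced here. As a minimal sketch of a step of this shape (the batch structure, `model`, and all names below are assumptions):

```python
import torch.nn.functional as F

def training_step(model, batch):
    """Hypothetical sketch: the model returns a dict of 8 log-probability
    tensors and one F.nll_loss term is computed per component."""
    images, targets = batch          # assumed batch structure
    outputs = model(images)          # dict with 8 components
    losses = [F.nll_loss(outputs[k], targets[k]) for k in outputs]
    return sum(losses)               # total loss used for backprop
```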
What have you tried?
I've tried the all_reduce averaging shown at the top of this issue, applied both before the sum and after the sum. Neither makes any difference.
What's your environment?
torch 1.5.0
torchvision 0.6.0
Any tips / thoughts much appreciated. Cheers.