
training_epoch_end's outputs doesn't have 'loss' key #2372

Closed
xiadingZ opened this issue Jun 26, 2020 · 13 comments · Fixed by #2428
Assignees: williamFalcon
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

xiadingZ commented Jun 26, 2020

pytorch-lightning: build from master

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    main(hparams)
  File "main.py", line 72, in main
    trainer.fit(model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 881, in fit
    self.ddp_train(task, model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 539, in ddp_train
    self.run_pretrain_routine(model)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1091, in run_pretrain_routine
    self.train()
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 376, in train
    self.run_training_epoch()
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 510, in run_training_epoch
    self.run_training_epoch_end(epoch_output)
  File "/mnt/lustre/maxiao1/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 535, in run_training_epoch_end
    epoch_output = model.training_epoch_end(epoch_output)
  File "/mnt/lustre/maxiao1/PVM/models/baseline.py", line 335, in training_epoch_end
    avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
  File "/mnt/lustre/maxiao1/PVM/models/baseline.py", line 335, in <listcomp>
    avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
KeyError: 'loss'

This is my code:

    def training_step(self, batch, batch_idx):
        ...
        return {'loss': loss, "train_acc": acc}

    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['train_acc'] for x in outputs]).mean()
        logs = {'loss': avg_loss, 'train_acc': avg_acc}
        progress_bar = {'train_loss': avg_loss, 'train_acc': avg_acc}
        results = {
            'log': logs,
            'progress_bar': progress_bar
        }
        return results
xiadingZ added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 26, 2020
williamFalcon self-assigned this on Jun 26, 2020
williamFalcon added the priority: 0 (High priority task) label on Jun 26, 2020
rohitgr7 (Contributor)

Try: avg_loss = torch.stack([x['batch_loss'] for x in outputs]).mean()

xiadingZ (Author) commented Jun 27, 2020

Thanks, that works, but the 'train_acc' key doesn't exist either, and neither does batch_train_acc. How can I access the other keys returned from training_step?

rohitgr7 (Contributor) commented Jun 27, 2020

As of now in Lightning you can access them with x['callback_metrics']['loss'] and x['callback_metrics']['train_acc'], but I think this should be handled the same way as validation_epoch_end and test_epoch_end.
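
For reference, a minimal sketch of that workaround on the affected master build, assuming (as described above) that the per-step loss is exposed as 'batch_loss' and the keys returned from training_step end up under 'callback_metrics'; exact key names may differ across versions:

    def training_epoch_end(self, outputs):
        # Workaround sketch: on the affected build, the loss from training_step
        # is wrapped under 'batch_loss', and custom keys such as 'train_acc'
        # live under 'callback_metrics'.
        avg_loss = torch.stack([x['batch_loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['callback_metrics']['train_acc'] for x in outputs]).mean()
        logs = {'loss': avg_loss, 'train_acc': avg_acc}
        progress_bar = {'train_loss': avg_loss, 'train_acc': avg_acc}
        return {'log': logs, 'progress_bar': progress_bar}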

Pet222 commented Jun 29, 2020

Hi! One hint: for me it works with "loss" under Windows but not under Ubuntu.

rohitgr7 (Contributor)

Weird!! Why would this be platform dependent?? 🤔

Red-Eyed (Contributor)

@Pet222, are you sure the versions on Ubuntu and Windows are the same?

captainvera

Hey @williamFalcon, is this intended behaviour? I was surprised to see this breaking change introduced with no warning. If it is intended, why not keep the behaviour consistent with validation_epoch_end and test_epoch_end?

If it is not intended, as the "bug" label seems to suggest, are you working on it or should I make a PR for this?

williamFalcon (Contributor)

What is the behavior? That the "loss" key is not in training_epoch_end? If so, that's a bug, because it should be there.

Red-Eyed (Contributor) commented Jun 30, 2020

@williamFalcon, on the latest version the loss key was changed to batch_loss. I think it was changed here.

captainvera commented Jun 30, 2020 via email

williamFalcon (Contributor)

@captainvera would love a PR :)

williamFalcon (Contributor) commented Jun 30, 2020

@captainvera @xiadingZ sorry about that! It was a bad bug.

Made a PR #2428 and added tests to make sure this doesn't happen again!

Try master now! We'll push a new minor release again since this is a key bug (and we have a few other key bugs).

captainvera

Well, that was fast, thanks!
