
What are the outputs in the on_train_batch_end callback? #4689

Closed
jbohnslav opened this issue Nov 15, 2020 · 7 comments · Fixed by #4369
Labels
question Further information is requested

Comments

@jbohnslav

❓ Questions and Help

What is your question?

For my application, I need to save the raw outputs of the model to disk for every training and validation example. A callback seems like the right tool for this, since PL already provides hooks such as on_train_batch_end. According to the latest docs, this method takes an outputs argument, which I presume holds the outputs of the pl_module, i.e. the value returned by the training_step function. However, no matter what I change in training_step, outputs is always an empty list. Likewise, the outputs in on_train_epoch_end is an empty list of lists.

class SaverCallback(Callback):
    def __init__(self):
        super().__init__()

    def on_train_epoch_end(self, trainer, pl_module, outputs):
        print('train epoch outputs: {}'.format(outputs))

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        print('train outputs: {}'.format(outputs))

    def on_validation_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        # import pdb; pdb.set_trace()
        print('val outputs: {}'.format(outputs))

    def on_validation_epoch_end(self, trainer, pl_module):
        pass

Here are the relevant portions of my Lightning Module:

    def training_step(self, batch_dict, batch_i):
        ...
        return {'loss': loss, 'testing': 'testing'}

    def validation_step(self, batch_dict, batch_i):
        ...
        return {'loss': loss, 'testing': 'testing'}

Results:

train outputs: []
val outputs: {'loss': tensor(0.0395, device='cuda:0', dtype=torch.float64), 'testing': 'testing'}
train epoch outputs: [[]]

Where are train outputs defined?
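For context on why this might happen: around Lightning 1.0.x, the trainer appears to only accumulate training-step outputs when it detects that training_epoch_end is overridden on the module (the workaround reported later in this thread is consistent with that). Below is a minimal pure-Python sketch of such an override check; the class and function names are illustrative stand-ins, not Lightning's actual internals:

```python
class LightningModuleBase:
    """Illustrative stand-in for the framework base class (not Lightning's real code)."""
    def training_epoch_end(self, outputs):
        pass  # default no-op

def is_overridden(method_name, obj, base=LightningModuleBase):
    # True if the instance's class replaced the base implementation.
    return getattr(type(obj), method_name) is not getattr(base, method_name)

class ModelWithoutHook(LightningModuleBase):
    pass

class ModelWithHook(LightningModuleBase):
    def training_epoch_end(self, outputs):
        pass  # even an empty override flips the check

def collect_step_outputs(model, step_outputs):
    # Sketch: the trainer only caches per-batch outputs if a consumer exists.
    if not is_overridden("training_epoch_end", model):
        return []
    return list(step_outputs)

steps = [{"loss": 0.1}, {"loss": 0.2}]
print(collect_step_outputs(ModelWithoutHook(), steps))  # []
print(collect_step_outputs(ModelWithHook(), steps))     # [{'loss': 0.1}, {'loss': 0.2}]
```

If the trainer uses a gate like this, a module with no training_epoch_end would explain the empty list above.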

Possibly related issues:
#3864
#3592
#4609

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: 1.0.4, installed from master
@jbohnslav jbohnslav added the question Further information is requested label Nov 15, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@rohitgr7
Contributor

#4369 will fix it.

@rohitgr7 rohitgr7 linked a pull request Nov 15, 2020 that will close this issue
@daltonhildreth

I have a similar issue: I'm trying to make a callback that logs many of the same metrics across different modules. However, #4369 doesn't fix it when applied to the stable 1.0.6 branch or to master. Even when the module has training_epoch_end defined (with just a pass), this happens. With or without that function in the module (when the PR has been applied), I get [[{'extra': {'pred': tensor(...)}, 'minimize': tensor(...), 'meta': {...}}]].

The incorrect dict seems to come from how the training-step output is processed in _process_training_step_output_1_0. It is also returned as a list of lists of that incorrect dict, rather than as the original, correct dict. A hacky fix is to force the deprecated pre-1.0.0 processing; unfortunately, that still passes a list of lists of the (now correct) dict to on_train_batch_end, unlike what on_validation_batch_end receives.

It's possible I applied #4369 incorrectly, but it's a fairly simple set of commits and it doesn't seem to fix this issue.
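As a stopgap, the processed structure described above can be unwrapped back into something close to the original dict. A hedged sketch, assuming exactly the [[{'extra': ..., 'minimize': ..., 'meta': ...}]] nesting reported in this comment (unwrap_step_output is a hypothetical helper, not a Lightning API):

```python
def unwrap_step_output(outputs):
    """Flatten [[{'extra': {...}, 'minimize': loss, 'meta': {...}}]] back to a flat dict.

    Assumes the nesting reported in this thread; anything else is returned unchanged.
    """
    if (isinstance(outputs, list) and len(outputs) == 1
            and isinstance(outputs[0], list) and len(outputs[0]) == 1
            and isinstance(outputs[0][0], dict)):
        inner = outputs[0][0]
        result = dict(inner.get("extra", {}))  # user-returned extras
        if "minimize" in inner:
            result["loss"] = inner["minimize"]  # the loss the trainer minimized
        return result
    return outputs

processed = [[{"extra": {"testing": "testing"}, "minimize": 0.0395, "meta": {}}]]
print(unwrap_step_output(processed))  # {'testing': 'testing', 'loss': 0.0395}
```

This only papers over the symptom; the real fix has to land in the output-processing code itself.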

@stale

stale bot commented Dec 17, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Dec 17, 2020
@stale stale bot closed this as completed Dec 24, 2020
@AsaphLightricks

Any updates on this issue? I'm using PL 1.1.2 and the issue still persists. Please fix!

@rohitgr7 rohitgr7 reopened this Jan 7, 2021
@stale stale bot removed the won't fix This will not be worked on label Jan 7, 2021
@rohitgr7 rohitgr7 added this to the 1.2 milestone Jan 7, 2021
@hackgoofer

I am seeing the same issue! Though on_validation_batch_end's outputs is returning correctly. :)

@sagewe

sagewe commented Jan 15, 2021

I am seeing the same issue and found the relevant comments while tracing through the code. Adding an empty training_epoch_end implementation to the user-defined LightningModule fixed the issue for me.
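Another way to sidestep the empty outputs argument entirely is to stash the step result on the module inside training_step and read it from the callback. A minimal sketch of that pattern with stand-in classes (no Lightning imports; MyModule and the last_train_output attribute are hypothetical names, not part of any API):

```python
class SaverCallback:
    """Callback that reads whatever the module stashed on itself (pattern sketch)."""
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        out = getattr(pl_module, "last_train_output", None)
        if out is not None:
            # In a real callback you would write `out` to disk here.
            print("train outputs:", out)

class MyModule:
    """Stand-in for a LightningModule; only the stashing pattern matters."""
    def training_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)  # placeholder "loss" computation
        result = {"loss": loss, "testing": "testing"}
        self.last_train_output = result  # stash for the callback to pick up
        return result

module = MyModule()
result = module.training_step([1.0, 2.0, 3.0], 0)
SaverCallback().on_train_batch_end(None, module, [], [1.0, 2.0, 3.0], 0, 0)
```

This works regardless of what the trainer passes as outputs, at the cost of coupling the module to the callback.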
