
What are the outputs in the on_train_batch_end callback? #4689

Closed
jbohnslav opened this issue Nov 15, 2020 · 7 comments · Fixed by #4369
Labels
question Further information is requested

Comments

@jbohnslav

❓ Questions and Help

What is your question?

For my application, I need to save the raw outputs of the model to disk for every training and validation example. A callback seems like the right tool for this, since PL already provides hooks such as on_train_batch_end. According to the latest docs, this method takes an outputs argument, which I presume holds the outputs of the pl_module, i.e. the value returned by the training_step function. However, no matter what I change in training_step, outputs is always an empty list. Likewise, the outputs in on_train_epoch_end is an empty list of lists.

class SaverCallback(Callback):
    def __init__(self):
        super().__init__()

    def on_train_epoch_end(self, trainer, pl_module, outputs):
        print('train epoch outputs: {}'.format(outputs))

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        print('train outputs: {}'.format(outputs))

    def on_validation_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        # import pdb; pdb.set_trace()
        print('val outputs: {}'.format(outputs))

    def on_validation_epoch_end(self, trainer, pl_module):
        pass

Here are the relevant portions of my Lightning Module:

    def training_step(self, batch_dict, batch_i):
        ...
        return {'loss': loss, 'testing': 'testing'}

    def validation_step(self, batch_dict, batch_i):
        ...
        return {'loss': loss, 'testing': 'testing'}

Results:

train outputs: []
val outputs: {'loss': tensor(0.0395, device='cuda:0', dtype=torch.float64), 'testing': 'testing'}
train epoch outputs: [[]]

Where are train outputs defined?
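For context on why this might happen: around Lightning 1.0.x, the trainer appears to only accumulate training-step outputs when it detects that training_epoch_end is overridden on the module (the workaround reported later in this thread is consistent with that). Below is a minimal pure-Python sketch of such an override check; the class and function names are illustrative stand-ins, not Lightning's actual internals:

```python
class LightningModuleBase:
    """Illustrative stand-in for the framework base class (not Lightning's real code)."""
    def training_epoch_end(self, outputs):
        pass  # default no-op

def is_overridden(method_name, obj, base=LightningModuleBase):
    # True if the instance's class replaced the base implementation.
    return getattr(type(obj), method_name) is not getattr(base, method_name)

class ModelWithoutHook(LightningModuleBase):
    pass

class ModelWithHook(LightningModuleBase):
    def training_epoch_end(self, outputs):
        pass  # even an empty override flips the check

def collect_step_outputs(model, step_outputs):
    # Sketch: the trainer only caches per-batch outputs if a consumer exists.
    if not is_overridden("training_epoch_end", model):
        return []
    return list(step_outputs)

steps = [{"loss": 0.1}, {"loss": 0.2}]
print(collect_step_outputs(ModelWithoutHook(), steps))  # []
print(collect_step_outputs(ModelWithHook(), steps))     # [{'loss': 0.1}, {'loss': 0.2}]
```

If the trainer uses a gate like this, a module with no training_epoch_end would explain the empty list above.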

Possibly related issues:
#3864
#3592
#4609

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: 1.0.4, installed from master
@jbohnslav jbohnslav added the question Further information is requested label Nov 15, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@rohitgr7
Contributor

#4369 will fix it.

@rohitgr7 rohitgr7 linked a pull request Nov 15, 2020 that will close this issue
@daltonhildreth

I have a similar issue: I'm trying to make a callback that logs many of the same metrics across different modules. However, #4369 doesn't fix it when applied to the stable 1.0.6 branch or to master. Even when the module has training_epoch_end defined (with just a pass), this happens. With or without that function in the module (when the PR has been applied), I get [[{'extra': {'pred': tensor(...)}, 'minimize': tensor(...), 'meta': {...}}]].

The incorrect dict seems to come from how the training-step output is processed in _process_training_step_output_1_0. It is also returned as a list of lists of that incorrect dict, rather than as the original, correct dict. A hacky fix is to force the deprecated pre-1.0.0 processing; unfortunately, that still passes a list of lists of the (now correct) dict to on_train_batch_end, unlike what on_validation_batch_end receives.

It's possible I applied #4369 incorrectly, but it's a fairly simple set of commits and it doesn't seem to fix this issue.
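As a stopgap, the processed structure described above can be unwrapped back into something close to the original dict. A hedged sketch, assuming exactly the [[{'extra': ..., 'minimize': ..., 'meta': ...}]] nesting reported in this comment (unwrap_step_output is a hypothetical helper, not a Lightning API):

```python
def unwrap_step_output(outputs):
    """Flatten [[{'extra': {...}, 'minimize': loss, 'meta': {...}}]] back to a flat dict.

    Assumes the nesting reported in this thread; anything else is returned unchanged.
    """
    if (isinstance(outputs, list) and len(outputs) == 1
            and isinstance(outputs[0], list) and len(outputs[0]) == 1
            and isinstance(outputs[0][0], dict)):
        inner = outputs[0][0]
        result = dict(inner.get("extra", {}))  # user-returned extras
        if "minimize" in inner:
            result["loss"] = inner["minimize"]  # the loss the trainer minimized
        return result
    return outputs

processed = [[{"extra": {"testing": "testing"}, "minimize": 0.0395, "meta": {}}]]
print(unwrap_step_output(processed))  # {'testing': 'testing', 'loss': 0.0395}
```

This only papers over the symptom; the real fix has to land in the output-processing code itself.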

@stale

stale bot commented Dec 17, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Dec 17, 2020
@stale stale bot closed this as completed Dec 24, 2020
@AsaphLightricks

Any updates on this issue? I'm using PL 1.1.2 and the issue still persists. Please fix!

@rohitgr7 rohitgr7 reopened this Jan 7, 2021
@stale stale bot removed the won't fix This will not be worked on label Jan 7, 2021
@rohitgr7 rohitgr7 added this to the 1.2 milestone Jan 7, 2021
@hackgoofer

I am seeing the same issue! Though on_validation_batch_end's outputs is returning correctly. :)

@sagewe

sagewe commented Jan 15, 2021

I am seeing the same issue and found the relevant comments while tracing through the code. Adding an empty training_epoch_end implementation to the user-defined LightningModule fixed the issue for me.
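Another way to sidestep the empty outputs argument entirely is to stash the step result on the module inside training_step and read it from the callback. A minimal sketch of that pattern with stand-in classes (no Lightning imports; MyModule and the last_train_output attribute are hypothetical names, not part of any API):

```python
class SaverCallback:
    """Callback that reads whatever the module stashed on itself (pattern sketch)."""
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        out = getattr(pl_module, "last_train_output", None)
        if out is not None:
            # In a real callback you would write `out` to disk here.
            print("train outputs:", out)

class MyModule:
    """Stand-in for a LightningModule; only the stashing pattern matters."""
    def training_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)  # placeholder "loss" computation
        result = {"loss": loss, "testing": "testing"}
        self.last_train_output = result  # stash for the callback to pick up
        return result

module = MyModule()
result = module.training_step([1.0, 2.0, 3.0], 0)
SaverCallback().on_train_batch_end(None, module, [], [1.0, 2.0, 3.0], 0, 0)
```

This works regardless of what the trainer passes as outputs, at the cost of coupling the module to the callback.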
