
Output from validation_step_end not propagated to on_validation_batch_end #9608

Closed
jzazo opened this issue Sep 20, 2021 · 7 comments
Labels
bug (Something isn't working) · working as intended (Working as intended)

Comments

@jzazo

jzazo commented Sep 20, 2021

🐛 Bug

Output from validation_step_end is not propagated to on_validation_batch_end; instead, the hook receives the output of validation_step. The same happens for the test counterpart.

The documentation states outputs (Union[Tensor, Dict[str, Any], None]) – The outputs of validation_step_end(validation_step(x)), but the hook is not called that way.

To Reproduce

MWE with boring model: https://colab.research.google.com/drive/1xIq7dj6aY6ZPSEPlBF0dPGI87wwkLepL?usp=sharing
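For reference, the failure mode can be illustrated without Lightning at all. The dispatcher below is a simplified stand-in for the trainer's hook dispatch (not the actual Lightning source), assuming the None-as-sentinel behavior described later in this thread:

```python
# Simplified sketch (NOT the actual Lightning code) of the dispatch around
# validation_step / validation_step_end / on_validation_batch_end.
# The trainer treats None as "the hook returned nothing", so an explicit
# `return None` from validation_step_end is indistinguishable from not
# overriding the hook at all.

def run_validation_batch(module, batch, batch_idx):
    output = module.validation_step(batch, batch_idx)
    hook_output = module.validation_step_end(output)
    if hook_output is not None:  # sentinel check: None means "no output"
        output = hook_output
    module.on_validation_batch_end(output, batch, batch_idx)
    return output

class BuggyModule:
    def validation_step(self, batch, batch_idx):
        return {"loss": batch}  # stand-in for a real tensor

    def validation_step_end(self, output):
        return None  # intended to clear the output, but swallowed above

    def on_validation_batch_end(self, output, batch, batch_idx):
        self.seen = output  # still receives validation_step's output

m = BuggyModule()
run_validation_batch(m, batch=0.5, batch_idx=0)
print(m.seen)  # {'loss': 0.5} -- the explicit None never reached this hook
```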

Expected behavior

The output should be that from validation_step_end, not from validation_step.

Environment

  • PyTorch Lightning Version: 1.4.5
  • PyTorch Version: 1.9.0
  • Python version: 3.8.5
  • OS: Ubuntu 20.04
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: Tesla K80
  • How you installed PyTorch: pipenv
@jzazo jzazo added bug Something isn't working help wanted Open to be worked on labels Sep 20, 2021
@jzazo
Author

jzazo commented Sep 20, 2021

The problem with this bug is that I was using validation_step_end to clear the output of the validation steps so that the GPU memory footprint during a validation epoch stays smaller (to avoid a CUDA OOM during validation).

@tchaton
Contributor

tchaton commented Sep 20, 2021

@carmocca

@carmocca
Contributor

git bisect points to this PR breaking the feature: #7826

@carmocca
Contributor

The problem is that we use None as a sentinel value to know whether the hook produced an output, but since your example explicitly returns None, we pass on what we had previously.

https://github.com/PyTorchLightning/pytorch-lightning/blob/61b4e33d949fa8f0e8b11ae196271368400445f1/pytorch_lightning/trainer/trainer.py#L1343

If you had:

    def validation_step_end(self, *args, **kwargs):
        return "something_that_is_not_None"

Then you'll see it working properly.

Can you elaborate on why you need to return None?

@carmocca carmocca added waiting on author Waiting on user action, correction, or update and removed help wanted Open to be worked on labels Sep 21, 2021
@jzazo
Author

jzazo commented Sep 21, 2021

I can return something else, but I was afraid that the GPU's memory would fill up, so I thought returning None would be better. With an epoch of 1M steps I am not sure what the impact on GPU memory would be.

It is true that during training_step_end I am returning the loss and the memory does not fill up, so doing the same during validation could work. Trying that now.

Given what you describe, this looks like a hard fix, doesn't it? It would not be easy to tell whether training_step_end has been overridden if I keep returning None. Would returning an empty dict help?

@carmocca
Contributor

I was afraid that the GPU's memory would fill up

Shouldn't be a problem; these resources are freed once they are no longer used.

It would not be easy to tell whether training_step_end has been overridden if I keep returning None?

That's right.

Would it help returning an empty dict?

Whatever suits you!

Closing for now, feel free to keep discussing 👍
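For anyone landing here later: the workaround settled on above — returning something other than None from validation_step_end, e.g. an empty dict — can be sketched in plain Python. The dispatch function below is a simplified stand-in for the trainer's sentinel check, not actual Lightning code:

```python
# Contrast: a validation_step_end returning None (swallowed by the trainer's
# None-sentinel check) vs. one returning an empty dict (passed through).
# `dispatch` is a simplified stand-in, not the real Lightning dispatch.

def dispatch(step_output, step_end_fn):
    hook_output = step_end_fn(step_output)
    return step_output if hook_output is None else hook_output

big_output = {"loss": 1.0, "preds": list(range(1000))}  # stand-in for tensors

# Returning None: on_validation_batch_end would still see the big output.
assert dispatch(big_output, lambda out: None) is big_output

# Returning an empty dict: the big output is replaced, so the last
# reference to it can be dropped sooner.
assert dispatch(big_output, lambda out: {}) == {}
```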

@carmocca carmocca added working as intended Working as intended and removed waiting on author Waiting on user action, correction, or update labels Sep 21, 2021
@jzazo
Author

jzazo commented Sep 21, 2021

No, the proposed solution of filtering the output worked in my run. Thanks for your help!
