
Output from validation_step_end not propagated to on_validation_batch_end #9608

Closed
jzazo opened this issue Sep 20, 2021 · 7 comments
Labels
bug (Something isn't working) · working as intended (Working as intended)

Comments

@jzazo

jzazo commented Sep 20, 2021

🐛 Bug

Output from validation_step_end is not propagated to on_validation_batch_end; instead, the hook receives the output of validation_step. The same happens for the test counterpart.

The documentation states outputs (Union[Tensor, Dict[str, Any], None]) – The outputs of validation_step_end(validation_step(x)), but the hook is not called that way.

To Reproduce

MWE with boring model: https://colab.research.google.com/drive/1xIq7dj6aY6ZPSEPlBF0dPGI87wwkLepL?usp=sharing
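For reference, the failure mode can be illustrated without Lightning at all. The dispatcher below is a simplified stand-in for the trainer's hook dispatch (not the actual Lightning source), assuming the None-as-sentinel behavior described later in this thread:

```python
# Simplified sketch (NOT the actual Lightning code) of the dispatch around
# validation_step / validation_step_end / on_validation_batch_end.
# The trainer treats None as "the hook returned nothing", so an explicit
# `return None` from validation_step_end is indistinguishable from not
# overriding the hook at all.

def run_validation_batch(module, batch, batch_idx):
    output = module.validation_step(batch, batch_idx)
    hook_output = module.validation_step_end(output)
    if hook_output is not None:  # sentinel check: None means "no output"
        output = hook_output
    module.on_validation_batch_end(output, batch, batch_idx)
    return output

class BuggyModule:
    def validation_step(self, batch, batch_idx):
        return {"loss": batch}  # stand-in for a real tensor

    def validation_step_end(self, output):
        return None  # intended to clear the output, but swallowed above

    def on_validation_batch_end(self, output, batch, batch_idx):
        self.seen = output  # still receives validation_step's output

m = BuggyModule()
run_validation_batch(m, batch=0.5, batch_idx=0)
print(m.seen)  # {'loss': 0.5} -- the explicit None never reached this hook
```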

Expected behavior

The output should be that from validation_step_end, not from validation_step.

Environment

  • PyTorch Lightning Version: 1.4.5
  • PyTorch Version: 1.9.0
  • Python version: 3.8.5
  • OS: Ubuntu 20.04
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: Tesla K80
  • How you installed PyTorch: pipenv
@jzazo jzazo added bug Something isn't working help wanted Open to be worked on labels Sep 20, 2021
@jzazo
Author

jzazo commented Sep 20, 2021

The problem with this bug is that I was using validation_step_end to clear the output of the validation steps so that the GPU memory footprint during a validation epoch stays smaller (to avoid a CUDA OOM during validation).

@tchaton
Contributor

tchaton commented Sep 20, 2021

@carmocca

@carmocca
Contributor

git bisect points to this PR breaking the feature: #7826

@carmocca
Contributor

The problem is that we use None as a sentinel value to know whether the hook produced an output, but since your example explicitly returns None, we pass on what we had previously.

https://github.com/PyTorchLightning/pytorch-lightning/blob/61b4e33d949fa8f0e8b11ae196271368400445f1/pytorch_lightning/trainer/trainer.py#L1343

If you had:

    def validation_step_end(self, *args, **kwargs):
        return "something_that_is_not_None"

Then you'll see it working properly.

Can you elaborate on why you need to return None?

@carmocca carmocca added waiting on author Waiting on user action, correction, or update and removed help wanted Open to be worked on labels Sep 21, 2021
@jzazo
Author

jzazo commented Sep 21, 2021

I can return something else, but I was afraid that the GPU's memory would fill up, so I thought returning None would be better. With an epoch of 1M steps I am not sure what the impact on GPU memory would be.

It is true that during training_step_end I am returning the loss and the memory does not fill up, so doing the same during validation could work. Trying that now.

Given what you describe, this looks like a hard fix, doesn't it? It would not be easy to tell whether training_step_end has been overridden if I keep returning None. Would returning an empty dict help?

@carmocca
Contributor

I was afraid that the GPU's memory would fill up

Shouldn't be a problem; these resources are freed once they are no longer used.

It would not be easy to tell whether training_step_end has been overridden if I keep returning None?

That's right.

Would it help returning an empty dict?

Whatever suits you!

Closing for now, feel free to keep discussing 👍
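For anyone landing here later: the workaround settled on above — returning something other than None from validation_step_end, e.g. an empty dict — can be sketched in plain Python. The dispatch function below is a simplified stand-in for the trainer's sentinel check, not actual Lightning code:

```python
# Contrast: a validation_step_end returning None (swallowed by the trainer's
# None-sentinel check) vs. one returning an empty dict (passed through).
# `dispatch` is a simplified stand-in, not the real Lightning dispatch.

def dispatch(step_output, step_end_fn):
    hook_output = step_end_fn(step_output)
    return step_output if hook_output is None else hook_output

big_output = {"loss": 1.0, "preds": list(range(1000))}  # stand-in for tensors

# Returning None: on_validation_batch_end would still see the big output.
assert dispatch(big_output, lambda out: None) is big_output

# Returning an empty dict: the big output is replaced, so the last
# reference to it can be dropped sooner.
assert dispatch(big_output, lambda out: {}) == {}
```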

@carmocca carmocca added working as intended Working as intended and removed waiting on author Waiting on user action, correction, or update labels Sep 21, 2021
@jzazo
Author

jzazo commented Sep 21, 2021

No, the proposed solution of filtering the output worked in my run. Thanks for your help!
