[RFC] Deprecate the _epoch_end hooks #8731
Comments
One thing that will emerge is the hook call order between callbacks and the LightningModule (e.g. logging something in the module's epoch-end hook that a callback then consumes), so this will be related to #8506
Even though I agree with your arguments, people just love "framework magic". Removing this can make sense engineering-wise and design-wise but it can break user trust.
This does look simple, but it does not consider the cases where multiple optimizers are available. Basically, all the code we would remove would need to be implemented by the users who need it, depending on how complex their training step is. If we were in a 0.x version I'd entirely agree with you, but we have to consider how widely used these hooks are and the degree of headaches that this change might impose on users. Have you considered making …? If …
I think that's also the point. If we want to support more flavors of training steps, the logic for handling these outputs in the framework gets more and more complex, when it could sit locally in the user's training step. Tracking the outputs per optimizer idx and dataloader idx in automatic optimization also shouldn't be significantly more work, given these are accessible from the training step.
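To illustrate that point, here is a minimal sketch of per-optimizer tracking under automatic optimization. The attribute name `step_outputs` and the `compute_loss` helper are illustrative assumptions; in Lightning 1.x, `training_step` receives `optimizer_idx` when multiple optimizers are configured.

```python
from collections import defaultdict

import torch
from pytorch_lightning import LightningModule


class MultiOptimizerModel(LightningModule):
    def __init__(self):
        super().__init__()
        # outputs tracked by the module itself, keyed by optimizer index
        self.step_outputs = defaultdict(list)

    def training_step(self, batch, batch_idx, optimizer_idx=0):
        loss = self.compute_loss(batch, optimizer_idx)  # hypothetical helper
        self.step_outputs[optimizer_idx].append(loss.detach())
        return loss

    def on_train_epoch_end(self):
        for opt_idx, losses in self.step_outputs.items():
            self.log(f"train_loss_opt{opt_idx}", torch.stack(losses).mean())
        # release the accumulated tensors for the next epoch
        self.step_outputs.clear()
```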
API confusion aside, is the memory problem really an issue? I thought we do not track outputs if the hook is not overridden. In that sense, we already have the opt-in choice. This is the part where I don't get the full argument for removing it. Regardless, I believe it would be interesting to see a LightningModule RNN implementation done manually, without the built-in TBPTT. If we have that, we can examine the amount of boilerplate required and get a better picture of what impact this deprecation has. Perhaps I could help here and try to contribute an example.
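As a rough sketch of what such a manual truncated-BPTT loop could look like with manual optimization (the dimensions, model, and loss are illustrative assumptions, not the requested contribution):

```python
import torch
from pytorch_lightning import LightningModule


class ManualTBPTT(LightningModule):
    def __init__(self, input_size=10, hidden_size=20, tbptt_steps=5):
        super().__init__()
        self.automatic_optimization = False  # we step the optimizer ourselves
        self.rnn = torch.nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = torch.nn.Linear(hidden_size, 1)
        self.tbptt_steps = tbptt_steps

    def training_step(self, batch, batch_idx):
        x, y = batch  # x: (B, T, input_size), y: (B, T, 1)
        opt = self.optimizers()
        hiddens = None
        # split the full sequence into truncated chunks along the time dimension
        for x_chunk, y_chunk in zip(x.split(self.tbptt_steps, 1), y.split(self.tbptt_steps, 1)):
            out, hiddens = self.rnn(x_chunk, hiddens)
            loss = torch.nn.functional.mse_loss(self.head(out), y_chunk)
            opt.zero_grad()
            self.manual_backward(loss)
            opt.step()
            # stop gradients from flowing across chunk boundaries
            hiddens = tuple(h.detach() for h in hiddens)
            self.log("train_loss", loss, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```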
@awaelchi The memory issue is the biggest risk. Though it's opt-in, the current hook order makes it very dangerous:
If the user needs to log a value that's used in a callback's on_train_epoch_end, then they are currently forced to do so in training_epoch_end (assuming they can't use training_step with self.log + on_epoch=True). Because they implement this hook, they immediately incur a major performance hit (at best) or a training failure (because of OOMs).
I believe your second comment is meant for #8732
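For reference, the `self.log` alternative mentioned above looks roughly like this inside the LightningModule (`compute_loss` is a hypothetical helper): the value is reduced across the epoch by the logging machinery and is readable by callbacks from `trainer.callback_metrics`, without keeping the raw step outputs around.

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    # reduced across the epoch; callbacks can read the reduced value in
    # on_train_epoch_end via trainer.callback_metrics, no outputs accumulated
    self.log("train_loss", loss, on_step=False, on_epoch=True)
    return loss
```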
After the logger connector re-design, if you have 2 callbacks …, so technically you are not forced to implement training_epoch_end for this.
Would this affect the logging of torchmetrics, i.e., automatically calling metric.compute() at the end of each epoch?
@carmocca the scenario I imagine is that the LightningModule logs a value that's required by callbacks
@anhnht3 no, this would not affect the logging of torchmetrics
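To illustrate why: torchmetrics objects can be logged directly, and their epoch-level `compute()` and `reset()` happen through the metric/logging integration rather than through `validation_epoch_end`. A minimal sketch (the model and the `Accuracy` constructor arguments are assumptions and depend on the torchmetrics version):

```python
import torch
import torchmetrics
from pytorch_lightning import LightningModule


class LitClassifier(LightningModule):
    def __init__(self, num_classes=10):
        super().__init__()
        self.model = torch.nn.LazyLinear(num_classes)
        self.val_acc = torchmetrics.Accuracy(num_classes=num_classes)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.val_acc.update(self.model(x), y)
        # logging the metric object lets Lightning call .compute() and .reset()
        # at epoch end, with no *_epoch_end hook involved
        self.log("val_acc", self.val_acc, on_epoch=True)
```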
What is the status of this issue? I saw it was added to a sprint in September
It's waiting for a final decision, but it's not likely to be approved given the discussions we've had.
Another issue from transforming the return data from steps in a way users don't expect: #9968
We are auditing the Lightning components and APIs to assess opportunities for improvements:
- #7740
- https://docs.google.com/document/d/1xHU7-iQSpp9KJTjI3As2EM0mfNHHr37WZYpDpwLkivA/edit#

Lightning has had some recent issues filed around these hooks:
- `training_epoch_end`
- `validation_epoch_end`
- `test_epoch_end`
- `predict_epoch_end`

Examples:
- `training_epoch_end` hook in case of multiple optimizers or TBPTT #9737

These hooks exist in order to accumulate the step-level outputs during the epoch for post-processing at the end of the epoch. However, we do not need these to be on the core LightningModule interface: users can easily track outputs directly inside their implemented modules.
Asking users to do this tracking offers major benefits:
- It removes the ambiguity of `training_epoch_end` vs `on_train_epoch_end`: which should users implement? This can improve the onboarding experience (one less class of hooks to learn about, only 1 way to do things).
- If users override `training_epoch_end` and don't use `outputs`, the trainer needlessly accumulates results, which wastes memory and risks OOMing. This slowdown is not clearly visible to the user either, unless training completely fails, at which point this is a bad user experience.

Cons:
Proposal

Deprecate `training_epoch_end`, `validation_epoch_end`, and `test_epoch_end` in v1.5.

This is how easily users can implement this in their LightningModule with the existing hooks:
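A minimal sketch of the pattern, assuming a single optimizer and that only the step losses need post-processing (the attribute name `training_step_outputs` and the `compute_loss` helper are illustrative):

```python
import torch
from pytorch_lightning import LightningModule


class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        # owned by the module: nothing is accumulated unless the user opts in
        self.training_step_outputs = []

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        self.training_step_outputs.append(loss.detach())
        return loss

    def on_train_epoch_end(self):
        epoch_mean = torch.stack(self.training_step_outputs).mean()
        self.log("train_loss_epoch", epoch_mean)
        # free the memory for the next epoch
        self.training_step_outputs.clear()
```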
So we're talking about ~3 lines of code per train/val/test/predict stage. I argue this is minimal compared to the amount of logic that usually goes into post-processing the outputs anyway.
@PyTorchLightning/core-contributors
Originally posted by @ananthsub in #8690