
Trainer should overwrite max_epoch when max_time is specified #8210

Closed
kevinNejad opened this issue Jun 30, 2021 · 1 comment · Fixed by #9072

@kevinNejad

🚀 Feature

Currently, the Trainer terminates at max_epochs (default 1000) even if max_time is specified. Even if min_epochs is set to a value greater than 1000, training still stops at 1000 epochs. The Trainer should either overwrite the max_epochs value, or the documentation should state that min_epochs only takes effect when early stopping is enabled. It could also raise a warning when min_epochs is set while early stopping is disabled.

Motivation

To continue training for a fixed amount of time (e.g. 7 days) when the required number of epochs is unknown and no early-stopping criterion is specified.

Pitch

I think overwriting max_epochs when min_epochs or max_time implies a longer run is reasonable. I wanted to train a big model for 2 days, but it was stopped after 1 day despite the max_time and min_epochs settings.
If you are concerned with the duration of training, or you are interested in the asymptotic behavior of your model over an infinite time horizon, it makes sense to stop training when max_time is reached, regardless of max_epochs.
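Concretely, a configuration along these lines (the exact values are illustrative, not taken from the report) exhibits the problem:

```python
import pytorch_lightning as pl

# Illustrative values only: with the default max_epochs (1000) left in place,
# training stops at epoch 1000 even though max_time and min_epochs suggest
# a longer run.
trainer = pl.Trainer(
    max_time={"days": 2},  # intended stopping condition
    min_epochs=5000,       # > 1000, but does not extend training today
)
```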

Alternatives

The alternative is to set a very large max_epochs to ensure training won't stop before max_time is reached.
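As a rough sketch of that workaround (the sentinel value is arbitrary):

```python
import pytorch_lightning as pl

# Workaround sketch: make max_epochs so large it can never be the stopping
# condition, so max_time alone decides when training ends.
trainer = pl.Trainer(
    max_time={"days": 7},
    max_epochs=1_000_000,  # arbitrary sentinel, effectively "no epoch limit"
)
```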

@kevinNejad added the labels feature (Is an improvement or enhancement) and help wanted (Open to be worked on) on Jun 30, 2021
@kevinNejad changed the title from "Trainer overwrite max_epoch when max_time is specified" to "Trainer should overwrite max_epoch when max_time is specified" on Jun 30, 2021
@ananthsub
Contributor

@awaelchli - should we pass max_time to the fit loop constructor and skip setting the default of max_epochs if max_time is not None?

https://github.com/PyTorchLightning/pytorch-lightning/blob/4af8eff0a13309b2e030c423db91156b1e869fac/pytorch_lightning/loops/fit_loop.py#L51
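Something like the following default resolution, sketched here outside the actual FitLoop code (the helper name is hypothetical):

```python
from typing import Optional

def _resolve_max_epochs(
    max_epochs: Optional[int],
    max_steps: Optional[int],
    max_time,  # str, timedelta, or dict in the Trainer API
) -> Optional[int]:
    # Hypothetical helper: only fall back to the implicit 1000-epoch default
    # when no other stopping condition was provided.
    if max_epochs is None and max_steps is None and max_time is None:
        return 1000
    return max_epochs  # may be None, meaning "no epoch limit"
```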

This also raises the question of what the default should be if no stopping condition is passed to the trainer (e.g. no max steps / max epochs / max time): should this be an infinite while loop that requires an external signal to kill the job? Is it conceivable that someone wants to train forever, or for an absurdly long time where they know neither the number of epochs nor the time it takes to converge? Should we allow this only if an early stopping callback is attached to the trainer?

> Even if min_epochs is set to a value greater than 1000, the Trainer should either overwrite the max_epochs value, or the documentation should state that min_epochs only takes effect when early stopping is enabled. It could also raise a warning when min_epochs is set while early stopping is disabled.

Would this still be needed if we addressed the max_time issue? Overwriting the arguments doesn't feel right to me. If a user specifies min_epochs > max_epochs, we should raise a misconfiguration exception.
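A minimal sketch of that check (the helper is hypothetical; MisconfigurationException is the exception Lightning raises for invalid Trainer arguments):

```python
from typing import Optional

from pytorch_lightning.utilities.exceptions import MisconfigurationException

def _check_epoch_bounds(min_epochs: Optional[int], max_epochs: Optional[int]) -> None:
    # Hypothetical validation helper: reject contradictory epoch bounds
    # instead of silently overwriting one of them.
    if min_epochs is not None and max_epochs is not None and min_epochs > max_epochs:
        raise MisconfigurationException(
            f"`min_epochs={min_epochs}` must be less than or equal to "
            f"`max_epochs={max_epochs}`."
        )
```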
