
Trainer should overwrite max_epoch when max_time is specified #8210

Closed
kevinNejad opened this issue Jun 30, 2021 · 1 comment · Fixed by #9072

@kevinNejad

🚀 Feature

Currently, the Trainer terminates at max_epochs (default 1000) even if max_time is specified. Even if min_epochs is set to a value greater than 1000, training still stops at 1000 epochs. The Trainer should either overwrite the max_epochs value, or the documentation should state that min_epochs only takes effect when early stopping is enabled. It could also raise a warning when min_epochs is set while early stopping is disabled.

Motivation

To continue training for a fixed amount of time (e.g. 7 days) when the required number of epochs is unknown and no early-stopping criterion is specified.

Pitch

I think overwriting max_epochs when min_epochs or max_time implies a longer run is reasonable. I wanted to train a big model for 2 days, but it was stopped after 1 day despite the max_time and min_epochs settings.
If you are concerned with the duration of training, or you are interested in the asymptotic behavior of your model over an infinite time horizon, it makes sense to stop training when max_time is reached, regardless of max_epochs.
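Concretely, a configuration along these lines (the exact values are illustrative, not taken from the report) exhibits the problem:

```python
import pytorch_lightning as pl

# Illustrative values only: with the default max_epochs (1000) left in place,
# training stops at epoch 1000 even though max_time and min_epochs suggest
# a longer run.
trainer = pl.Trainer(
    max_time={"days": 2},  # intended stopping condition
    min_epochs=5000,       # > 1000, but does not extend training today
)
```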

Alternatives

The alternative is to set a very large max_epochs to ensure training won't stop before max_time is reached.
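As a rough sketch of that workaround (the sentinel value is arbitrary):

```python
import pytorch_lightning as pl

# Workaround sketch: make max_epochs so large it can never be the stopping
# condition, so max_time alone decides when training ends.
trainer = pl.Trainer(
    max_time={"days": 7},
    max_epochs=1_000_000,  # arbitrary sentinel, effectively "no epoch limit"
)
```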

@kevinNejad added the labels feature (Is an improvement or enhancement) and help wanted (Open to be worked on) on Jun 30, 2021
@kevinNejad changed the title from "Trainer overwrite max_epoch when max_time is specified" to "Trainer should overwrite max_epoch when max_time is specified" on Jun 30, 2021
@ananthsub
Contributor

@awaelchli - should we pass max_time to the fit loop constructor and skip setting the default of max_epochs if max_time is not None?

https://github.com/PyTorchLightning/pytorch-lightning/blob/4af8eff0a13309b2e030c423db91156b1e869fac/pytorch_lightning/loops/fit_loop.py#L51
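Something like the following default resolution, sketched here outside the actual FitLoop code (the helper name is hypothetical):

```python
from typing import Optional

def _resolve_max_epochs(
    max_epochs: Optional[int],
    max_steps: Optional[int],
    max_time,  # str, timedelta, or dict in the Trainer API
) -> Optional[int]:
    # Hypothetical helper: only fall back to the implicit 1000-epoch default
    # when no other stopping condition was provided.
    if max_epochs is None and max_steps is None and max_time is None:
        return 1000
    return max_epochs  # may be None, meaning "no epoch limit"
```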

This also raises the question of what the default should be if no stopping condition is passed to the trainer (e.g. no max steps / max epochs / max time): should this be an infinite while loop that requires an external signal to kill the job? Is it conceivable that someone wants to train forever, or for an absurdly long time where they know neither the number of epochs nor the time it takes to converge? Should we allow this only if an early stopping callback is attached to the trainer?

> Even if min_epochs is set to a value greater than 1000, the Trainer should either overwrite the max_epochs value, or the documentation should state that min_epochs only takes effect when early stopping is enabled. It could also raise a warning when min_epochs is set while early stopping is disabled.

Would this still be needed if we addressed the max_time issue? Overwriting the arguments doesn't feel right to me. If a user specifies min_epochs > max_epochs, we should raise a misconfiguration exception.
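A minimal sketch of that check (the helper is hypothetical; MisconfigurationException is the exception Lightning raises for invalid Trainer arguments):

```python
from typing import Optional

from pytorch_lightning.utilities.exceptions import MisconfigurationException

def _check_epoch_bounds(min_epochs: Optional[int], max_epochs: Optional[int]) -> None:
    # Hypothetical validation helper: reject contradictory epoch bounds
    # instead of silently overwriting one of them.
    if min_epochs is not None and max_epochs is not None and min_epochs > max_epochs:
        raise MisconfigurationException(
            f"`min_epochs={min_epochs}` must be less than or equal to "
            f"`max_epochs={max_epochs}`."
        )
```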
