Checkpointing by time interval #7621

Closed
hudeven opened this issue May 20, 2021 · 2 comments
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)


hudeven commented May 20, 2021

🚀 Feature

For the ModelCheckpoint callback, support a time_period option that saves a checkpoint every X seconds/minutes/hours.

Motivation

Training large models takes days, and a run sometimes crashes in the middle of an epoch because of infrastructure issues. Besides per-epoch checkpointing, I would like to checkpoint at a finer granularity. The ModelCheckpoint callback currently supports "every_n_train_steps"; however, the time per training step varies with the configuration (batch_size, accumulate_grad_batches, etc.).

Pitch

It would be better if we could support checkpointing by time period (with validation optional, since this is mostly for resuming training after a failure), alongside the existing checkpointing by epoch/steps with validation.
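To make the request concrete, here is a minimal, library-agnostic sketch of a time-based trigger. The class name `TimeIntervalTrigger`, its `should_save` method, and the injectable `clock` argument are all hypothetical names chosen for illustration; this is not the ModelCheckpoint API, only one way the idea could work.

```python
import time
from typing import Callable


class TimeIntervalTrigger:
    """Hypothetical helper: reports when a checkpoint is due based on elapsed time."""

    def __init__(
        self,
        interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        self.interval_seconds = interval_seconds
        self._clock = clock
        # Start counting from construction time, i.e. roughly the start of training.
        self._last_save = clock()

    def should_save(self) -> bool:
        """Return True (and reset the timer) once the interval has elapsed."""
        now = self._clock()
        if now - self._last_save >= self.interval_seconds:
            self._last_save = now
            return True
        return False
```

A training loop could call `trigger.should_save()` at the end of each batch and write a checkpoint whenever it returns True, without running validation. Using a monotonic clock keeps the interval unaffected by system clock adjustments.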

Alternatives

I have to start a trial run to measure the training time per step and then pick a suitable value for "every_n_train_steps".
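The workaround amounts to a small calculation, sketched below. The function name `steps_per_interval` and the example timings are assumptions for illustration; the measured seconds per step would have to come from an actual trial run and changes whenever the batch size or gradient accumulation changes.

```python
import math


def steps_per_interval(seconds_per_step: float, interval_seconds: float) -> int:
    """Largest step count whose estimated duration fits within the interval.

    Hypothetical helper: rounds down so checkpoints are spaced no further
    apart than the requested interval, and never returns fewer than 1 step.
    """
    if seconds_per_step <= 0 or interval_seconds <= 0:
        raise ValueError("both arguments must be positive")
    return max(1, math.floor(interval_seconds / seconds_per_step))


# Example: a trial run measured about 0.5 s per step (assumed value),
# and a checkpoint is wanted roughly every 10 minutes.
n_steps = steps_per_interval(seconds_per_step=0.5, interval_seconds=600)
print(n_steps)  # value to pass as every_n_train_steps
```

The result is only as good as the timing estimate, which is the drawback this issue describes.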

Additional context

cc: @shuyingsunshine21 @ananthsub

hudeven added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels May 20, 2021
Contributor

ananthsub commented May 20, 2021

@hudeven this merged today: #7515
Duplicate of #6286
Closing this issue out :)

Author

hudeven commented May 20, 2021

Wow, super fast response. Thanks!
