Support time-based checkpointing trigger #6286

Closed
ananthsub opened this issue Mar 2, 2021 · 0 comments · Fixed by #7515
Labels: callback, feature (is an improvement or enhancement), help wanted (open to be worked on)

🚀 Feature

Support time-based checkpointing in the model checkpoint callback.

Motivation

After #6146 we'll have support in Lightning to checkpoint after N training batches, or after M validation epochs. It would also be useful to checkpoint after a time interval T has elapsed during the training phase (e.g. checkpoint every hour).

Pitch

This would entail:

  • Adding a new optional argument time_interval, of type timedelta, to the callback constructor. For all practical purposes, this interval cannot be smaller than the time it takes to process a single training batch, because the check only runs at batch boundaries. The checkpoint is not guaranteed to be saved at the exact time specified, but it should be close.
  • Inside on_train_batch_end, adding the following check:
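# Runs inside on_train_batch_end; requires `import time` at module level.
now = time.monotonic()
time_interval = self.time_interval
prev_time_check = self._prev_time_check
# Skip the time-based save if the trigger is disabled, the timer has not been
# started, or less than one interval has elapsed since the last time-based save.
# Note: _prev_time_check must be set to a timestamp before the first batch
# (e.g. when training starts); while it is None this check never fires.
skip_time = (
    time_interval is None
    or prev_time_check is None
    or (now - prev_time_check) < time_interval.total_seconds()
)
# skip_batch is not defined in this snippet; presumably it is the result of
# the every-N-batches check from #6146.
if skip_batch and skip_time:
    return
if not skip_time:
    # Restart the interval from this save.
    self._prev_time_check = now
...  # proceed with saving the checkpoint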
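
For illustration, the time check above can be pulled out into a small standalone helper. This is only a sketch: the TimeTrigger class, its start and should_save methods, and the injectable clock argument are hypothetical names that are not part of Lightning, and starting the timer when training begins is an assumption.

```python
import time
from datetime import timedelta
from typing import Callable, Optional


class TimeTrigger:
    """Hypothetical helper: reports when `interval` has elapsed since the last trigger."""

    def __init__(
        self,
        interval: Optional[timedelta],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval = interval
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._prev_time_check: Optional[float] = None

    def start(self) -> None:
        # Assumption: called once when training starts, so the first
        # interval is measured from that point.
        self._prev_time_check = self._clock()

    def should_save(self) -> bool:
        # Disabled trigger or timer not started: never request a save.
        if self.interval is None or self._prev_time_check is None:
            return False
        now = self._clock()
        if now - self._prev_time_check < self.interval.total_seconds():
            return False
        # Interval elapsed: restart it from now and request a save.
        self._prev_time_check = now
        return True
```

A callback could call start() when training begins and should_save() at the end of each training batch, combining the result with the batch-count condition.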

Note: the ranks will need to synchronize so that they all enter the checkpoint-saving logic together, even if their timers are slightly off.
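
The following sketch shows why the synchronization matters, using a plain list to stand in for the ranks. The function names are hypothetical, and letting rank 0 decide for everyone is one possible policy rather than something the proposal specifies. In a real distributed run the shared decision would be exchanged with a collective such as torch.distributed.broadcast instead of a list.

```python
from typing import List


def local_decisions(elapsed_per_rank: List[float], interval_s: float) -> List[bool]:
    # Each rank consults only its own timer, so ranks may disagree near the boundary.
    return [elapsed >= interval_s for elapsed in elapsed_per_rank]


def synced_decisions(elapsed_per_rank: List[float], interval_s: float) -> List[bool]:
    # Stand-in for a broadcast: rank 0 decides and every rank adopts that decision.
    rank0_decision = elapsed_per_rank[0] >= interval_s
    return [rank0_decision for _ in elapsed_per_rank]


# Two ranks whose timers differ by a few hundredths of a second around a one-hour interval.
elapsed = [3600.02, 3599.98]
print(local_decisions(elapsed, 3600.0))   # [True, False]: only rank 0 would save
print(synced_decisions(elapsed, 3600.0))  # [True, True]: all ranks save together
```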
