Support time-based checkpointing in model checkpoint callback
Motivation
After #6146 we'll have support in Lightning to checkpoint after every N training batches or after every M validation epochs. A useful addition would be checkpointing every T units of wall-clock time during the training phase (e.g. a checkpoint every hour).
Pitch
This would entail:
adding a new optional argument time_interval to the callback constructor. This is of type datetime.timedelta. For all practical purposes, it cannot be smaller than the time it takes to process a single training batch, since the check runs at batch boundaries. Checkpointing is not guaranteed to happen at exactly the specified interval, but should be close.
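As a rough sketch of the constructor change (the **kwargs stand in for the callback's existing arguments; attribute names other than time_interval are illustrative, not final):

from datetime import timedelta
from typing import Optional

class ModelCheckpoint(Callback):
    def __init__(self, time_interval: Optional[timedelta] = None, **kwargs):
        super().__init__(**kwargs)
        # e.g. time_interval=timedelta(hours=1) checkpoints roughly every hour
        self.time_interval = time_interval
        # last time.monotonic() reading; None until the first check runs
        self._prev_time_check: Optional[float] = None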
Inside on_train_batch_end, we add the following check:
# requires `import time` at the top of the module;
# `skip_batch` is the existing batch-interval skip condition
# computed earlier in this hook
now = time.monotonic()
time_interval = self.time_interval
prev_time_check = self._prev_time_check
skip_time = (
    time_interval is None
    or prev_time_check is None
    or (now - prev_time_check) < time_interval.total_seconds()
)
if skip_batch and skip_time:
    return
if not skip_time:
    self._prev_time_check = now
...  # commence with saving the checkpoint
Note: we will need synchronization between ranks so that all ranks enter the checkpoint-save logic together even if their timers are slightly off. One way to do this is sketched below.
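A minimal sketch, assuming torch.distributed is the backend (the exact mechanism is open for discussion): let rank 0 own the clock and broadcast its decision, so every rank takes the same branch regardless of local timer drift.

import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    # Rank 0's timer decides; broadcast that decision so all ranks
    # agree on whether to enter the checkpoint-save logic.
    decision = [not skip_time]  # only rank 0's value is used
    dist.broadcast_object_list(decision, src=0)
    skip_time = not decision[0]

This trades one small collective call per check for the guarantee that ranks never diverge on the save decision.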