Support time-based checkpointing in model checkpoint callback
Motivation
After #6146 we'll have support in Lightning to checkpoint after every N training batches or after every M validation epochs. A useful addition would be checkpointing every T units of wall-clock time during the training phase (e.g. a checkpoint every hour).
Pitch
This would entail:
adding a new optional argument time_interval to the callback constructor. This is of type datetime.timedelta. For all practical purposes, it cannot be smaller than the time it takes to process a single training batch, since the check runs at batch boundaries. Checkpointing is not guaranteed to happen at exactly the specified interval, but should be close.
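As a rough sketch of the constructor change (the **kwargs stand in for the callback's existing arguments; attribute names other than time_interval are illustrative, not final):

from datetime import timedelta
from typing import Optional

class ModelCheckpoint(Callback):
    def __init__(self, time_interval: Optional[timedelta] = None, **kwargs):
        super().__init__(**kwargs)
        # e.g. time_interval=timedelta(hours=1) checkpoints roughly every hour
        self.time_interval = time_interval
        # last time.monotonic() reading; None until the first check runs
        self._prev_time_check: Optional[float] = None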
Inside on_train_batch_end, we add the following check:
# requires `import time` at the top of the module;
# `skip_batch` is the existing batch-interval skip condition
# computed earlier in this hook
now = time.monotonic()
time_interval = self.time_interval
prev_time_check = self._prev_time_check
skip_time = (
    time_interval is None
    or prev_time_check is None
    or (now - prev_time_check) < time_interval.total_seconds()
)
if skip_batch and skip_time:
    return
if not skip_time:
    self._prev_time_check = now
...  # commence with saving the checkpoint
Note: we will need synchronization between ranks so that all ranks enter the checkpoint-save logic together even if their timers are slightly off. One way to do this is sketched below.
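A minimal sketch, assuming torch.distributed is the backend (the exact mechanism is open for discussion): let rank 0 own the clock and broadcast its decision, so every rank takes the same branch regardless of local timer drift.

import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    # Rank 0's timer decides; broadcast that decision so all ranks
    # agree on whether to enter the checkpoint-save logic.
    decision = [not skip_time]  # only rank 0's value is used
    dist.broadcast_object_list(decision, src=0)
    skip_time = not decision[0]

This trades one small collective call per check for the guarantee that ranks never diverge on the save decision.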