-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support synchronous saving and loading in CheckpointManager #5693
Conversation
from torch.distributed.checkpoint.metadata import STATE_DICT_TYPE | ||
|
||
# TODO(jonbolin): Import path will change | ||
from torch.distributed.checkpoint._fsspec_filesystem import FsspecReader, FsspecWriter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import path will change when the API becomes public in the upstream. @alanwaketan @yeounoh do you have any thoughts on how to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's okay. The upstream test will break our CI in the upstream, and then we can have a companion change to fix it.
from torch.distributed.checkpoint.metadata import STATE_DICT_TYPE | ||
|
||
# TODO(jonbolin): Import path will change | ||
from torch.distributed.checkpoint._fsspec_filesystem import FsspecReader, FsspecWriter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's okay. The upstream test will break our CI in the upstream, and then we can have a companion change to fix it.
152c9bf
to
4430877
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks @yeounoh and @alanwaketan for the review! I'll merge after TPU CI. |
* Support synchronous saving and loading in CheckpointManager * Use 0 to indicate no upper bound * Don't track async_queue_size * Cache tracked steps locally * Track creation time in metadata * Rename save_period to save_interval * Fix tests
…5693) * Support synchronous saving and loading in CheckpointManager * Use 0 to indicate no upper bound * Don't track async_queue_size * Cache tracked steps locally * Track creation time in metadata * Rename save_period to save_interval * Fix tests
…5693) * Support synchronous saving and loading in CheckpointManager * Use 0 to indicate no upper bound * Don't track async_queue_size * Cache tracked steps locally * Track creation time in metadata * Rename save_period to save_interval * Fix tests
…5693) * Support synchronous saving and loading in CheckpointManager * Use 0 to indicate no upper bound * Don't track async_queue_size * Cache tracked steps locally * Track creation time in metadata * Rename save_period to save_interval * Fix tests
* Support synchronous saving and loading in CheckpointManager * Use 0 to indicate no upper bound * Don't track async_queue_size * Cache tracked steps locally * Track creation time in metadata * Rename save_period to save_interval * Fix tests
* Support synchronous saving and loading in CheckpointManager * Use 0 to indicate no upper bound * Don't track async_queue_size * Cache tracked steps locally * Track creation time in metadata * Rename save_period to save_interval * Fix tests
This PR adds the initial functionality for CheckpointManager to manage synchronously taking checkpoints, restoring checkpoints, and managing how many checkpoints it tracks.