Support synchronous saving and loading in CheckpointManager #5693

Merged on Oct 13, 2023 (7 commits)

@jonb377 (Collaborator) commented Oct 10, 2023

This PR adds initial CheckpointManager functionality: taking checkpoints synchronously, restoring checkpoints, and limiting how many checkpoints are tracked.
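As a rough illustration of the three responsibilities described above, here is a minimal, self-contained sketch. It is not the torch_xla implementation: the class name `ToyCheckpointManager`, the `max_to_keep` parameter, the method names, and the JSON on-disk layout are all assumptions made for this example. Only `save_interval`, the "0 means no upper bound" convention, and the recorded creation time are taken from the commit messages in this PR.

```python
import json
import os
import shutil
import time


class ToyCheckpointManager:
    """Illustrative stand-in only; NOT the torch_xla CheckpointManager."""

    def __init__(self, path, save_interval, max_to_keep=0):
        self.path = path
        self.save_interval = save_interval
        # Assumed parameter name; 0 means no upper bound on tracked checkpoints.
        self.max_to_keep = max_to_keep
        os.makedirs(path, exist_ok=True)
        # Steps are cached locally, oldest first.
        self._tracked = []

    def _step_dir(self, step):
        return os.path.join(self.path, str(step))

    def should_save(self, step):
        return step % self.save_interval == 0

    def save(self, step, state, force=False):
        """Synchronously write a checkpoint; return True if one was taken."""
        if not (force or self.should_save(step)):
            return False
        os.makedirs(self._step_dir(step), exist_ok=True)
        with open(os.path.join(self._step_dir(step), "state.json"), "w") as f:
            # The creation time is stored alongside the state as metadata.
            json.dump({"created": time.time(), "state": state}, f)
        if step in self._tracked:
            self._tracked.remove(step)
        self._tracked.append(step)
        self._prune()
        return True

    def _prune(self):
        """Delete the oldest checkpoints once the limit is exceeded."""
        if self.max_to_keep <= 0:
            return
        while len(self._tracked) > self.max_to_keep:
            oldest = self._tracked.pop(0)
            shutil.rmtree(self._step_dir(oldest))

    def restore(self, step):
        with open(os.path.join(self._step_dir(step), "state.json")) as f:
            return json.load(f)["state"]

    def all_steps(self):
        return sorted(self._tracked)
```

With `save_interval=10` and `max_to_keep=2`, calling `save` at steps 0, 10, and 20 leaves only steps 10 and 20 on disk, and a call at step 5 is skipped unless `force=True` is passed.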

from torch.distributed.checkpoint.metadata import STATE_DICT_TYPE

# TODO(jonbolin): Import path will change
from torch.distributed.checkpoint._fsspec_filesystem import FsspecReader, FsspecWriter
Collaborator Author

The import path will change when the API becomes public upstream. @alanwaketan @yeounoh do you have any thoughts on how to handle this?

Collaborator

It's okay. The upstream change will break our CI when its tests run, and then we can land a companion change to fix it.
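For comparison, an alternative to waiting for the CI break is to try several import locations in order. The sketch below is not what this PR does (the thread settles on a companion change); the helper name `import_names` is hypothetical, and any candidate module path other than the private `_fsspec_filesystem` one quoted above would be a guess about where the API ends up.

```python
import importlib


def import_names(candidates, names):
    """Return the requested attributes from the first candidate module that has them all.

    `candidates` is a list of dotted module paths tried in order;
    `names` is a list of attribute names that must all be present.
    """
    errors = []
    for module_path in candidates:
        try:
            module = importlib.import_module(module_path)
            return tuple(getattr(module, name) for name in names)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{module_path}: {exc}")
    raise ImportError(
        "none of the candidate modules provide the requested names:\n"
        + "\n".join(errors))
```

A caller would list the expected new location first and the current private path second, for example `import_names([<new public path>, "torch.distributed.checkpoint._fsspec_filesystem"], ["FsspecReader", "FsspecWriter"])`, where the first entry is left as a placeholder because the thread does not state it.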

test/spmd/test_xla_distributed_checkpoint.py (resolved)

torch_xla/experimental/distributed_checkpoint/manager.py (outdated, resolved)
torch_xla/experimental/distributed_checkpoint/manager.py (outdated, resolved)
@yeounoh (Contributor) left a comment

LGTM

@alanwaketan (Collaborator) left a comment

LGTM.

@jonb377 (Collaborator, Author) commented Oct 12, 2023

Thanks @yeounoh and @alanwaketan for the review! I'll merge after TPU CI.

@jonb377 merged commit 2a45d0d into master on Oct 13, 2023
19 checks passed
@jonb377 deleted the jonbolin/chkpt-manager branch on October 13, 2023 00:14
zpcore pushed a commit that referenced this pull request Oct 19, 2023
* Support synchronous saving and loading in CheckpointManager

* Use 0 to indicate no upper bound

* Don't track async_queue_size

* Cache tracked steps locally

* Track creation time in metadata

* Rename save_period to save_interval

* Fix tests
ghpvnist pushed a commit to ghpvnist/xla that referenced this pull request Oct 31, 2023
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
3 participants