Support async checkpointing through CheckpointManager #5697

Merged
jonb377 merged 4 commits into master from jonbolin/async-chkpt on Oct 13, 2023

jonb377
Collaborator

@jonb377 jonb377 commented Oct 10, 2023

Support asynchronous checkpointing through the CheckpointManager interface. The state_dict is moved to CPU before the checkpoint starts, so the calling thread is unblocked once that copy completes rather than waiting for the checkpoint to be written.

This PR depends on #5693 for synchronous checkpointing functionality.
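
For readers unfamiliar with the pattern described above, here is a minimal standard-library sketch of it. This is not the code in this PR and not the torch_xla API: `AsyncCheckpointerSketch`, `save_async`, `join`, `close`, and `demo` are hypothetical names, `copy.deepcopy` stands in for the device-to-CPU transfer, and `pickle` stands in for the real checkpoint writer.

```python
import copy
import os
import pickle
import queue
import tempfile
import threading


class AsyncCheckpointerSketch:
    """Hypothetical illustration only; not the torch_xla CheckpointManager."""

    def __init__(self, directory):
        self.directory = directory
        self._queue = queue.Queue()
        # Daemon thread so a forgotten checkpointer never blocks interpreter exit.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def save_async(self, step, state_dict):
        # Snapshot on the calling thread (assumption: stands in for the
        # move to CPU), then hand the write off to the background thread.
        host_copy = copy.deepcopy(state_dict)
        self._queue.put((step, host_copy))

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                step, host_copy = item
                path = os.path.join(self.directory, f"step_{step}.pkl")
                with open(path, "wb") as f:
                    pickle.dump(host_copy, f)
            finally:
                self._queue.task_done()

    def join(self):
        # Block until every queued checkpoint has been written.
        self._queue.join()

    def close(self):
        # Ask the worker to exit and wait for it.
        self._queue.put(None)
        self._worker.join()


def demo():
    with tempfile.TemporaryDirectory() as d:
        ckpt = AsyncCheckpointerSketch(d)
        state = {"weights": [1.0, 2.0]}
        ckpt.save_async(0, state)
        # The caller keeps training; this mutation must not leak into the file.
        state["weights"][0] = 99.0
        ckpt.join()
        with open(os.path.join(d, "step_0.pkl"), "rb") as f:
            saved = pickle.load(f)
        ckpt.close()
        return saved
```

Taking the snapshot before returning is what makes it safe for the caller to keep mutating its state while the write is still in flight; the cost is one extra host-side copy per checkpoint.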

Contributor

@yeounoh yeounoh left a comment

LGTM, left some questions.

@jonb377 jonb377 force-pushed the jonbolin/async-chkpt branch 3 times, most recently from b9e1952 to 273ef2f Compare October 12, 2023 02:12
Base automatically changed from jonbolin/chkpt-manager to master October 13, 2023 00:14
Collaborator

@alanwaketan alanwaketan left a comment

LGTM.

@jonb377 jonb377 merged commit 06ba6e9 into master Oct 13, 2023
19 checks passed
@jonb377 jonb377 deleted the jonbolin/async-chkpt branch October 13, 2023 20:21
zpcore pushed a commit that referenced this pull request Oct 19, 2023
* Support async checkpointing through CheckpointManager

* Allow threads to exit when CheckpointManager is freed

* Use rank from tracked process group

* Add TODO
ghpvnist pushed a commit to ghpvnist/xla that referenced this pull request Oct 31, 2023
Collaborator Author

jonb377 commented Nov 10, 2023

cc @wz337

mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024

3 participants