[Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload #3376
Left some comments, thanks Ning!
This PR introduces a lot of code debt by copying and pasting a lot of the logic from remote_uploader into checkpoint_saver, with the only difference being that the files being uploaded are symlinks. Since remote symlinks are just text files, we need to find a way to upload symlinks using the remote_uploader before we can merge this in.
Hey clever solution! I have a bunch of comments/questions
LGTM
Approving to unblock
…ckpoint files upload (mosaicml#3376) * a * a * a * a * a * a * a * a * fix test * a * a * a * a * fix unit test * a * a * a * a * a * fix 2gpu unit test * a * a * a * a * fix doctest * a * fix test and lint * up * a * a * a * a * a * a * a * a * address comments * a * a * a * a * rerun test * add logging * remove debug comments * comments * a * cleanup * a * linter * lint * Update composer/callbacks/checkpoint_saver.py Co-authored-by: Evan Racah <evan.racah@databricks.com> * commenst * a * fix test * fix test * comments * a --------- Co-authored-by: Evan Racah <evan.racah@databricks.com>
What does this PR do?
Fixes the issue where the symlink file could be uploaded before the checkpoint files had finished uploading.
How?
[updated]: In the checkpoint saver, rank 0 (the rank that saves the symlink) all_gathers the remote checkpoint file names and starts a new process that checks whether those remote files have finished uploading by calling object_store.get_object_size. It uploads the symlink file only once all the remote files have finished uploading. This way:
Unit test
Integration test
2-node OCI:
save: test-uploader-0Tkv9O
autoresume: test-uploader-yac88U
2-node mflow:
save: l38bi-full-sweep-train-bb-1-0e-6-5-VMY2Xo
load: l38bi-full-sweep-train-bb-1-0e-6-5-TyRcPi
Daily test:
https://github.com/mosaicml/composer/actions/runs/9700144963
composer regression test:
https://github.com/databricks-mosaic/regression-testing/actions/runs/9700161085
Perf test (100 batches with a 9-batch save interval; training time varies because upload speed is unstable, but the goal is only to confirm that this change does not regress training time):
64 GPU test: 77b-bs1024-g512-res2-f60-37gSXK time: 26 minutes
64 GPU baseline: 77b-bs1024-g512-res2-f60-2VZPxB time: more than 40 minutes because of an upload delay on rank 29
If the upload on one rank fails, the run does not hang:
test-uploader-MXcoEp
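For readers who want the idea from the "How?" section in code form, here is a minimal sketch of the wait-then-upload logic. It is not the code in this PR. InMemoryObjectStore, wait_for_remote_files, upload_symlink_when_ready, the timeout and poll parameters, and the file names are all hypothetical names chosen for illustration. The sketch assumes that get_object_size raises FileNotFoundError for an object that has not been uploaded yet, and it runs the check synchronously rather than in a separate process. The all_gather of file names across ranks is left out; the list of names is simply passed in.

```python
import time
from typing import Dict, Iterable


class InMemoryObjectStore:
    """Hypothetical stand-in for a remote object store (not Composer's class)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def upload_object(self, object_name: str, data: bytes) -> None:
        self._objects[object_name] = data

    def get_object_size(self, object_name: str) -> int:
        # Assumption: a missing object raises FileNotFoundError.
        if object_name not in self._objects:
            raise FileNotFoundError(object_name)
        return len(self._objects[object_name])


def wait_for_remote_files(
    store: InMemoryObjectStore,
    remote_file_names: Iterable[str],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.01,
) -> bool:
    """Return True once every remote file exists, False if the timeout expires."""
    pending = set(remote_file_names)
    deadline = time.monotonic() + timeout_s
    while pending:
        for name in list(pending):
            try:
                store.get_object_size(name)
            except FileNotFoundError:
                continue  # not uploaded yet, check again on the next pass
            pending.discard(name)
        if not pending:
            break
        if time.monotonic() >= deadline:
            return False  # give up instead of hanging on a failed upload
        time.sleep(poll_interval_s)
    return True


def upload_symlink_when_ready(
    store: InMemoryObjectStore,
    symlink_name: str,
    target: str,
    remote_file_names: Iterable[str],
    timeout_s: float = 5.0,
) -> bool:
    """Upload the symlink (a small text file) only after all checkpoint files exist."""
    if not wait_for_remote_files(store, remote_file_names, timeout_s=timeout_s):
        return False
    store.upload_object(symlink_name, target.encode("utf-8"))
    return True
```

The timeout is what keeps the waiting side from hanging if one rank's upload never completes. How the actual change detects and reports a failed upload is not shown here.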