Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload #3376

Merged
merged 66 commits into from
Jul 8, 2024

Conversation

bigning
Copy link
Contributor

@bigning bigning commented Jun 6, 2024

What does this PR do?

Fix the symlink issues.
How?
[updated]: in the checkpoint saver, on rank-0 which saves the symlink, it all_gather the remote checkpoint file names, and start a new process to check if those remote files finish uploading by calling object_store.get_object_size. It only upload symlink file once all the remote file finish uploading. This way:

  1. it won't block the training, since it's a separate process
  2. it almost has no delay to upload the symlink file (it will sleep 30s if remote file is not there yet)

Unit test

Integration test

2-nodes OCI:

save: test-uploader-0Tkv9O
autoresume: test-uploader-yac88U

2-nodes mflow:

save: l38bi-full-sweep-train-bb-1-0e-6-5-VMY2Xo
load: l38bi-full-sweep-train-bb-1-0e-6-5-TyRcPi

Daily test:

https://github.com/mosaicml/composer/actions/runs/9700144963

composer regression test:

https://github.com/databricks-mosaic/regression-testing/actions/runs/9700161085

Perf test (100 batches with 9 batch save interval. The training time varies because of unstable uploading speed, but just want to make sure test didn't regress training time)

64 gpu test: 77b-bs1024-g512-res2-f60-37gSXK time: 26 minutes
64 GPU baseline: 77b-bs1024-g512-res2-f60-2VZPxB time: more than 40 minutes because rank 29 uploading delay

In case 1 rank upload fails, it won't hang:

test-uploader-MXcoEp

@bigning bigning changed the title [WIP] Checkpoint saver [Checkpoint] Fix symlink issue where symlink updated before checkpoint files upload Jun 13, 2024
@bigning bigning changed the title [Checkpoint] Fix symlink issue where symlink updated before checkpoint files upload [Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload Jun 13, 2024
Copy link
Contributor

@b-chu b-chu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, thanks Ning!

composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/utils/file_helpers.py Outdated Show resolved Hide resolved
composer/utils/file_helpers.py Show resolved Hide resolved
composer/utils/object_store/utils.py Outdated Show resolved Hide resolved
tests/utils/test_remote_uploader.py Outdated Show resolved Hide resolved
Copy link
Contributor

@eracah eracah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces a lot of code debt by copying and pasting a lot of the logic from remote_uploader into checkpoint_saver with the only difference being that the files being uploaded are symlinks. Since remote symlinks are just text files we need to find a way to upload symlinks using the remote_uploader before we can merge this in.

composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
@bigning bigning requested a review from eracah June 26, 2024 19:16
Copy link
Contributor

@eracah eracah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey clever solution! I have a bunch of comments/questions

composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
bigning and others added 3 commits June 28, 2024 10:35
Co-authored-by: Evan Racah <evan.racah@databricks.com>
@bigning bigning requested a review from eracah June 28, 2024 18:13
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
composer/utils/remote_uploader.py Outdated Show resolved Hide resolved
@bigning bigning requested a review from eracah July 2, 2024 05:19
Copy link
Contributor

@eracah eracah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bigning bigning requested a review from a team July 8, 2024 16:06
Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock

@bigning bigning merged commit 84fa1fe into dev Jul 8, 2024
14 checks passed
@bigning bigning deleted the checkpoint_saver branch July 8, 2024 16:48
mvpatel2000 pushed a commit to mvpatel2000/composer that referenced this pull request Jul 21, 2024
…ckpoint files upload (mosaicml#3376)

* a

* a

* a

* a

* a

* a

* a

* a

* fix test

* a

* a

* a

* a

* fix unit test

* a

* a

* a

* a

* a

* fix 2gpu unit test

* a

* a

* a

* a

* fix doctest

* a

* fix test and lint

* up

* a

* a

* a

* a

* a

* a

* a

* a

* address comments

* a

* a

* a

* a

* rerun test

* add logging

* remove debug comments

* comments

* a

* cleanup

* a

* linter

* lint

* Update composer/callbacks/checkpoint_saver.py

Co-authored-by: Evan Racah <evan.racah@databricks.com>

* commenst

* a

* fix test

* fix test

* comments

* a

---------

Co-authored-by: Evan Racah <evan.racah@databricks.com>
mvpatel2000 pushed a commit that referenced this pull request Jul 21, 2024
…ckpoint files upload (#3376)

* a

* a

* a

* a

* a

* a

* a

* a

* fix test

* a

* a

* a

* a

* fix unit test

* a

* a

* a

* a

* a

* fix 2gpu unit test

* a

* a

* a

* a

* fix doctest

* a

* fix test and lint

* up

* a

* a

* a

* a

* a

* a

* a

* a

* address comments

* a

* a

* a

* a

* rerun test

* add logging

* remove debug comments

* comments

* a

* cleanup

* a

* linter

* lint

* Update composer/callbacks/checkpoint_saver.py

Co-authored-by: Evan Racah <evan.racah@databricks.com>

* commenst

* a

* fix test

* fix test

* comments

* a

---------

Co-authored-by: Evan Racah <evan.racah@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants