Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeds up copying of necessary artifact files with SaveRestoreConnector #9682

Merged
merged 16 commits into from
Jul 17, 2024

Conversation

terrykong
Copy link
Collaborator

What does this PR do ?

Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models in the >100GB, this
led to timeouts since only rank=0 did this work, while other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10min).

Revision of #9299 that keeps most of the core logic surrounding tempdir creation

terrykong and others added 9 commits July 5, 2024 17:37
Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models in the >100GB, this
led to timeouts since only rank=0 did this work, while other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10min).

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@github-actions github-actions bot added core Changes to NeMo Core NLP labels Jul 11, 2024
terrykong and others added 2 commits July 11, 2024 03:13
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
titu1994
titu1994 previously approved these changes Jul 11, 2024
Copy link
Collaborator

@titu1994 titu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks safe to change. Once tests pass, can merge

@terrykong
Copy link
Collaborator Author

I'm looking into the test failures. Very bizarre since running the complete suite locally hangs and I cannot reproduce the failures by running the failed cases individually

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
@terrykong
Copy link
Collaborator Author

terrykong commented Jul 15, 2024

I figured out the test failure issue. It turned out in my unit test I os.chdir into a temporary directory and don't os.chdir back into the CWD. This causes failures for tests in different scopes. So I wasn't seeing the issue when running just the failed tests by themselves, but instead saw it only when running the entire suite together.

@terrykong terrykong requested a review from titu1994 July 15, 2024 22:16
Copy link
Collaborator

@titu1994 titu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good !

@terrykong terrykong merged commit 649ad1f into main Jul 17, 2024
210 checks passed
@terrykong terrykong deleted the terryk/efficient-ckpt-fix-simpler branch July 17, 2024 18:42
ertkonuk pushed a commit that referenced this pull request Jul 19, 2024
…or (#9682)

* Speeds up copying of neccesary artifact files with SaveRestoreConnector

Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models in the >100GB, this
led to timeouts since only rank=0 did this work, while other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10min).

Signed-off-by: Terry Kong <terryk@nvidia.com>

* cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

* restoring logic to previous tempdir logic

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nlp overrides too

Signed-off-by: Terry Kong <terryk@nvidia.com>

* respect return_config

Signed-off-by: Terry Kong <terryk@nvidia.com>

* some unit tests

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nodbg

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

* correct typing

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Fixes directory issue

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Tugrul Konuk <ertkonuk@gmail.com>
tonyjie pushed a commit to tonyjie/NeMo that referenced this pull request Jul 24, 2024
…or (NVIDIA#9682)

* Speeds up copying of neccesary artifact files with SaveRestoreConnector

Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models in the >100GB, this
led to timeouts since only rank=0 did this work, while other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10min).

Signed-off-by: Terry Kong <terryk@nvidia.com>

* cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

* restoring logic to previous tempdir logic

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nlp overrides too

Signed-off-by: Terry Kong <terryk@nvidia.com>

* respect return_config

Signed-off-by: Terry Kong <terryk@nvidia.com>

* some unit tests

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nodbg

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

* correct typing

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Fixes directory issue

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
akoumpa pushed a commit that referenced this pull request Jul 25, 2024
…or (#9682)

* Speeds up copying of neccesary artifact files with SaveRestoreConnector

Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models in the >100GB, this
led to timeouts since only rank=0 did this work, while other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10min).

Signed-off-by: Terry Kong <terryk@nvidia.com>

* cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

* restoring logic to previous tempdir logic

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nlp overrides too

Signed-off-by: Terry Kong <terryk@nvidia.com>

* respect return_config

Signed-off-by: Terry Kong <terryk@nvidia.com>

* some unit tests

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nodbg

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

* correct typing

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Fixes directory issue

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
malay-nagda pushed a commit to malay-nagda/NeMo that referenced this pull request Jul 26, 2024
…or (NVIDIA#9682)

* Speeds up copying of neccesary artifact files with SaveRestoreConnector

Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models in the >100GB, this
led to timeouts since only rank=0 did this work, while other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10min).

Signed-off-by: Terry Kong <terryk@nvidia.com>

* cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

* restoring logic to previous tempdir logic

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nlp overrides too

Signed-off-by: Terry Kong <terryk@nvidia.com>

* respect return_config

Signed-off-by: Terry Kong <terryk@nvidia.com>

* some unit tests

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nodbg

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

* correct typing

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Fixes directory issue

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Malay Nagda <malayn@malayn-mlt.client.nvidia.com>
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
…or (#9682)

* Speeds up copying of neccesary artifact files with SaveRestoreConnector

Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models in the >100GB, this
led to timeouts since only rank=0 did this work, while other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10min).

Signed-off-by: Terry Kong <terryk@nvidia.com>

* cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

* restoring logic to previous tempdir logic

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nlp overrides too

Signed-off-by: Terry Kong <terryk@nvidia.com>

* respect return_config

Signed-off-by: Terry Kong <terryk@nvidia.com>

* some unit tests

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nodbg

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

* correct typing

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Fixes directory issue

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
…or (NVIDIA#9682)

* Speeds up copying of neccesary artifact files with SaveRestoreConnector

Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models in the >100GB, this
led to timeouts since only rank=0 did this work, while other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10min).

Signed-off-by: Terry Kong <terryk@nvidia.com>

* cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

* restoring logic to previous tempdir logic

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nlp overrides too

Signed-off-by: Terry Kong <terryk@nvidia.com>

* respect return_config

Signed-off-by: Terry Kong <terryk@nvidia.com>

* some unit tests

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nodbg

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

* correct typing

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Fixes directory issue

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Changes to NeMo Core NLP Run CICD
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants