Speeds up copying of necessary artifact files with SaveRestoreConnector #9682
Previously, the SaveRestoreConnector would copy and untar entire checkpoints just to copy out a tokenizer. For models larger than 100GB, this led to timeouts because only rank 0 did this work, while the other ranks moved on and waited at an all-gather barrier (observed NCCL timeout at 10 min).
Signed-off-by: Terry Kong <terryk@nvidia.com>
Overall looks safe to change. Once tests pass, can merge
I'm looking into the test failures. Very bizarre, since running the complete suite locally hangs and I cannot reproduce the failures by running the failed cases individually.
I figured out the test failure issue. It turned out in my unit test I |
Looks good!
…or (#9682)
* Speeds up copying of necessary artifact files with SaveRestoreConnector
  Previously, the SaveRestoreConnector would copy and untar entire checkpoints just to copy out a tokenizer. For models larger than 100GB, this led to timeouts because only rank 0 did this work, while the other ranks moved on and waited at an all-gather barrier (observed NCCL timeout at 10 min).
* cleanup
* black formatting
* Apply isort and black reformatting
* restoring logic to previous tempdir logic
* nlp overrides too
* respect return_config
* some unit tests
* nodbg
* Apply isort and black reformatting
* correct typing
* Fixes directory issue
* Apply isort and black reformatting
---------
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Tugrul Konuk <ertkonuk@gmail.com>
What does this PR do?
Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models larger than 100GB, this
led to timeouts because only rank 0 did this work, while the other ranks moved
on and waited at an all-gather barrier (observed NCCL timeout at 10 min).
Revision of #9299 that keeps most of the core logic surrounding tempdir creation.
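
To illustrate the idea described above, here is a minimal sketch of pulling a single artifact out of a tar archive without unpacking everything else. It is not the code from this PR: the helper name `extract_member` and the file names `tokenizer.model` and `model_weights.ckpt` are hypothetical, and it uses only Python's standard `tarfile` module rather than anything from SaveRestoreConnector.

```python
import io
import os
import shutil
import tarfile
import tempfile


def extract_member(archive_path: str, member_name: str, out_dir: str) -> str:
    """Copy one regular file out of a tar archive, leaving the rest packed.

    Hypothetical helper for illustration only.
    """
    with tarfile.open(archive_path, "r:*") as tar:
        # Some archives store members with a leading "./", so try both forms.
        for candidate in (member_name, f"./{member_name}"):
            try:
                info = tar.getmember(candidate)
            except KeyError:
                continue
            src = tar.extractfile(info)
            if src is None:
                raise ValueError(f"{candidate} is not a regular file")
            dest = os.path.join(out_dir, os.path.basename(member_name))
            # Stream only this member's bytes to disk.
            with src, open(dest, "wb") as dst:
                shutil.copyfileobj(src, dst)
            return dest
    raise FileNotFoundError(f"{member_name} not found in {archive_path}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "checkpoint.tar")
        # Build a toy archive: one large "weights" file and one small tokenizer.
        with tarfile.open(archive, "w") as tar:
            for name, data in (
                ("./model_weights.ckpt", b"\0" * 1_000_000),
                ("./tokenizer.model", b"toy tokenizer"),
            ):
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

        out_dir = os.path.join(tmp, "artifacts")
        os.makedirs(out_dir)
        path = extract_member(archive, "tokenizer.model", out_dir)
        print(os.listdir(out_dir), os.path.getsize(path))
```

For an uncompressed archive, `tarfile` can seek past the large members while reading headers, so the cost is dominated by the size of the one file copied out. For a compressed archive the stream still has to be decompressed up to the requested member, so the saving is smaller there.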