-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[NeMo-UX] checkpointing improvements (#10241)
* save model weights and artifacts to separate directories Signed-off-by: ashors1 <ashors@nvidia.com> * add save_artifacts_on_train_end Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * do not save optimizer states in final checkpoint Signed-off-by: ashors1 <ashors@nvidia.com> * WIP support for saving only last k optimizer states Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * minor cleanup Signed-off-by: ashors1 <ashors@nvidia.com> * Revert support for saving last k optimizer states. This will be addressed in a subsequent PR. * use storage_options to determine when to skip saving optimizer states Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * fix variable names, make checkpoint load work when optimizer states don't exist in the checkpoint Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * FSDP updates, provide option to save optimizer states on_train_end Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * simplify implementation, remove save_best_model option Signed-off-by: ashors1 <ashors@nvidia.com> * update default value of ckpt_include_optimizer for fsdp Signed-off-by: ashors1 <ashors@nvidia.com> * remove unused imports Signed-off-by: ashors1 <ashors@nvidia.com> * remove unused import Signed-off-by: ashors1 <ashors@nvidia.com> * cleanup Signed-off-by: ashors1 <ashors@nvidia.com> * make storage_options optional again Signed-off-by: ashors1 <ashors@nvidia.com> * fix failing tests Signed-off-by: ashors1 <ashors@nvidia.com> * address some comments Signed-off-by: ashors1 <ashors@nvidia.com> * use save_weights_only to determine whether to save optimizer states Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * add some comments Signed-off-by: ashors1 <ashors@nvidia.com> * fix tests Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * fixes Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * remove unnecessary line Signed-off-by: ashors1 <ashors@nvidia.com> --------- Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: ashors1 <ashors1@users.noreply.github.com> Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
- Loading branch information
Showing
9 changed files
with
93 additions
and
60 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters