Add option to resume from specific path in AutoResume #10373
Merged
Signed-off-by: Hemil Desai <hemild@nvidia.com>
ashors1
approved these changes
Sep 6, 2024
Thanks for adding this!
ssh-meister
added a commit
that referenced
this pull request
Sep 9, 2024
* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3396356 ! (#10353) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com> * [NeMo-UX] Turn on mcore performance optimizations (#10209) * expose TP overlap Signed-off-by: Jieming Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * add tp overlap recipes Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * turn on pipeline parallel overlap Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * refactor Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * Update base.py Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> * Update megatron_parallel.py Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> * remove env var Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * add optimization config Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * fix typo Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * refactor into megatron parallel setup Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * refactor Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * fix config ordering, add wgrad deferral Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * cleanup Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * use config Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and 
black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * clean Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * enable wgrad defferal Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * add grad bucket size Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * move everthing into a callback Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * cleanup Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * fix imports Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * move userbuffer init Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * cleanup Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * fix VP Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * address comments Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * add gradient accum guard Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * Update base.py Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> * address comments Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * address comments Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> * Apply isort and black reformatting Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> --------- Signed-off-by: Jieming Zhang <jiemingz@nvidia.com> Signed-off-by: JimmyZhang12 
<JimmyZhang12@users.noreply.github.com> Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Co-authored-by: Jieming Zhang <jiemingz@nvidia.com> Co-authored-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> * [NeMo-UX] checkpointing improvements (#10241) * save model weights and artifacts to separate directories Signed-off-by: ashors1 <ashors@nvidia.com> * add save_artifacts_on_train_end Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * do not save optimizer states in final checkpoint Signed-off-by: ashors1 <ashors@nvidia.com> * WIP support for saving only last k optimizer states Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * minor cleanup Signed-off-by: ashors1 <ashors@nvidia.com> * Revert support for saving last k optimizer states. This will be addressed in a subsequent PR. 
* use storage_options to determine when to skip saving optimizer states Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * fix variable names, make checkpoint load work when optimizer states don't exist in the checkpoint Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * FSDP updates, provide option to save optimizer states on_train_end Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * simplify implementation, remove save_best_model option Signed-off-by: ashors1 <ashors@nvidia.com> * update default value of ckpt_include_optimizer for fsdp Signed-off-by: ashors1 <ashors@nvidia.com> * remove unused imports Signed-off-by: ashors1 <ashors@nvidia.com> * remove unused import Signed-off-by: ashors1 <ashors@nvidia.com> * cleanup Signed-off-by: ashors1 <ashors@nvidia.com> * make storage_options optional again Signed-off-by: ashors1 <ashors@nvidia.com> * fix failing tests Signed-off-by: ashors1 <ashors@nvidia.com> * address some comments Signed-off-by: ashors1 <ashors@nvidia.com> * use save_weights_only to determine whether to save optimizer states Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * add some comments Signed-off-by: ashors1 <ashors@nvidia.com> * fix tests Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * fixes Signed-off-by: ashors1 <ashors@nvidia.com> * Apply isort and black reformatting Signed-off-by: ashors1 <ashors1@users.noreply.github.com> * remove unnecessary line Signed-off-by: ashors1 <ashors@nvidia.com> --------- 
Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: ashors1 <ashors1@users.noreply.github.com> Co-authored-by: ashors1 <ashors1@users.noreply.github.com> * [Nemo Unit Tests] Split CPU unit tests (#10365) * Split CPU unit tests * Split CPU unit tests * Fix:Run pytest in specific paths * Fix:Run pytest in specific paths * Fix:Run pytest in specific paths * ci: Fix checkout of secrets detector (#10381) * ci: Fix checkout of secrets detector Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * f Signed-off-by: Oliver Koenig <okoenig@nvidia.com> --------- Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * only log consumed samples during training (#10371) Signed-off-by: ashors1 <ashors@nvidia.com> * Alit/mamba 2 0 migration (#10338) * [NeMo-UX] Checkpointing fixes (#10376) * remove save_best_model from default logger Signed-off-by: ashors1 <ashors@nvidia.com> * fix broken checkpoint restore Signed-off-by: ashors1 <ashors@nvidia.com> * fix fsdp Signed-off-by: ashors1 <ashors@nvidia.com> * rename weights path to avoid confusion Signed-off-by: ashors1 <ashors@nvidia.com> * Revert "rename weights path to avoid confusion". We'll add this in a separate PR This reverts commit 72bae8b. 
--------- Signed-off-by: ashors1 <ashors@nvidia.com> * add auto configurator to NeMo (#10270) * add base configs Signed-off-by: dimapihtar <dpihtar@gmail.com> * add auto configurator functionality Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * add runner Signed-off-by: dimapihtar <dpihtar@gmail.com> * add end-to-end example for auto configurator Signed-off-by: dimapihtar <dpihtar@gmail.com> * add unit tests for auto configurator Signed-off-by: dimapihtar <dpihtar@gmail.com> * add GPT configs Signed-off-by: dimapihtar <dpihtar@gmail.com> * add GPT configs Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * switch to dataclass Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * switch to dataclass Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * fix dataclasses usage Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * remove unused imports Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove extra function Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix docstring style Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * take Config object as input for model Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * add nemotron support Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * remove 
search_config.py Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * move configs creation to Basic class Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * move to common basic class Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * rename main config Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove base configs for models Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * Apply isort and black reformatting Signed-off-by: artbataev <artbataev@users.noreply.github.com> * change auto conf functionality Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * fix docstring Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * remove unused imports Signed-off-by: dimapihtar <dpihtar@gmail.com> * add changes Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove activations_checkpoint_num_layers Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove gbs from config Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix logs Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * fix performance calculation Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix end-to-end example Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * fix model config Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black 
reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * minor changes Signed-off-by: dimapihtar <dpihtar@gmail.com> * minor changes Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * fix unit tests Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> * add README Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix README Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix README Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix readme Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix readme Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove extra arg Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove unused imports Signed-off-by: dimapihtar <dpihtar@gmail.com> * add nemo-run installation Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix unit tests Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix unit tests Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> Signed-off-by: artbataev <artbataev@users.noreply.github.com> Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: artbataev <artbataev@users.noreply.github.com> * fix mixtraltopk (#10366) Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Co-authored-by: Marc Romeyn <mromeijn@nvidia.com> * ci: Fix release tag (#10367) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * Akoumparouli/nemo ux tokenizer fix (#10351) * save tokenizer to disk Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Track Hf tokenizer assets Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * raise exception if dst file exists Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * minor Signed-off-by: Alexandros 
Koumparoulis <akoumparouli@nvidia.com> * remove print Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * add tokenizercontext Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Add TokenizerContext Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * restore tokenizer from separate dir Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * update artifact __init__.py Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * TokenizerContext connector Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * bugix on_import_ckpt Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * rm code Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Drop tokenizercontext Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * drop tokenizer load from tokenizercontext Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * undo Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * undo Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Move to util function Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * use save_hf_tokenizer_assets Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Apply isort and black reformatting Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> * add tokenizer restoration in resume.py Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * bot fixes Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * rm Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * fix Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * wrap tokenizer restoration in try/catch Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * load_artifacts Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * param fix Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * fix Signed-off-by: Alexandros 
Koumparoulis <akoumparouli@nvidia.com> * more fix Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * lazy import tensorboard Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * move code out of file context manager Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Allow skippable artifacts Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Apply isort and black reformatting Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> * rebase fix Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * checkpoint structure change update Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Apply isort and black reformatting Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> Co-authored-by: akoumpa <akoumpa@users.noreply.github.com> * Add option to resume from specific path in AutoResume (#10373) * Add option to resume from specific path in AutoResume Signed-off-by: Hemil Desai <hemild@nvidia.com> * Fix path Signed-off-by: Hemil Desai <hemild@nvidia.com> --------- Signed-off-by: Hemil Desai <hemild@nvidia.com> * ci: Cleanup of release-freeze automation (#10392) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Toggle pre-release (#10394) * ci: Toggle pre-release Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * f Signed-off-by: Oliver Koenig <okoenig@nvidia.com> --------- Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Toggle pre-release (#10395) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Toggle pre-release (#10396) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Automate pre-release (#10397) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * Akoumparouli/nemo ux validate dataset asset accessibility (#10309) * Add validate_dataset_asset_accessibility Signed-off-by: Alexandros Koumparoulis 
<akoumparouli@nvidia.com> * Add CI tests for validate_dataset_asset_accessibility Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Apply isort and black reformatting Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> * fix Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * fix for zipped lists Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * Apply isort and black reformatting Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> * fix Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> --------- Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> Co-authored-by: akoumpa <akoumpa@users.noreply.github.com> * [🤠]: Howdy folks, let's bump NeMo `2.1.0rc0` ! (#10399) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: ko3n1g <16716991+ko3n1g@users.noreply.github.com> * ci: Update baseline (#10400) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci(chore): Minor change (#10401) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Swap merge/cherry-pick order (#10389) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Fix release tag (#10402) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * Ko3n1g/ci/fix release workflow 2 (#10403) * ci: Improve release workflow Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Fix cherry-picking Signed-off-by: Oliver Koenig <okoenig@nvidia.com> --------- Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Send Slack alert on failed cherry pick (#10404) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Allow concurrent docker system prune (#10405) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Use PAT for cherry-picking (#10406) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * Alit/mamba ux cicd (#10370) * add mamba init * more ssm * add 370m * add hybrid * fix issue * integrate model and tokenizer 
config for ssm * add all mamba configs * modify state re pattern * revert gpt stuff * remove SSM class and training script * Apply isort and black reformatting Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com> * remove faulty export * add script to test * Apply isort and black reformatting Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com> * some recent fixes * Apply isort and black reformatting Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com> * test script tp/pp1 * Apply isort and black reformatting Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com> * add cicd * include MLM mamba dist ckpt commit * add license head and address more comments * Apply isort and black reformatting Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com> * add guard * remove guard from TransformerConfig * update scripts * Apply isort and black reformatting Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com> --------- Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com> Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com> Co-authored-by: Ali Taghibakhshi <ataghibakhsh@login-eos01.eos.clusters.nvidia.com> Co-authored-by: JRD971000 <JRD971000@users.noreply.github.com> Co-authored-by: oliver könig <okoenig@nvidia.com> * ci: Allow default token to write workflows (#10407) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: More permissions for cherry-pick automation (#10409) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Overhaul cherry-pick workflow (#10410) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Ignore failures on cherry-picking (#10411) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Minor change (#10412) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Fix cherry-pick config (#10413) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Minor change (#10414) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Minor change (#10415) 
Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Remove dead code (#10416) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * Ko3n1g/ci/test cherry picking 2 (#10417) * ci: Cherrypick continue on error Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Fix cherry pick branch Signed-off-by: Oliver Koenig <okoenig@nvidia.com> --------- Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Small test (#10419) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * ci: Small fix (#10420) Signed-off-by: Oliver Koenig <okoenig@nvidia.com> * [NeMo-UX] Integrating CLI (#10300) * Adding nemo-run to requirements Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Updating nemo-run entrypoint inside setup.py Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Remove nemo-run from requirements until we have a pypi package Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Update entrypoint naming Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Setting up cli recipe for llama3-8b Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Move AutoTokenizer import inline for starcoder Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Move AutoTokenizer import inline for starcoder2 Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Use target for factories inside llama3_8b Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Update other recipes Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Fix some bugs in the recipes Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Adding some examples Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Adding repl example Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Starting to add a notebook example as well Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Fix wrong imports Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Apply isort and black reformatting Signed-off-by: pre-commit-ci[bot] 
<pre-commit-ci[bot]@users.noreply.github.com> * Fix wrong imports Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> * Fix typo + add script with default executor Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> * Add nemo-run to Dockerfile.ci Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Adding copyright to recipes Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> * Adding guides to recipes dir Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Adding hatchling to Dockerfile.ci Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Move install to different line Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * fix install Signed-off-by: Hemil Desai <hemild@nvidia.com> * Move llama3_pretraining to scripts for now Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Remove img folder & use images from release instead Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> * Updating default of num_nodes in all recipes Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> * Adding tests for all recipes Signed-off-by: Marc Romeijn <mromeijn@nvidia.com> * ddAing docstrings Signed-off-by: Marc Romeijn <mromeijn@nvidia.com> * Apply isort and black reformatting Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> * Fix failing tests inside test_mixtral_8x7b_64k Signed-off-by: Marc Romeijn <mromeijn@nvidia.com> * Rename fabric to _fabric to avoid name collision with package fabric Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> * add rename comment Signed-off-by: 
Alexandros Koumparoulis <akoumparouli@nvidia.com> --------- Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> Signed-off-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Marc Romeijn <mromeijn@nvidia.com> Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com> Co-authored-by: Hemil Desai <hemild@nvidia.com> Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Jieming Zhang <jiemingz@nvidia.com> Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com> Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: ashors1 <ashors1@users.noreply.github.com> Signed-off-by: Oliver Koenig <okoenig@nvidia.com> Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> Signed-off-by: artbataev <artbataev@users.noreply.github.com> Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Signed-off-by: akoumpa <akoumpa@users.noreply.github.com> Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com> Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com> Signed-off-by: Marc Romeyn <mromeijn@nvidia.com> Signed-off-by: pre-commit-ci[bot] 
<pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com> Signed-off-by: Marc Romeijn <mromeijn@nvidia.com> Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: oliver könig <okoenig@nvidia.com> Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com> Co-authored-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Co-authored-by: Jieming Zhang <jiemingz@nvidia.com> Co-authored-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com> Co-authored-by: Anna Shors <71393111+ashors1@users.noreply.github.com> Co-authored-by: ashors1 <ashors1@users.noreply.github.com> Co-authored-by: Pablo Garay <palenq@gmail.com> Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: artbataev <artbataev@users.noreply.github.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Marc Romeyn <mromeijn@nvidia.com> Co-authored-by: akoumpa <akoumpa@users.noreply.github.com> Co-authored-by: Hemil Desai <hemild@nvidia.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: ko3n1g <16716991+ko3n1g@users.noreply.github.com> Co-authored-by: Ali Taghibakhshi <ataghibakhsh@login-eos01.eos.clusters.nvidia.com> Co-authored-by: JRD971000 <JRD971000@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com> Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
adityavavre
pushed a commit
to adityavavre/NeMo
that referenced
this pull request
Sep 15, 2024
* Add option to resume from specific path in AutoResume Signed-off-by: Hemil Desai <hemild@nvidia.com> * Fix path Signed-off-by: Hemil Desai <hemild@nvidia.com> --------- Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: adityavavre <aditya.vavre@gmail.com>
monica-sekoyan
pushed a commit
that referenced
this pull request
Oct 14, 2024
* Add option to resume from specific path in AutoResume Signed-off-by: Hemil Desai <hemild@nvidia.com> * Fix path Signed-off-by: Hemil Desai <hemild@nvidia.com> --------- Signed-off-by: Hemil Desai <hemild@nvidia.com>
tomlifu
pushed a commit
to tomlifu/NeMo
that referenced
this pull request
Oct 25, 2024
* Add option to resume from specific path in AutoResume Signed-off-by: Hemil Desai <hemild@nvidia.com> * Fix path Signed-off-by: Hemil Desai <hemild@nvidia.com> --------- Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
tomlifu
pushed a commit
to tomlifu/NeMo
that referenced
this pull request
Oct 25, 2024
* Add option to resume from specific path in AutoResume Signed-off-by: Hemil Desai <hemild@nvidia.com> * Fix path Signed-off-by: Hemil Desai <hemild@nvidia.com> --------- Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
hainan-xv
pushed a commit
to hainan-xv/NeMo
that referenced
this pull request
Nov 5, 2024
* Add option to resume from specific path in AutoResume Signed-off-by: Hemil Desai <hemild@nvidia.com> * Fix path Signed-off-by: Hemil Desai <hemild@nvidia.com> --------- Signed-off-by: Hemil Desai <hemild@nvidia.com> Signed-off-by: Hainan Xu <hainanx@nvidia.com>
What does this PR do?
Add a one-line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
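The PR leaves this snippet unfilled. As a rough illustration of what "resume from a specific path" usually means, the sketch below gives an explicit checkpoint path priority over automatic discovery of the latest checkpoint. Everything in it is an assumption made for illustration: `ResumeSketch`, `checkpoint_dir`, `resume_from_path`, and `resolve` are hypothetical names, and the `*.ckpt` pattern is arbitrary. It is not NeMo's `AutoResume` implementation or signature.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ResumeSketch:
    """Hypothetical stand-in for a resume helper; NOT NeMo's AutoResume class."""

    checkpoint_dir: Path                     # directory scanned for the latest checkpoint
    resume_from_path: Optional[Path] = None  # explicit checkpoint to resume from, if any

    def resolve(self) -> Optional[Path]:
        """Return the checkpoint to resume from, or None if there is nothing to resume."""
        # An explicitly requested path takes priority over automatic discovery.
        if self.resume_from_path is not None:
            if not self.resume_from_path.exists():
                raise FileNotFoundError(f"No checkpoint at {self.resume_from_path}")
            return self.resume_from_path
        # Otherwise fall back to the most recently modified checkpoint, if one exists.
        candidates = sorted(
            self.checkpoint_dir.glob("*.ckpt"),
            key=lambda p: p.stat().st_mtime,
        )
        return candidates[-1] if candidates else None
```

With this shape, a caller who leaves `resume_from_path` unset keeps the automatic behaviour, while a caller who sets it gets exactly that checkpoint or an immediate error if the path is missing.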
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information