use torch scaled_dot_product_attention #1

Draft
wants to merge 466 commits into base: main

Conversation

WoodieDudy
Owner

No description provided.

github-actions bot and others added 26 commits July 12, 2024 16:56
…IDIA#9715)

* Allow non-strict load



* Point to non-strict load MCore branch



* Avoid module level StrictHandling



* Use MCore fork



* Update to MCore fix



* Restore backward compatibility



* Update flag defaults



* Update MCore tag



* Update PyT Dist interface



* Update to latest core_r0.8.0



---------

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
* fix legacy ds padding bug

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>

* avoid code repetition

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fix typo

Signed-off-by: dimapihtar <dpihtar@gmail.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>
Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com>
…variety of tensors - second try (NVIDIA#9671)

* enables default data step in megatron parallel to operate on a wider variety of tensors coming out of the dataloader

Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

* handles the case where a batch is empty

Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: jomitchellnv <jomitchellnv@users.noreply.github.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

* Allows the default data step to operate on more types
than just dictionaries

Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: jomitchellnv <jomitchellnv@users.noreply.github.com>

---------

Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
Signed-off-by: jomitchellnv <jomitchellnv@users.noreply.github.com>
Co-authored-by: jomitchellnv <jomitchellnv@users.noreply.github.com>
Co-authored-by: John St. John <jstjohn@users.noreply.github.com>
…A#9647)

* Fix when optimizers are setup for PEFT

* Apply isort and black reformatting



* Init DDP inside PEFT

* Apply isort and black reformatting



* Some fixes, loss seems to become nan with peft for some reason

* Apply isort and black reformatting



* Loss goes down on fp32

* Apply isort and black reformatting



* Simplifying FNMixin

* Apply isort and black reformatting



* Fix bug with new checkpoint-io

* Apply isort and black reformatting



* Fix failing test: test_peft_on_train_epoch_start_with_adapter

* Apply isort and black reformatting



---------

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Co-authored-by: Marc Romeyn <mromeijn@nvidia.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: ashors1 <ashors@nvidia.com>
* refactor: README
* refactor: Use new README in `setup.py`

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
* Remove mask if use fusion mask

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hsiehjackson <hsiehjackson@users.noreply.github.com>

---------

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: hsiehjackson <hsiehjackson@users.noreply.github.com>
Co-authored-by: hsiehjackson <hsiehjackson@users.noreply.github.com>
…DIA#9690) (NVIDIA#9694)

* Move tensorstore import inline

* Moving AsyncFinalizableCheckpointIO import inline

* Wrap AsyncCompatibleCheckpointIO in try/catch inside pl.py

* Moving gpt_layer_specs import inline

* Apply isort and black reformatting



---------

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Co-authored-by: Marc Romeyn <mromeijn@nvidia.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>
* add container

* modify tutorial

* modify tutorial

* modify tutorial

---------

Co-authored-by: Ali Taghibakhshi <ataghibakhsh@login-eos01.eos.clusters.nvidia.com>
Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com>
…#9650) (NVIDIA#9691)

* Nemotron export - fixing megatron_export.py  (NVIDIA#9625)

* Nemotron ONNX export fixed



* Cleanup



* Addressing code review comments



---------




* Including all trainable-params in a PEFT-checkpoint

* Apply isort and black reformatting



* Small fixes to make model-importer work

* Fixing failing tests

---------

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: Marc Romeyn <mromeijn@nvidia.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: ashors1 <ashors@nvidia.com>
* [NeMo-UX] Make TE and Apex dependencies optional (NVIDIA#9550)

* Provide a pure pytorch/jit path to avoid required dependency on TE and Apex

Signed-off-by: ashors1 <ashors@nvidia.com>

* add missing file

Signed-off-by: ashors1 <ashors@nvidia.com>

* add minimal gpt pretraining example

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix pre-training datamodule initialization

Signed-off-by: ashors1 <ashors@nvidia.com>

* add non-te/non-apex test

Signed-off-by: ashors1 <ashors@nvidia.com>

* add comment to pretraining script

Signed-off-by: ashors1 <ashors@nvidia.com>

* use microbatch calculator from mcore

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* fix nemo 2 test name

Signed-off-by: ashors1 <ashors@nvidia.com>

* update Mcore commit for CI

Signed-off-by: ashors1 <ashors@nvidia.com>

* replace apex microbatch calculator with megatron's in more places

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* fix missing import

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix typo

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix missed apex import

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: ashors1 <ashors@nvidia.com>

* move imports

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: ashors1 <ashors@nvidia.com>

* move imports

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* add types to command-line args

Signed-off-by: ashors1 <ashors@nvidia.com>

* bug fix

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix path

Signed-off-by: ashors1 <ashors@nvidia.com>

* Disable distributed optimizer in nemo 2.0 test

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* fix optimizer config

Signed-off-by: ashors1 <ashors@nvidia.com>

* update checkpointing

Signed-off-by: ashors1 <ashors@nvidia.com>

* move import

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* fix failing unit test

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix failing test

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* Updating num_weights check of RETRO due to underlying changes from mcore RETRO MLM

Signed-off-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: huvunvidia <huvunvidia@users.noreply.github.com>

* fix typo

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* remove stale warning

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix lora notebook

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix small typo

Signed-off-by: ashors1 <ashors@nvidia.com>

* add import guards to gemma2

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

---------

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com>
Signed-off-by: huvunvidia <huvunvidia@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com>
Co-authored-by: huvunvidia <huvunvidia@users.noreply.github.com>

* fix cherry-pick

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

---------

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com>
Signed-off-by: huvunvidia <huvunvidia@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: huvunvidia <86480512+huvunvidia@users.noreply.github.com>
Co-authored-by: huvunvidia <huvunvidia@users.noreply.github.com>
* minor 2.0 bug fix when TE/Apex not installed

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

---------

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Co-authored-by: Anna Shors <71393111+ashors1@users.noreply.github.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
Co-authored-by: ashors1 <ashors@nvidia.com>
…v variable (NVIDIA#9736) (NVIDIA#9750)

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>
Co-authored-by: Vladimir Bataev <vbataev@nvidia.com>
Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
* Fix issue with prompt_defaults

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add core level support for grad map tracking

Signed-off-by: smajumdar <titu1994@gmail.com>

* Add core level support for grad map tracking

Signed-off-by: smajumdar <titu1994@gmail.com>

* Apply isort and black reformatting

Signed-off-by: titu1994 <titu1994@users.noreply.github.com>

* Add tutorial and update repr of formatters

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update docs

Signed-off-by: smajumdar <titu1994@gmail.com>

---------

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: titu1994 <titu1994@users.noreply.github.com>
…al_batch_size (NVIDIA#9707) (NVIDIA#9753)

Signed-off-by: ashors1 <ashors@nvidia.com>
Co-authored-by: Anna Shors <71393111+ashors1@users.noreply.github.com>
Co-authored-by: Marc Romeyn <mromeijn@nvidia.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
* fix serialization of partial function

* update serialization to handle value.args

Signed-off-by: srabhi <srabhi@nvidia.com>

* add unit test

Signed-off-by: srabhi <srabhi@nvidia.com>

* remove redundant code from unit-test

Signed-off-by: srabhi <srabhi@nvidia.com>

---------

Signed-off-by: srabhi <srabhi@nvidia.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
…or (NVIDIA#9682)

* Speeds up copying of necessary artifact files with SaveRestoreConnector

Previously, the SaveRestoreConnector would copy and untar entire
checkpoints just to copy out a tokenizer. For models larger than 100GB,
this led to timeouts because only rank 0 did this work while the other
ranks moved on and waited at an all-gather barrier (observed NCCL timeout
at 10 min).
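
The idea behind this change can be illustrated with a minimal sketch, assuming the checkpoint is a plain tar archive. This is not the SaveRestoreConnector code from the commit; `extract_members` and the file name `tokenizer.model` are hypothetical names chosen for illustration. It reads the archive sequentially and writes out only the requested members instead of unpacking everything:

```python
import tarfile
from pathlib import Path


def extract_members(archive_path, wanted_names, out_dir):
    """Extract only the named files from a tar archive (hypothetical helper).

    Members are visited in archive order; anything not listed in
    `wanted_names` is skipped and never written to disk.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            # Archives are often created with a leading "./" on every name.
            name = member.name[2:] if member.name.startswith("./") else member.name
            if name in wanted_names and member.isfile():
                src = tar.extractfile(member)
                # Write under the base name only, so a member can never
                # escape out_dir through its stored path.
                (out_dir / Path(name).name).write_bytes(src.read())
                extracted.append(name)
    return extracted
```

A call such as `extract_members("model.nemo", {"tokenizer.model"}, "artifacts/")` would then write only the tokenizer file, leaving the weights packed in the archive.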

Signed-off-by: Terry Kong <terryk@nvidia.com>

* cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

* restoring logic to previous tempdir logic

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nlp overrides too

Signed-off-by: Terry Kong <terryk@nvidia.com>

* respect return_config

Signed-off-by: Terry Kong <terryk@nvidia.com>

* some unit tests

Signed-off-by: Terry Kong <terryk@nvidia.com>

* nodbg

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

* correct typing

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Fixes directory issue

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: terrykong <terrykong@users.noreply.github.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: terrykong <terrykong@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
* Add checkpoints section

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Fix title

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add section on ".qnemo" checkpoints (NVIDIA#9503)

* Add 'Quantized Checkpoints' section

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Address review comments

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Distributed checkpointing user guide (NVIDIA#9494)

* Describe shardings and entrypoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Strategies, optimizers, finalize entrypoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Transformations

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Integration

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add link from intro

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply grammar suggestions

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Explain the example

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply review suggestions

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add zarr and torch_dist explanation

---------

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* add subsection

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update docs/source/checkpoints/intro.rst

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

* address comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix code block

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* address comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* formatting

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
* ci: Add workflow for code-freeze

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* ci: Add workflow for releasing NeMo Toolkit

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
@WoodieDudy WoodieDudy force-pushed the sdpa-asr branch 2 times, most recently from 95ea37c to c82fbc3 Compare July 18, 2024 10:20
Signed-off-by: WoodieDudy <goshagks@gmail.com>
malay-nagda and others added 20 commits August 21, 2024 11:50
* 24.07 vboost numbers

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* 175b 512gpus

Signed-off-by: Malay Nagda <malayn@nvidia.com>

---------

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* fix mamba convert/ add test

* Apply isort and black reformatting

Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com>

* add mamba test

* fix ngroup in cicd

---------

Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com>
Co-authored-by: JRD971000 <JRD971000@users.noreply.github.com>
NVIDIA#10127)

* Resolve merge conflicts with consumed sample logging

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add test file that captures the predict step error

Signed-off-by: John St John <jstjohn@nvidia.com>

* Add fixme comment around proper checkpoint nemo2 handling

Signed-off-by: John St John <jstjohn@nvidia.com>

* Skip megatron training test on CPU nodes

Signed-off-by: John St John <jstjohn@nvidia.com>

* Move output_log to last arg for compatibility

Signed-off-by: John St John <jstjohn@nvidia.com>

* try setting the default root dir in predict to avoid writing artifacts to cwd

Signed-off-by: John St John <jstjohn@nvidia.com>

* Handle the new check for batch samplers to enable predict_step

Signed-off-by: John St John <jstjohn@nvidia.com>

* Only reset the global microbatch, not entire parallel state

Signed-off-by: John St John <jstjohn@nvidia.com>

* Destroy the right sets of state in test of lightning trainer

Signed-off-by: John St John <jstjohn@nvidia.com>

* Fix typo and rename state resetting functions

Signed-off-by: John St John <jstjohn@nvidia.com>

* Run test in a subprocess to avoid contaminating global state

Signed-off-by: John St John <jstjohn@nvidia.com>

---------

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
* add nemotron

* add nemotron exporter. make converted model identical

* Apply isort and black reformatting

Signed-off-by: suiyoubi <suiyoubi@users.noreply.github.com>

* add more config

* Apply isort and black reformatting

Signed-off-by: suiyoubi <suiyoubi@users.noreply.github.com>

* add config

* Apply isort and black reformatting

Signed-off-by: suiyoubi <suiyoubi@users.noreply.github.com>

* import refactor

* Apply isort and black reformatting

Signed-off-by: suiyoubi <suiyoubi@users.noreply.github.com>

* refactor config

* add 22B config

---------

Signed-off-by: suiyoubi <suiyoubi@users.noreply.github.com>
Co-authored-by: suiyoubi <suiyoubi@users.noreply.github.com>
* Riva and k2 ASR WFST decoding (2) (NVIDIA#9391)

* upload

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add comments and use case

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: GNroy <GNroy@users.noreply.github.com>

* add initial doc

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>

* fix doc and k2+cuda eval

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>

* isolate decoder components installation and fix suggestions

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: GNroy <GNroy@users.noreply.github.com>

* fix trailing newline

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>

---------

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>
Signed-off-by: GNroy <GNroy@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: GNroy <GNroy@users.noreply.github.com>
Co-authored-by: Vladimir Bataev <vbataev@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Add DdpParamParityChecker Callback

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Improve messaging

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Rename to DdpParityChecker

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Add ddp test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rename to ddp_parity_checker

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove red. imports

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* test fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* missing import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* ignore test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add missing import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* another missing import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make limit_val_batches int

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove dup file

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* AG groups decisions on DDP parity

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* Exclude from pytest

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Add L2_NeMo_2_GPT_DDP_Param_Parity_check to NeMo_CICD_Test.needs

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>
Signed-off-by: GNroy <GNroy@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Aleksandr Laptev <alaptev@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: GNroy <GNroy@users.noreply.github.com>
Co-authored-by: Vladimir Bataev <vbataev@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
…ter restoring from a checkpoint (NVIDIA#10225)

Signed-off-by: ashors1 <ashors@nvidia.com>
* Update TRTLLM 0.12

* Add model config

* Change config

* Change deploy script

* Apply isort and black reformatting

Signed-off-by: meatybobby <meatybobby@users.noreply.github.com>

* Remove parameter

---------

Signed-off-by: meatybobby <meatybobby@users.noreply.github.com>
Co-authored-by: meatybobby <meatybobby@users.noreply.github.com>
Co-authored-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
Signed-off-by: Ante Jukić <ajukic@nvidia.com>
…nd offsets in manifest (NVIDIA#10198)

* Add tests for LazyNeMoIterator and fix case with manifest_only=True and offsets in manifest

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Address code review

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
…ckpoints (NVIDIA#9939)

* perform serialization using relative paths to allow users to move checkpoints after they're saved

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* remove unused import

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix artifact load

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix path artifact

Signed-off-by: ashors1 <ashors@nvidia.com>

* remove unused import

Signed-off-by: ashors1 <ashors@nvidia.com>

---------

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
* Add MemoryProfileCallback

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>

* Remove reference cycles, save snapshot on specific ranks

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

* Remove unnecessary imports

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>

* Update docstring

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

---------

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>
Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Co-authored-by: Dong Hyuk Chang <donghyukc@nvidia.com>
…rocessing (NVIDIA#10052)

Flow matching generative model with SSL pretraining framework

Signed-off-by: Pin-Jui Ku <pku@nvidia.com>
Co-authored-by: Kuray107 <Kuray107@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
* Move nemotron transformers + tokenizer imports inline to reduce number of required deps

Signed-off-by: Marc Romeyn <mromeijn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>

---------

Signed-off-by: Marc Romeyn <mromeijn@nvidia.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>
* Wrap CPU model init with megatron_lazy_init_context

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Cleanup checkpoint-dir if saving fails

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: WoodieDudy <goshagks@gmail.com>
titu1994 and others added 6 commits August 26, 2024 11:58
Signed-off-by: titu1994 <titu1994@users.noreply.github.com>
Signed-off-by: WoodieDudy <goshagks@gmail.com>
Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>
Signed-off-by: WoodieDudy <goshagks@gmail.com>
Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>