
Commit

Merge branch 'alit/mamba_convertor_fix' of https://github.com/NVIDIA/NeMo into alit/mamba_convertor_fix
JRD971000 committed Aug 21, 2024
2 parents e72d49b + 4dcdf64 commit fec5243
Showing 96 changed files with 6,212 additions and 624 deletions.
12 changes: 7 additions & 5 deletions .github/workflows/cicd-main.yml
@@ -236,19 +236,19 @@ jobs:
with:
RUNNER: self-hosted-azure
SCRIPT: |
mkdir /home/TestData/multimodal/video_neva/llama3-ci-hf/${{ github.run_id }}
export PYTHONPATH=/home/TestData/multimodal/video_neva/LLaVA:$PYTHONPATH
CUDA_VISIBLE_DEVICES=0 python examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py \
--in-file /home/TestData/multimodal/video_neva/Llama-3-VILA1.5-8B/llm \
--mm-projector-ckpt-dir /home/TestData/multimodal/video_neva/Llama-3-VILA1.5-8B/mm_projector \
--mm-vision-tower /home/TestData/multimodal/video_neva/Llama-3-VILA1.5-8B/vision_tower \
--tokenizer-model /home/TestData/multimodal/video_neva/vita-tokenizer/ \
--config-file vita_config.yaml \
--out-file=/home/TestData/multimodal/video_neva/llama3-ci-hf/llama3_ci.nemo \
--out-file=/home/TestData/multimodal/video_neva/llama3-ci-hf/${{ github.run_id }}/llama3_ci.nemo \
--model-type VITA \
--conv-template llama_3
AFTER_SCRIPT: |
rm -f /home/TestData/multimodal/video_neva/llama3-ci-hf/llama3_ci.nemo
rm -rf /home/TestData/multimodal/video_neva/llama3-ci-hf/model_weights
rm -rf /home/TestData/multimodal/video_neva/llama3-ci-hf/${{ github.run_id }}
# this test is using a 7B model which is too large for GitHub CI
# replace the model in this test with a toy model or move the test
@@ -4737,7 +4737,8 @@ jobs:
--vocab-path=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \
--merges-path=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \
--data-path=/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document \
--index-mapping-dir=examples/llm/gpt_index_mappings
--index-mapping-dir=examples/llm/gpt_index_mappings \
--no-masked-softmax-fusion
python examples/llm/megatron_gpt_pretraining.py \
--devices=2 \
@@ -4746,7 +4747,8 @@ jobs:
--vocab-path=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \
--merges-path=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \
--data-path=/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document \
--index-mapping-dir=examples/llm/gpt_index_mappings
--index-mapping-dir=examples/llm/gpt_index_mappings \
--no-masked-softmax-fusion
AFTER_SCRIPT: |
rm -rf examples/llm/gpt_pretrain_results
rm -rf examples/llm/gpt_index_mappings
63 changes: 63 additions & 0 deletions docs/source/asr/asr_language_modeling_and_customization.rst
@@ -547,6 +547,69 @@ The following is the list of the arguments for the opengrm script:
| force | bool | ``False`` | Whether to recompile and rewrite all files |
+----------------------+--------+------------------+-----------------------------------------------------------------------------------------------------------------+

.. _wfst-ctc-decoding:

WFST CTC decoding
=================
Weighted Finite-State Transducers (WFST) are finite-state machines with input and output symbols on each transition and some weight element of a semiring. WFSTs can act as N-gram LMs in a special type of LM-forced beam search, called WFST decoding.

.. note::

More precisely, WFST decoding is more of a greedy N-depth search with LM.
Thus, it is asymptotically worse than conventional beam search decoding algorithms, but faster.

**WARNING**
At the moment, NeMo supports WFST decoding only for CTC models and word-based LMs.

To run WFST decoding in NeMo, one needs to provide a NeMo ASR model and either an ARPA LM or a WFST LM (advanced). An ARPA LM can be built from source text with KenLM as follows: ``<kenlm_bin_path>/lmplz -o <ngram_length> --arpa <out_arpa_path> --prune <ngram_prune>``.

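For convenience, the same KenLM invocation can also be scripted from Python. The following is only a sketch; ``kenlm/build/bin``, ``corpus.txt``, and the output path are illustrative placeholders rather than paths provided by NeMo:

.. code-block:: python

    # Sketch: build a pruned 3-gram ARPA LM with KenLM's lmplz binary.
    # kenlm_bin_path and the corpus/output paths are illustrative placeholders.
    import subprocess

    kenlm_bin_path = "kenlm/build/bin"
    with open("corpus.txt", "rb") as corpus:
        subprocess.run(
            [f"{kenlm_bin_path}/lmplz", "-o", "3", "--arpa", "3-gram.arpa", "--prune", "0", "0", "1"],
            stdin=corpus,
            check=True,
        )
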
The script to evaluate an ASR model with WFST decoding and N-gram models can be found at
`scripts/asr_language_modeling/ngram_lm/eval_wfst_decoding_ctc.py
<https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_wfst_decoding_ctc.py>`__.

This script has a large number of possible argument overrides; therefore, it is advised to use ``python eval_wfst_decoding_ctc.py --help`` to see the full list of arguments.

You may evaluate an ASR model as follows:

.. code-block::

    python eval_wfst_decoding_ctc.py nemo_model_file=<path to the .nemo file of the model> \
           input_manifest=<path to the evaluation JSON manifest file> \
           arpa_model_file=<path to the ARPA LM model> \
           decoding_wfst_file=<path to the decoding WFST file> \
           beam_width=[<list of the beam widths, separated with commas>] \
           lm_weight=[<list of the LM weight multipliers, separated with commas>] \
           open_vocabulary_decoding=<whether to use open vocabulary mode for WFST decoding> \
           decoding_mode=<decoding mode, affects output. Usually "nbest"> \
           decoding_search_type=<WFST decoding library. Usually "riva"> \
           preds_output_folder=<optional folder to store the predictions> \
           probs_cache_file=null

.. note::

Since WFST decoding is LM-forced (the search goes over the WIDEST graph), only word sequences accepted by the WFST can appear in the decoding results.
To circumvent this restriction, one can pass ``open_vocabulary_decoding=true`` (experimental feature).


Quick start example
-------------------

.. code-block::

    wget -O - https://www.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz | \
    gunzip -c | tr '[:upper:]' '[:lower:]' > 3-gram.pruned.1e-7.arpa && \
    python eval_wfst_decoding_ctc.py nemo_model_file="stt_en_conformer_ctc_small_ls" \
           input_manifest="<data_dir>/Librispeech/test_other.json" \
           arpa_model_file="3-gram.pruned.1e-7.arpa" \
           decoding_wfst_file="3-gram.pruned.1e-7.fst" \
           beam_width=[8] \
           lm_weight=[0.5,0.6,0.7,0.8,0.9]

.. note::

Building a decoding WFST is a long process, so it is better to provide a ``decoding_wfst_file`` path even if you don't have it.
This way, the decoding WFST will be buffered to the specified file path and there will be no need to re-build it on the next run.


***************************************************
Context-biasing (word boosting) without external LM
6 changes: 3 additions & 3 deletions docs/source/checkpoints/intro.rst
@@ -4,8 +4,8 @@ Checkpoints

In this section, we present key functionalities of NVIDIA NeMo related to checkpoint management.

Understanding Checkpoint Formats
--------------------------------
Checkpoint Formats
------------------

A ``.nemo`` checkpoint is fundamentally a tar file that bundles the model configurations (specified inside a YAML file), model weights (inside a ``.ckpt`` file), and other artifacts like tokenizer models or vocabulary files. This consolidated design streamlines sharing, loading, tuning, evaluating, and inference.

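Since the format is an ordinary tar archive, its contents can be inspected with standard tooling. The following minimal sketch assumes a local checkpoint at the placeholder path ``my_model.nemo``:

.. code-block:: python

    # Sketch: list the artifacts bundled inside a .nemo checkpoint.
    # "my_model.nemo" is a placeholder path, not a file shipped with NeMo.
    import tarfile

    with tarfile.open("my_model.nemo", "r") as archive:
        for member in archive.getmembers():
            print(member.name)  # e.g. the model config YAML, .ckpt weights, tokenizer files
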
@@ -43,7 +43,7 @@ The following example shows the contents of a quantized model intended to be ser
└── tokenizer_config.yaml
Community Checkpoint Converter
-----------------------------
------------------------------
We provide easy-to-use tools that enable users to convert community checkpoints into the NeMo format. These tools facilitate various operations, including resuming training, Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and deployment. For detailed instructions and guidelines, please refer to our documentation.

We offer comprehensive guides to assist both end users and developers:
2 changes: 1 addition & 1 deletion docs/source/collections.rst
@@ -25,7 +25,7 @@ Documentation for the individual collections
multimodal/vlm/intro
multimodal/text2img/intro
multimodal/nerf/intro
mumtimoda/speech_llm/intro
multimodal/speech_llm/intro

.. toctree::
:maxdepth: 1
72 changes: 38 additions & 34 deletions docs/source/core/core.rst
@@ -4,7 +4,7 @@ NeMo Models
Basics
------

NeMo models contain everything needed to train and reproduce Conversational AI models:
NeMo models contain everything needed to train and reproduce conversational AI models:

- neural network architectures
- datasets/data loaders
@@ -35,7 +35,7 @@ As an example, we can instantiate QuartzNet with the following:
model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
To see all available pretrained models for a specific NeMo model, use the ``list_available_models()`` method.
To see all available pretrained models for a specific NeMo model, use the ``list_available_models()`` method:

.. code-block:: Python
@@ -52,7 +52,7 @@ Training

NeMo leverages `PyTorch Lightning <https://www.pytorchlightning.ai/>`__ for model training. PyTorch Lightning lets NeMo decouple the
conversational AI code from the PyTorch training code. This means that NeMo users can focus on their domain (ASR, NLP, TTS) and
build complex AI applications without having to rewrite boiler plate code for PyTorch training.
build complex AI applications without having to rewrite boilerplate code for PyTorch training.

When using PyTorch Lightning, NeMo users can automatically train with:

@@ -168,7 +168,7 @@ While validation logic can be found in ``validation_step``:
return {'val_loss': val_loss, 'tp': tp, 'fn': fn, 'fp': fp}
PyTorch Lightning then handles all of the boiler plate code needed for training. Virtually any aspect of training can be customized
PyTorch Lightning then handles all of the boilerplate code needed for training. Virtually any aspect of training can be customized
via PyTorch Lightning `hooks <https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#hooks>`_,
`Plugins <https://pytorch-lightning.readthedocs.io/en/stable/extensions/plugins.html>`_,
`callbacks <https://pytorch-lightning.readthedocs.io/en/stable/extensions/callbacks.html>`_, or by overriding `methods <https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#methods>`_.
@@ -239,8 +239,8 @@ Every NeMo example YAML has the same underlying configuration structure:
- exp_manager
- model

Model configuration always contain ``train_ds``, ``validation_ds``, ``test_ds``, and ``optim``. Model architectures vary across
domains, therefore, refer to the ASR, NLP, and TTS Collections documentation for more detailed information on Model architecture configuration.
The model configuration always contains ``train_ds``, ``validation_ds``, ``test_ds``, and ``optim``. Model architectures, however, can vary across domains.
Refer to the documentation of specific collections (LLM, ASR etc.) for detailed information on model architecture configuration.

A NeMo configuration file should look similar to the following:

@@ -288,15 +288,11 @@ A NeMo configuration file should look similar to the following:
decoder:
...
More specific details about configuration files for each collection can be found on the following pages:

:ref:`NeMo ASR Configuration Files`

CLI
~~~

With NeMo and Hydra, every aspect of model training can be modified from the command-line. This is extremely helpful for running lots
of experiments on compute clusters or for quickly testing parameters while developing.
of experiments on compute clusters or for quickly testing parameters during development.

All NeMo `examples <https://github.com/NVIDIA/NeMo/tree/v1.0.2/examples>`_ come with instructions on how to
run the training/inference script from the command-line (see `here <https://github.com/NVIDIA/NeMo/blob/4e9da75f021fe23c9f49404cd2e7da4597cb5879/examples/asr/asr_ctc/speech_to_text_ctc.py#L24>`__
@@ -374,15 +370,15 @@ be instantiated and modified like any Python `Dataclass <https://docs.python.org
# modify the training batch size
cfg.train_ds.tokens_in_batch = 8192
.. note:: Configuration with Hydra always has the following precedence CLI > YAML > Dataclass
.. note:: Configuration with Hydra always has the following precedence CLI > YAML > Dataclass.

.. _optimization-label:

Optimization
------------

Optimizers and learning rate schedules are configurable across all NeMo models and have their own namespace. Here is a sample YAML
configuration for a Novograd optimizer with Cosine Annealing learning rate schedule.
configuration for a Novograd optimizer with a Cosine Annealing learning rate schedule.

.. code-block:: yaml
@@ -408,7 +404,7 @@ configuration for a Novograd optimizer with Cosine Annealing learning rate sched
warmup_ratio: null
min_lr: 1e-9:
.. note:: `NeMo Examples <https://github.com/NVIDIA/NeMo/tree/v1.0.2/examples>`_ has optimizer and scheduler configurations for every NeMo model.
.. note:: `NeMo Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples>`_ has optimizer and scheduler configurations for every NeMo model.

Optimizers can be configured from the CLI as well:

@@ -596,7 +592,7 @@ as shown below we can update this config prior to restoring the model.
Register Artifacts
------------------

Conversational AI models can be complicated to restore as more information is needed than just the checkpoint weights in order to use the model.
Restoring conversational AI models can be complicated because it requires more than just the checkpoint weights; additional information is also needed to use the model.
NeMo models can save additional artifacts in the .nemo file by calling ``.register_artifact``.
When restoring NeMo models using ``.restore_from`` or ``.from_pretrained``, any artifacts that were registered will be available automatically.

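As a rough sketch of the pattern (the class skeleton, config key, and paths below are illustrative and abridged, not a prescribed API surface), a model can register an external file in its constructor so that it is packed into the ``.nemo`` archive on save:

.. code-block:: python

    # Sketch only: the config key, attribute names, and paths are illustrative placeholders;
    # other abstract ModelPT methods are omitted for brevity.
    from nemo.core.classes import ModelPT


    class MyModel(ModelPT):
        def __init__(self, cfg, trainer=None):
            super().__init__(cfg=cfg, trainer=trainer)
            # Records the file as an artifact (bundled into the .nemo on save_to)
            # and returns the path to use at runtime.
            self.vocab_file = self.register_artifact("tokenizer.vocab_file", cfg.tokenizer.vocab_file)
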
@@ -643,7 +639,7 @@ Push to Hugging Face Hub
NeMo models can be pushed to the `Hugging Face Hub <https://huggingface.co/>`_ with the :meth:`~nemo.core.classes.mixins.hf_io_mixin.HuggingFaceFileIO.push_to_hf_hub` method. This method performs the same actions as ``save_to()`` and then uploads the model to the HuggingFace Hub. It offers an additional ``pack_nemo_file`` argument that allows the user to upload the entire NeMo file or just the ``.nemo`` file. This is useful for large language models that have a massive number of parameters, and a single NeMo file could exceed the max upload size of Hugging Face Hub.


Upload a model to the hub
Upload a model to the Hub
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python
@@ -688,15 +684,15 @@ Use a Custom Model Card Template for the Hub
Nested NeMo Models
------------------

In some cases, it may be helpful to use NeMo models inside other NeMo models. For example, we can incorporate language models into ASR models to use in a decoding process to improve accuracy or use hybrid ASR-TTS models to generate audio from the text on the fly to train or finetune the ASR model.
In some cases, it may be helpful to use NeMo models inside other NeMo models. For example, we can incorporate language models into ASR models to use in a decoding process to improve accuracy or use hybrid ASR-TTS models to generate audio from the text on the fly to train or fine-tune the ASR model.

There are 3 ways to instantiate child models inside parent models:
There are three ways to instantiate child models inside parent models:

- use subconfig directly
- use the ``.nemo`` checkpoint path to load the child model
- use a pretrained NeMo model

To register a child model, use the ``register_nemo_submodule`` method of the parent model. This method will add the child model to a provided model attribute and, in the serialization process, will handle child artifacts correctly and store the child model config in the parent model config in ``config_field``.
To register a child model, use the ``register_nemo_submodule`` method of the parent model. This method will add the child model to a specified model attribute. During serialization, it will correctly handle child artifacts and store the child model’s configuration in the parent model’s ``config_field``.

.. code-block:: python
@@ -746,30 +742,38 @@ To register a child model, use the ``register_nemo_submodule`` method of the par
Profiling
---------

NeMo offers users two options for profiling: Nsys & CUDA memory profiling. These two options allow users
NeMo offers users two options for profiling: Nsys and CUDA memory profiling. These two options allow users
to debug performance issues as well as memory issues such as memory leaks.

To enable Nsys profiling, add the following options to the model config:
nsys_profile: False
start_step: 10 # Global batch to start profiling
end_step: 10 # Global batch to end profiling
ranks: [0] # Global rank IDs to profile
gen_shape: False # Generate model and kernel details including input shapes

Finally, the model training script with:
.. code-block:: yaml
nsys_profile: False
start_step: 10 # Global batch to start profiling
end_step: 10 # Global batch to end profiling
ranks: [0] # Global rank IDs to profile
gen_shape: False # Generate model and kernel details including input shapes
Finally, run the model training script with:

.. code-block:: bash
nsys profile -s none -o <profile filepath> -t cuda,nvtx --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop python ./examples/...
nsys profile -s none -o <profile filepath> -t cuda,nvtx --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop python ./examples/...
See more options at `nsight user guide <https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-profiling>`_.



To enable CUDA memory profiling, add the following options to the model config:

memory_profile:
enabled: True
start_step: 10 # Global batch to start profiling
end_step: 10 # Global batch to end profiling
rank: 0 # Global rank ID to profile
output_path: None # Path to store the profile output file
.. code-block:: yaml
memory_profile:
enabled: True
start_step: 10 # Global batch to start profiling
end_step: 10 # Global batch to end profiling
rank: 0 # Global rank ID to profile
output_path: None # Path to store the profile output file
And invoke your NeMo script without any changes in the invocation command.
Then invoke your NeMo script without any changes in the invocation command.