
Commit

Merge branch 'alit/mamba_convertor_fix' of https://github.com/NVIDIA/NeMo into alit/mamba_convertor_fix
JRD971000 committed Aug 21, 2024
2 parents e72d49b + 4dcdf64 commit fec5243
Showing 96 changed files with 6,212 additions and 624 deletions.
12 changes: 7 additions & 5 deletions .github/workflows/cicd-main.yml
@@ -236,19 +236,19 @@ jobs:
with:
RUNNER: self-hosted-azure
SCRIPT: |
mkdir /home/TestData/multimodal/video_neva/llama3-ci-hf/${{ github.run_id }}
export PYTHONPATH=/home/TestData/multimodal/video_neva/LLaVA:$PYTHONPATH
CUDA_VISIBLE_DEVICES=0 python examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py \
--in-file /home/TestData/multimodal/video_neva/Llama-3-VILA1.5-8B/llm \
--mm-projector-ckpt-dir /home/TestData/multimodal/video_neva/Llama-3-VILA1.5-8B/mm_projector \
--mm-vision-tower /home/TestData/multimodal/video_neva/Llama-3-VILA1.5-8B/vision_tower \
--tokenizer-model /home/TestData/multimodal/video_neva/vita-tokenizer/ \
--config-file vita_config.yaml \
--out-file=/home/TestData/multimodal/video_neva/llama3-ci-hf/llama3_ci.nemo \
--out-file=/home/TestData/multimodal/video_neva/llama3-ci-hf/${{ github.run_id }}/llama3_ci.nemo \
--model-type VITA \
--conv-template llama_3
AFTER_SCRIPT: |
rm -f /home/TestData/multimodal/video_neva/llama3-ci-hf/llama3_ci.nemo
rm -rf /home/TestData/multimodal/video_neva/llama3-ci-hf/model_weights
rm -rf /home/TestData/multimodal/video_neva/llama3-ci-hf/${{ github.run_id }}
# this test is using a 7B model which is too large for GitHub CI
# replace the model in this test with a toy model or move the test
@@ -4737,7 +4737,8 @@ jobs:
--vocab-path=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \
--merges-path=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \
--data-path=/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document \
--index-mapping-dir=examples/llm/gpt_index_mappings
--index-mapping-dir=examples/llm/gpt_index_mappings \
--no-masked-softmax-fusion
python examples/llm/megatron_gpt_pretraining.py \
--devices=2 \
@@ -4746,7 +4747,8 @@ jobs:
--vocab-path=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \
--merges-path=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \
--data-path=/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document \
--index-mapping-dir=examples/llm/gpt_index_mappings
--index-mapping-dir=examples/llm/gpt_index_mappings \
--no-masked-softmax-fusion
AFTER_SCRIPT: |
rm -rf examples/llm/gpt_pretrain_results
rm -rf examples/llm/gpt_index_mappings
63 changes: 63 additions & 0 deletions docs/source/asr/asr_language_modeling_and_customization.rst
@@ -547,6 +547,69 @@ The following is the list of the arguments for the opengrm script:
| force | bool | ``False`` | Whether to recompile and rewrite all files |
+----------------------+--------+------------------+-----------------------------------------------------------------------------------------------------------------+

.. _wfst-ctc-decoding:

WFST CTC decoding
=================
Weighted Finite-State Transducers (WFST) are finite-state machines with input and output symbols on each transition and some weight element of a semiring. WFSTs can act as N-gram LMs in a special type of LM-forced beam search, called WFST decoding.

.. note::

More precisely, WFST decoding is more of a greedy N-depth search with LM.
Thus, it is asymptotically worse than conventional beam search decoding algorithms, but faster.

**WARNING**
At the moment, NeMo supports WFST decoding only for CTC models and word-based LMs.

To run WFST decoding in NeMo, one needs to provide a NeMo ASR model and either an ARPA LM or a WFST LM (advanced). An ARPA LM can be built from source text with KenLM as follows: ``<kenlm_bin_path>/lmplz -o <ngram_length> --arpa <out_arpa_path> --prune <ngram_prune>``.

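For convenience, the same KenLM invocation can also be scripted from Python. The following is only a sketch; ``kenlm/build/bin``, ``corpus.txt``, and the output path are illustrative placeholders rather than paths provided by NeMo:

.. code-block:: python

    # Sketch: build a pruned 3-gram ARPA LM with KenLM's lmplz binary.
    # kenlm_bin_path and the corpus/output paths are illustrative placeholders.
    import subprocess

    kenlm_bin_path = "kenlm/build/bin"
    with open("corpus.txt", "rb") as corpus:
        subprocess.run(
            [f"{kenlm_bin_path}/lmplz", "-o", "3", "--arpa", "3-gram.arpa", "--prune", "0", "0", "1"],
            stdin=corpus,
            check=True,
        )
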
The script to evaluate an ASR model with WFST decoding and N-gram models can be found at
`scripts/asr_language_modeling/ngram_lm/eval_wfst_decoding_ctc.py
<https://github.com/NVIDIA/NeMo/blob/stable/scripts/asr_language_modeling/ngram_lm/eval_wfst_decoding_ctc.py>`__.

This script has a large number of possible argument overrides; therefore, it is advised to use ``python eval_wfst_decoding_ctc.py --help`` to see the full list of arguments.

You may evaluate an ASR model as follows:

.. code-block::

    python eval_wfst_decoding_ctc.py nemo_model_file=<path to the .nemo file of the model> \
           input_manifest=<path to the evaluation JSON manifest file> \
           arpa_model_file=<path to the ARPA LM model> \
           decoding_wfst_file=<path to the decoding WFST file> \
           beam_width=[<list of the beam widths, separated with commas>] \
           lm_weight=[<list of the LM weight multipliers, separated with commas>] \
           open_vocabulary_decoding=<whether to use open vocabulary mode for WFST decoding> \
           decoding_mode=<decoding mode, affects output. Usually "nbest"> \
           decoding_search_type=<WFST decoding library. Usually "riva"> \
           preds_output_folder=<optional folder to store the predictions> \
           probs_cache_file=null

.. note::

Since WFST decoding is LM-forced (the search goes over the WIDEST graph), only word sequences accepted by the WFST can appear in the decoding results.
To circumvent this restriction, one can pass ``open_vocabulary_decoding=true`` (experimental feature).


Quick start example
-------------------

.. code-block::

    wget -O - https://www.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz | \
    gunzip -c | tr '[:upper:]' '[:lower:]' > 3-gram.pruned.1e-7.arpa && \
    python eval_wfst_decoding_ctc.py nemo_model_file="stt_en_conformer_ctc_small_ls" \
           input_manifest="<data_dir>/Librispeech/test_other.json" \
           arpa_model_file="3-gram.pruned.1e-7.arpa" \
           decoding_wfst_file="3-gram.pruned.1e-7.fst" \
           beam_width=[8] \
           lm_weight=[0.5,0.6,0.7,0.8,0.9]

.. note::

Building a decoding WFST is a long process, so it is better to provide a ``decoding_wfst_file`` path even if you don't have it.
This way, the decoding WFST will be buffered to the specified file path and there will be no need to re-build it on the next run.


***************************************************
Context-biasing (word boosting) without external LM
6 changes: 3 additions & 3 deletions docs/source/checkpoints/intro.rst
@@ -4,8 +4,8 @@ Checkpoints

In this section, we present key functionalities of NVIDIA NeMo related to checkpoint management.

Understanding Checkpoint Formats
--------------------------------
Checkpoint Formats
------------------

A ``.nemo`` checkpoint is fundamentally a tar file that bundles the model configurations (specified inside a YAML file), model weights (inside a ``.ckpt`` file), and other artifacts like tokenizer models or vocabulary files. This consolidated design streamlines sharing, loading, tuning, evaluating, and inference.

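Since the format is an ordinary tar archive, its contents can be inspected with standard tooling. The following minimal sketch assumes a local checkpoint at the placeholder path ``my_model.nemo``:

.. code-block:: python

    # Sketch: list the artifacts bundled inside a .nemo checkpoint.
    # "my_model.nemo" is a placeholder path, not a file shipped with NeMo.
    import tarfile

    with tarfile.open("my_model.nemo", "r") as archive:
        for member in archive.getmembers():
            print(member.name)  # e.g. the model config YAML, .ckpt weights, tokenizer files
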
@@ -43,7 +43,7 @@ The following example shows the contents of a quantized model intended to be ser
└── tokenizer_config.yaml
Community Checkpoint Converter
-----------------------------
------------------------------
We provide easy-to-use tools that enable users to convert community checkpoints into the NeMo format. These tools facilitate various operations, including resuming training, Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and deployment. For detailed instructions and guidelines, please refer to our documentation.

We offer comprehensive guides to assist both end users and developers:
2 changes: 1 addition & 1 deletion docs/source/collections.rst
@@ -25,7 +25,7 @@ Documentation for the individual collections
multimodal/vlm/intro
multimodal/text2img/intro
multimodal/nerf/intro
mumtimoda/speech_llm/intro
multimodal/speech_llm/intro

.. toctree::
:maxdepth: 1
72 changes: 38 additions & 34 deletions docs/source/core/core.rst
@@ -4,7 +4,7 @@ NeMo Models
Basics
------

NeMo models contain everything needed to train and reproduce Conversational AI models:
NeMo models contain everything needed to train and reproduce conversational AI models:

- neural network architectures
- datasets/data loaders
@@ -35,7 +35,7 @@ As an example, we can instantiate QuartzNet with the following:
model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
To see all available pretrained models for a specific NeMo model, use the ``list_available_models()`` method.
To see all available pretrained models for a specific NeMo model, use the ``list_available_models()`` method:

.. code-block:: Python
@@ -52,7 +52,7 @@ Training

NeMo leverages `PyTorch Lightning <https://www.pytorchlightning.ai/>`__ for model training. PyTorch Lightning lets NeMo decouple the
conversational AI code from the PyTorch training code. This means that NeMo users can focus on their domain (ASR, NLP, TTS) and
build complex AI applications without having to rewrite boiler plate code for PyTorch training.
build complex AI applications without having to rewrite boilerplate code for PyTorch training.

When using PyTorch Lightning, NeMo users can automatically train with:

@@ -168,7 +168,7 @@ While validation logic can be found in ``validation_step``:
return {'val_loss': val_loss, 'tp': tp, 'fn': fn, 'fp': fp}
PyTorch Lightning then handles all of the boiler plate code needed for training. Virtually any aspect of training can be customized
PyTorch Lightning then handles all of the boilerplate code needed for training. Virtually any aspect of training can be customized
via PyTorch Lightning `hooks <https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#hooks>`_,
`Plugins <https://pytorch-lightning.readthedocs.io/en/stable/extensions/plugins.html>`_,
`callbacks <https://pytorch-lightning.readthedocs.io/en/stable/extensions/callbacks.html>`_, or by overriding `methods <https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html#methods>`_.
@@ -239,8 +239,8 @@ Every NeMo example YAML has the same underlying configuration structure:
- exp_manager
- model

Model configuration always contain ``train_ds``, ``validation_ds``, ``test_ds``, and ``optim``. Model architectures vary across
domains, therefore, refer to the ASR, NLP, and TTS Collections documentation for more detailed information on Model architecture configuration.
The model configuration always contains ``train_ds``, ``validation_ds``, ``test_ds``, and ``optim``. Model architectures, however, can vary across domains.
Refer to the documentation of specific collections (LLM, ASR etc.) for detailed information on model architecture configuration.

A NeMo configuration file should look similar to the following:

@@ -288,15 +288,11 @@ A NeMo configuration file should look similar to the following:
decoder:
...
More specific details about configuration files for each collection can be found on the following pages:

:ref:`NeMo ASR Configuration Files`

CLI
~~~

With NeMo and Hydra, every aspect of model training can be modified from the command-line. This is extremely helpful for running lots
of experiments on compute clusters or for quickly testing parameters while developing.
of experiments on compute clusters or for quickly testing parameters during development.

All NeMo `examples <https://github.com/NVIDIA/NeMo/tree/v1.0.2/examples>`_ come with instructions on how to
run the training/inference script from the command-line (see `here <https://github.com/NVIDIA/NeMo/blob/4e9da75f021fe23c9f49404cd2e7da4597cb5879/examples/asr/asr_ctc/speech_to_text_ctc.py#L24>`__
@@ -374,15 +370,15 @@ be instantiated and modified like any Python `Dataclass <https://docs.python.org
# modify the training batch size
cfg.train_ds.tokens_in_batch = 8192
.. note:: Configuration with Hydra always has the following precedence CLI > YAML > Dataclass
.. note:: Configuration with Hydra always has the following precedence CLI > YAML > Dataclass.

.. _optimization-label:

Optimization
------------

Optimizers and learning rate schedules are configurable across all NeMo models and have their own namespace. Here is a sample YAML
configuration for a Novograd optimizer with Cosine Annealing learning rate schedule.
configuration for a Novograd optimizer with a Cosine Annealing learning rate schedule.

.. code-block:: yaml
@@ -408,7 +404,7 @@ configuration for a Novograd optimizer with Cosine Annealing learning rate sched
warmup_ratio: null
min_lr: 1e-9:
.. note:: `NeMo Examples <https://github.com/NVIDIA/NeMo/tree/v1.0.2/examples>`_ has optimizer and scheduler configurations for every NeMo model.
.. note:: `NeMo Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples>`_ has optimizer and scheduler configurations for every NeMo model.

Optimizers can be configured from the CLI as well:

@@ -596,7 +592,7 @@ as shown below we can update this config prior to restoring the model.
Register Artifacts
------------------

Conversational AI models can be complicated to restore as more information is needed than just the checkpoint weights in order to use the model.
Restoring conversational AI models can be complicated because it requires more than just the checkpoint weights; additional information is also needed to use the model.
NeMo models can save additional artifacts in the .nemo file by calling ``.register_artifact``.
When restoring NeMo models using ``.restore_from`` or ``.from_pretrained``, any artifacts that were registered will be available automatically.

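As a rough sketch of the pattern (the class skeleton, config key, and paths below are illustrative and abridged, not a prescribed API surface), a model can register an external file in its constructor so that it is packed into the ``.nemo`` archive on save:

.. code-block:: python

    # Sketch only: the config key, attribute names, and paths are illustrative placeholders;
    # other abstract ModelPT methods are omitted for brevity.
    from nemo.core.classes import ModelPT


    class MyModel(ModelPT):
        def __init__(self, cfg, trainer=None):
            super().__init__(cfg=cfg, trainer=trainer)
            # Records the file as an artifact (bundled into the .nemo on save_to)
            # and returns the path to use at runtime.
            self.vocab_file = self.register_artifact("tokenizer.vocab_file", cfg.tokenizer.vocab_file)
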
@@ -643,7 +639,7 @@ Push to Hugging Face Hub
NeMo models can be pushed to the `Hugging Face Hub <https://huggingface.co/>`_ with the :meth:`~nemo.core.classes.mixins.hf_io_mixin.HuggingFaceFileIO.push_to_hf_hub` method. This method performs the same actions as ``save_to()`` and then uploads the model to the HuggingFace Hub. It offers an additional ``pack_nemo_file`` argument that allows the user to upload the entire NeMo file or just the ``.nemo`` file. This is useful for large language models that have a massive number of parameters, and a single NeMo file could exceed the max upload size of Hugging Face Hub.


Upload a model to the hub
Upload a model to the Hub
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python
@@ -688,15 +684,15 @@ Use a Custom Model Card Template for the Hub
Nested NeMo Models
------------------

In some cases, it may be helpful to use NeMo models inside other NeMo models. For example, we can incorporate language models into ASR models to use in a decoding process to improve accuracy or use hybrid ASR-TTS models to generate audio from the text on the fly to train or finetune the ASR model.
In some cases, it may be helpful to use NeMo models inside other NeMo models. For example, we can incorporate language models into ASR models to use in a decoding process to improve accuracy or use hybrid ASR-TTS models to generate audio from the text on the fly to train or fine-tune the ASR model.

There are 3 ways to instantiate child models inside parent models:
There are three ways to instantiate child models inside parent models:

- use subconfig directly
- use the ``.nemo`` checkpoint path to load the child model
- use a pretrained NeMo model

To register a child model, use the ``register_nemo_submodule`` method of the parent model. This method will add the child model to a provided model attribute and, in the serialization process, will handle child artifacts correctly and store the child model config in the parent model config in ``config_field``.
To register a child model, use the ``register_nemo_submodule`` method of the parent model. This method will add the child model to a specified model attribute. During serialization, it will correctly handle child artifacts and store the child model’s configuration in the parent model’s ``config_field``.

.. code-block:: python
@@ -746,30 +742,38 @@ To register a child model, use the ``register_nemo_submodule`` method of the par
Profiling
---------

NeMo offers users two options for profiling: Nsys & CUDA memory profiling. These two options allow users
NeMo offers users two options for profiling: Nsys and CUDA memory profiling. These two options allow users
to debug performance issues as well as memory issues such as memory leaks.

To enable Nsys profiling, add the following options to the model config:
nsys_profile: False
start_step: 10 # Global batch to start profiling
end_step: 10 # Global batch to end profiling
ranks: [0] # Global rank IDs to profile
gen_shape: False # Generate model and kernel details including input shapes

Finally, the model training script with:
.. code-block:: yaml
nsys_profile: False
start_step: 10 # Global batch to start profiling
end_step: 10 # Global batch to end profiling
ranks: [0] # Global rank IDs to profile
gen_shape: False # Generate model and kernel details including input shapes
Finally, run the model training script with:

.. code-block:: bash
nsys profile -s none -o <profile filepath> -t cuda,nvtx --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop python ./examples/...
nsys profile -s none -o <profile filepath> -t cuda,nvtx --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop python ./examples/...
See more options at `nsight user guide <https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-profiling>`_.



To enable CUDA memory profiling, add the following options to the model config:

memory_profile:
enabled: True
start_step: 10 # Global batch to start profiling
end_step: 10 # Global batch to end profiling
rank: 0 # Global rank ID to profile
output_path: None # Path to store the profile output file
.. code-block:: yaml
memory_profile:
enabled: True
start_step: 10 # Global batch to start profiling
end_step: 10 # Global batch to end profiling
rank: 0 # Global rank ID to profile
output_path: None # Path to store the profile output file
And invoke your NeMo script without any changes in the invocation command.
Then invoke your NeMo script without any changes in the invocation command.