Update pruning and distillation tutorial notebooks #11091
Conversation
Signed-off-by: Gomathy Venkata Krishnan <gvenkatakris@nvidia.com>
@@ -17,6 +17,6 @@ This repository contains jupyter notebook tutorials using NeMo Framework for Lla
* - `Llama 3.1 Law-Domain LoRA Fine-Tuning and Deployment with NeMo Framework and NVIDIA NIM <./sdg-law-title-generation>`_
  - `Law StackExchange <https://huggingface.co/datasets/ymoslem/Law-StackExchange>`_
  - Perform LoRA PEFT on Llama 3.1 8B Instruct using a synthetically augmented version of Law StackExchange with NeMo Framework, followed by deployment with NVIDIA NIM. As a pre-requisite, follow the tutorial for `data curation using NeMo Curator <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg>`__.
* - `Llama 3.1 WikiText Pruning and Distillation with NeMo Framework <./pruning-distillation>`_
Line 5/5 fix capitalization
This repository contains Jupyter Notebook tutorials using the NeMo Framework for Llama-3 and Llama-3.1 models by Meta.
  - Perform LoRA PEFT on Llama 3.1 8B Instruct using a synthetically augmented version of Law StackExchange with NeMo Framework, followed by deployment with NVIDIA NIM. As a pre-requisite, follow the tutorial for `data curation using NeMo Curator <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg>`__.
Line 19/19 fix punctuation.
Perform LoRA PEFT on Llama 3.1 8B Instruct using a synthetically augmented version of Law StackExchange with NeMo Framework, followed by deployment with NVIDIA NIM. As a prerequisite, follow the tutorial for `data curation using NeMo Curator <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg>`__.
"\n", | ||
"The dataset has to be preprocessed using the [preprocess_data_for_megatron.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py) script included in the NeMo Framework. This step will also tokenize data using the `meta-llama/Meta-Llama-3.1-8B` tokenizer model to convert the data into a memory map format.\n", | ||
"\n", | ||
"> `NOTE:` In the block of code below, pass the paths to your train, test and validation data files." |
fix punctuation
In the block of code below, pass the paths to your train, test, and validation data files.
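For reference, here is a minimal sketch of what that preprocessing call can look like. The container path, file names, and flag values below are placeholder assumptions rather than the notebook's exact arguments; check the script's `--help` output for the authoritative flags.

```python
import subprocess

# Hypothetical invocation of the NeMo preprocessing script (paths and flag
# values are placeholders -- substitute your own train/test/validation files).
for split in ["train", "val", "test"]:
    subprocess.run(
        [
            "python",
            "/opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py",
            f"--input=wikitext-{split}.jsonl",            # one JSON record per line
            f"--output-prefix=wikitext_tokenized_{split}",
            "--tokenizer-library=huggingface",
            "--tokenizer-type=meta-llama/Meta-Llama-3.1-8B",
            "--dataset-impl=mmap",                         # memory-mapped output format
            "--append-eod",
            "--workers=8",
        ],
        check=True,
    )
```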
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"### Step 2: Finetune the teacher on the dataset\n", |
fix punctuation
Step 2: Fine-tune the teacher on the dataset
"\n", | ||
"### Step 2: Finetune the teacher on the dataset\n", | ||
"\n", | ||
"NeMo framework includes a standard python script [megatron_gpt_pretraining.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py) for training a model. Once you have your model downloaded and the dataset ready, fine-tuning the teacher model with NeMo is essentially just running this script!\n", |
fix punctuation and capitalization
"NeMo Framework includes a standard Python script, megatron_gpt_pretraining.py, for training a model. Once you have your model downloaded and the dataset ready, fine-tuning the teacher model with NeMo is essentially just running this script!\n",
"\n", | ||
"NeMo framework includes a standard python script [megatron_gpt_pretraining.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py) for training a model. Once you have your model downloaded and the dataset ready, fine-tuning the teacher model with NeMo is essentially just running this script!\n", | ||
"\n", | ||
"We finetune the unpruned model on our dataset to correct the distribution shift across the original dataset the model was trained on. Per the [blog](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/) and [tech report](https://arxiv.org/pdf/2408.11796), experiments showed that, without correcting for the distribution shift, the teacher provides suboptimal guidance on the dataset when being distilled.\n", |
fix punctuation
We fine-tune the unpruned model on our dataset to correct the distribution shift from the original dataset the model was trained on. According to the blog and tech report, experiments showed that without correcting for this distribution shift, the teacher provides suboptimal guidance on the dataset during distillation.
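As a rough illustration of what "just running this script" means in practice, here is a hedged sketch of such a launch. The Hydra override names and paths are assumptions based on typical megatron_gpt_pretraining.py configurations, not the notebook's exact cell; verify them against the script's config file before running.

```python
import subprocess

# Hedged sketch of a teacher fine-tuning launch. Override names and values are
# illustrative assumptions -- confirm against megatron_gpt_pretraining.py's
# Hydra config before running.
subprocess.run(
    [
        "python",
        "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py",
        "model.restore_from_path=./llama-3_1-8b.nemo",  # assumed teacher checkpoint path
        "model.data.data_prefix=[1.0,wikitext_tokenized_train_text_document]",
        "model.tensor_model_parallel_size=8",
        "model.micro_batch_size=1",
        "model.global_batch_size=128",
        "trainer.devices=8",
        "trainer.max_steps=30",
        "trainer.val_check_interval=10",
    ],
    check=True,
)
```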
"#### Step 4.b.: Using width-pruned student\n", | ||
"While distilling knowledge from the teacher to width-pruned model, the `STUDENT` model would be `4b_width_pruned_model.nemo` as produced by the [width-pruning](./03_b_width_pruning.ipynb) notebook. This training run is capped by `STEPS`, and validation is carried out every `VAL_INTERVAL` steps.\n", | ||
"\n", | ||
"> `NOTE:` In the block of code below, pass the paths to your pre-processed train, test and validation data files as well as path to the teacher and student .nemo models." |
fix punctuation
> `NOTE:` In the block of code below, pass the paths to your pre-processed train, test, and validation data files, as well as the paths to the teacher and student .nemo models.
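To make the wiring concrete, here is a minimal sketch of the variables that cell ties together. All paths and values are placeholders; point them at your own artifacts.

```python
# Placeholder values -- adjust to your own runs and file locations.
TEACHER = "./megatron_llama_ft.nemo"       # fine-tuned teacher from Step 2
STUDENT = "./4b_width_pruned_model.nemo"   # width-pruned student from notebook 03_b
STEPS = 30                                  # cap on distillation training steps
VAL_INTERVAL = 10                           # validate every VAL_INTERVAL steps

# Pre-processed splits produced in Step 1 (index prefixes, not raw text files).
DATA_TRAIN = "wikitext_tokenized_train_text_document"
DATA_VAL = "wikitext_tokenized_val_text_document"
DATA_TEST = "wikitext_tokenized_test_text_document"
```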
"### Step 5: Display the validation loss\n", | ||
"\n", | ||
"Now that the results are in, let's visualize the validation loss of the two distilled models using the `tensorboard` library. \n", | ||
"> `NOTE:` This notebook demonstrates the use of the teacher finetuning, pruning and the distillation script. These scripts should ideally be run on a multi-node cluster with a larger `GLOBAL_BATCH_SIZE` and `STEPS` to see improvement in the validation loss." |
fix punctuation
> `NOTE:` This notebook demonstrates the use of the teacher fine-tuning, pruning, and distillation scripts. These scripts should ideally be run on a multi-node cluster with a larger `GLOBAL_BATCH_SIZE` and `STEPS` to see improvement in the validation loss.
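For readers who prefer pulling the numbers directly rather than using the TensorBoard UI, here is a small sketch using the `tensorboard` event-file API. The log directory and the `val_loss` tag are assumptions about what the run logged.

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Hypothetical log directory -- point this at the distillation run's output.
acc = EventAccumulator("./distill_logs")
acc.Reload()  # parse the event files on disk

# "val_loss" is an assumed tag name; list what was actually logged via acc.Tags().
for event in acc.Scalars("val_loss"):
    print(f"step={event.step:4d}  val_loss={event.value:.4f}")
```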
"id": "b5822d62-8131-4046-8c22-0bf0fce81df7", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Validation Loss using depth-pruned model as student in distillation script\n", |
fix capitalization
Validation Loss Using Depth-Pruned Model as Student in Distillation Script
"metadata": {}, | ||
"source": [ | ||
"#### Validation Loss using depth-pruned model as student in distillation script\n", | ||
"Here is an image of the validation loss over 30 steps of running the training step in the distillation script when we distill the knowledge from the finetuned teacher model to the depth-pruned student." |
fix punctuation, revise sentence
"Here is an image of the validation loss over 30 steps of running the training step in the distillation script, where we distill the knowledge from the fine-tuned teacher model to the depth-pruned student."
{
"data": {
"text/html": [
"<h5>Validation Loss over 30 Training Steps with Depth-Pruned model as Student</h5>"
fix capitalization
Validation Loss over 30 Training Steps with Depth-Pruned Model as Student
],
"source": [
"from IPython.display import Image, display, HTML\n",
"title = \"Validation Loss over 30 Training Steps with Depth-Pruned model as Student\"\n",
fix capitalization
title = "Validation Loss over 30 Training Steps with Depth-Pruned Model as Student"\n",
"id": "f10041ae-6533-47de-9f76-f97d4469c27a", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Validation Loss using width-pruned model as student in distillation script\n", |
fix capitalization
Validation Loss Using Width-Pruned Model as Student in Distillation Script
"metadata": {}, | ||
"source": [ | ||
"#### Validation Loss using width-pruned model as student in distillation script\n", | ||
"Here is an image of the validation loss over 30 steps of running the training step in the distillation script when we distill the knowledge from the finetuned teacher model to the width-pruned student." |
fix capitalization, revise sentence
"Here is an image of the validation loss over 30 steps of running the training step in the distillation script, where we distill the knowledge from the fine-tuned teacher model to the width-pruned student."
{
"data": {
"text/html": [
"<h5>Validation Loss over 30 Training Steps with Width-Pruned model as Student</h5>"
fix capitalization
"
Validation Loss over 30 Training Steps with Width-Pruned Model as Student
"@@ -1,18 +1,26 @@ | |||
Llama 3.1 WikiText Pruning and Distillation with NeMo Framework | |||
Llama 3.1 Pruning and Distillation with NeMo Framework | |||
======================================================================================= | |||
|
|||
`Llama 3.1 <https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/>`_ are open-source large language models by Meta that deliver state-of-the-art performance on popular industry benchmarks. They have been pretrained on over 15 trillion tokens, and support a 128K token context length. They are available in three sizes, 8B, 70B, and 405B, and each size has two variants—base pretrained and instruction tuned. |
revise paragraph
Llama 3.1 models, developed by Meta, are open-source large language models that deliver state-of-the-art performance on popular industry benchmarks. Pretrained on over 15 trillion tokens, they support a 128K token context length. These models are available in three sizes: 8B, 70B, and 405B. Each size offers two variants: base pretrained and instruction tuned.
`NVIDIA NeMo Framework <https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html>`_ provides tools to perform teacher finetuning, pruning and distillation on Llama 3.1 to fit your use case.
fix punctuation
`NVIDIA NeMo Framework <https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html>`_ provides tools to perform teacher fine-tuning, pruning, and distillation on Llama 3.1 to fit your use case.
`NVIDIA TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ is a library (referred to as **Model Optimizer**, or **ModelOpt**) comprising state-of-the-art model optimization techniques including `quantization <https://github.com/NVIDIA/TensorRT-Model-Optimizer#quantization>`_, `sparsity <https://github.com/NVIDIA/TensorRT-Model-Optimizer#sparsity>`_, `distillation <https://github.com/NVIDIA/TensorRT-Model-Optimizer#distillation>`_, and `pruning <https://github.com/NVIDIA/TensorRT-Model-Optimizer#pruning>`_ to compress models.
`LLM Pruning and Distillation in Practice: The Minitron Approach <https://arxiv.org/abs/2408.11796>`_ provides tools to perform teacher finetuning, pruning and distillation on Llama 3.1 as described in the `tech report <https://arxiv.org/abs/2408.11796>`_. |
fix punctuation
`LLM Pruning and Distillation in Practice: The Minitron Approach <https://arxiv.org/abs/2408.11796>`_ provides tools to perform teacher fine-tuning, pruning, and distillation on Llama 3.1 as described in the `tech report <https://arxiv.org/abs/2408.11796>`_.
This tutorial shows how to perform depth-pruning, teacher finetuning and distillation on **Llama 3.1 8B** using the `WikiText-103-v1 <https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1>`_ dataset with NeMo Framework. The `WikiText-103-v1 <https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1>`_ language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. For this demonstration, we will perform teacher correction by running a light finetuning procedure on the ``Meta Llama 3.1 8B`` teacher model to generate a finetuned teacher model ``megatron_llama_ft.nemo`` needed for optimal distillation. This finetuned teacher model is then trimmed. There are two methods to prune a model: depth-pruning and width-pruning. We will be exploring both pruning techniques which will yield ``4b_depth_pruned_model.nemo`` and ``4b_width_pruned_model.nemo`` respectively. These models will serve as a starting point for distillation to create the final distilled 4B models.
We are using models utilizing the ``meta-llama/Meta-Llama-3.1-8B`` tokenizer for this demonstration.
fix punctuation and revise paragraph
This tutorial demonstrates how to perform depth-pruning, teacher fine-tuning, and distillation on Llama 3.1 8B using the `WikiText-103-v1 <https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1>`_ dataset with the NeMo Framework. The `WikiText-103-v1 <https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1>`_ language modeling dataset comprises over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.
For this demonstration, we will perform teacher correction by running a light fine-tuning procedure on the ``Meta Llama 3.1 8B`` teacher model to generate a fine-tuned teacher model, ``megatron_llama_ft.nemo``, needed for optimal distillation. This fine-tuned teacher model is then trimmed. There are two methods to prune a model: depth-pruning and width-pruning. We will explore both techniques, yielding ``4b_depth_pruned_model.nemo`` and ``4b_width_pruned_model.nemo``, respectively. These models will serve as starting points for distillation to create the final distilled 4B models.
We are using models utilizing the ``meta-llama/Meta-Llama-3.1-8B`` tokenizer for this demonstration.

``NOTE:`` A subset of functions is being demonstrated in the notebooks. Some features like Neural Architecture Search (NAS) are unavailable but will be supported in future releases.
fix punctuation
``NOTE:`` A subset of functions is being demonstrated in the notebooks. Some features like Neural Architecture Search (NAS) are unavailable, but will be supported in future releases.
Line 20/28 revise bullet text
Access to at least 8 NVIDIA GPUs, each with a memory of at least 80GB (e.g., 8 x H100-80GB or 8 x A100-80GB).

Line 23/31 fix punctuation
Authenticate with `NVIDIA NGC <https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#ngc-authentication>`_ and download the `NGC CLI Tool <https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#ngc-cli-tool>`_. You will use this tool to download the model and customize it with NeMo Framework.

Line 27/35 revise note text
``NOTE:`` The default configuration in the notebook runs on 8 x 80GB NVIDIA GPUs. However, you can potentially reduce the Tensor Parallel size (``TENSOR_PARALLEL_SIZE``) along with the Micro-Batch size (``MICRO_BATCH_SIZE``) in the teacher fine-tuning and distillation scripts to accommodate lower resource availability.
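To illustrate the trade-off in that note, here is a small sketch of the batch arithmetic involved. Variable names mirror the note; the concrete values are assumptions, not the notebook defaults.

```python
# Placeholder values illustrating how a reduced setup can keep the same
# effective batch size via gradient accumulation.
NUM_GPUS = 4               # e.g., 4 x A100-80GB instead of the default 8
TENSOR_PARALLEL_SIZE = 2   # reduced from the notebook default
MICRO_BATCH_SIZE = 1       # smaller micro-batches lower per-GPU memory
GLOBAL_BATCH_SIZE = 128    # held fixed across both setups

data_parallel_size = NUM_GPUS // TENSOR_PARALLEL_SIZE
accumulation_steps = GLOBAL_BATCH_SIZE // (MICRO_BATCH_SIZE * data_parallel_size)
print(f"data-parallel replicas: {data_parallel_size}")        # 2
print(f"gradient accumulation steps: {accumulation_steps}")   # 64
```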
@@ -31,14 +39,16 @@ Create a pruned and distilled model with NeMo Framework

For pruning and distilling the model, you will use the NeMo Framework which is available as a `docker container <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo>`_.

``NOTE:`` These notebooks use `NVIDIA TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ under the hood for pruning and distillation.
revise note
``NOTE:`` These notebooks use the `NVIDIA TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ under the hood for pruning and distillation.
This directory contains a list of notebooks which will go over all the steps to create a distilled 4B model.
revise text
This directory contains a list of notebooks that cover all the steps to create a distilled 4B model.
Results
------------------------------------------------------------------------------
``NOTE:`` This notebook demonstrates the use of the teacher finetuning, pruning and the distillation script. These scripts should ideally be run on a multi-node cluster with a larger ``GLOBAL_BATCH_SIZE`` and ``STEPS`` to see improvement in the validation loss.
``NOTE:`` This notebook demonstrates the use of the teacher finetuning, pruning and the distillation scripts. These scripts should ideally be run on a multi-node cluster with a larger ``GLOBAL_BATCH_SIZE`` and ``STEPS`` to see improvement in the validation loss.
fix punctuation
``NOTE:`` This notebook demonstrates the use of the teacher fine-tuning, pruning, and distillation scripts. These scripts should ideally be run on a multi-node cluster with a larger ``GLOBAL_BATCH_SIZE`` and ``STEPS`` to see improvement in the validation loss.
.. figure:: https://github.com/NVIDIA/NeMo/releases/download/r2.0.0rc1/val_loss_distillation.png

Figure 1: Validation Loss Plot when using the depth-pruned model as the student
fix capitalization
Figure 1: Validation Loss Plot When Using the Depth-Pruned Model as the Student
Figure 2: Validation Loss Plot when using the width-pruned model as the student
fix capitalization
Figure 2: Validation Loss Plot When Using the Width-Pruned Model as the Student
"This demonstration showcases performing pruning and distillation on **Llama 3.1-8B** with the [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) dataset using NeMo Framework. The [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) language modeling dataset is a collection of over 100 million tokens extracted from the set of verified 'Good' and 'Featured' articles on Wikipedia. \n", | ||
"\n", | ||
"For this demonstration, we will perform a light finetuning procedure on the `Meta Llama 3.1 8B` teacher model to generate a finetuned teacher model. This finetuned teacher model will then be trimmed. There are two methods to prune a model: depth-pruning and width-pruning. This workflow will showcase both methods which will yield `4b_depth_pruned_model.nemo` and `4b_width_pruned_model.nemo` respectively, that will serve as a starting point for distillation to the final 4B models. \n", |
fix punctuation and revise paragraph
This tutorial demonstrates how to perform depth-pruning, teacher fine-tuning, and distillation on Llama 3.1 8B using the WikiText-103-v1 dataset with NeMo Framework. The WikiText-103-v1 language modeling dataset comprises over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.
For this demonstration, we will perform teacher correction by running a light fine-tuning procedure on the `Meta Llama 3.1 8B` teacher model to generate a fine-tuned teacher model, `megatron_llama_ft.nemo`, needed for optimal distillation. This fine-tuned teacher model is then trimmed. There are two methods to prune a model: depth-pruning and width-pruning. We will explore both techniques, yielding `4b_depth_pruned_model.nemo` and `4b_width_pruned_model.nemo`, respectively. These models will serve as starting points for distillation to create the final distilled 4B models.
"\n", | ||
"> `NOTE:` Ensure that you run this notebook inside the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) which has all the required dependencies. \n", | ||
"\n", | ||
"**Instructions are available in the associated tutorial README to download the model and the container.**" |
revise note text and add a link to the README file
"Instructions for downloading the model and the container are available in the README."
"source": [ | ||
"---\n", | ||
"## Prerequisites\n", | ||
"Ensure you have the following -\n", |
revise text
Ensure you meet the prerequisites listed in this section.
"---\n", | ||
"## Prerequisites\n", | ||
"Ensure you have the following -\n", | ||
"1. **Get the teacher model**: Download the `Meta Llama 3.1 8B .nemo` model. You must follow the instructions in the associated README to download and mount the folder to the NeMo FW container." |
Use full NeMo Framework name
"1. **Get the teacher model**: Download the `Meta Llama 3.1 8B .nemo` model. You must follow the instructions in the associated README to download and mount the folder to the NeMo Framework container."
},
"source": [
"---\n",
"## Step-by-step instructions\n",
fix capitalization in heading
"## Step-by-Step Instructions\n",
"This workflow is structured into seven notebooks:\n", | ||
"1. [Prepare the dataset](./01_data_preparation.ipynb)\n", | ||
"2. [Finetune the teacher on the dataset](./02_teacher_finetuning.ipynb)\n", | ||
"3. Prune the finetuned-teacher model to create a student \n", |
fix punctuation
"3. Prune the fine-tuned teacher model to create a student\n",
"\n", | ||
"This workflow is structured into seven notebooks:\n", | ||
"1. [Prepare the dataset](./01_data_preparation.ipynb)\n", | ||
"2. [Finetune the teacher on the dataset](./02_teacher_finetuning.ipynb)\n", |
fix punctuation
" - 4.b. [Using width-pruned student](./04_b_distilling_width_pruned_student.ipynb)\n", | ||
"5. [Display the validation loss](./05_display_results.ipynb)\n", | ||
"\n", | ||
"> `NOTE:` We are exploring two methods to prune the finetuned teacher model: [depth-pruning](./03_a_depth_pruning.ipynb) and [width-pruning](./03_b_width_pruning.ipynb). Per the [tech report](https://arxiv.org/pdf/2408.11796), we can observe that width-pruning generally outperforms depth-pruning so users can choose to perform either [depth-pruning](./03_a_depth_pruning.ipynb) or [width-pruning](./03_b_width_pruning.ipynb) or both methods." |
fix punctuation
"> `NOTE:` We are exploring two methods to prune the fine-tuned teacher model: [depth-pruning](./03_a_depth_pruning.ipynb) and [width-pruning](./03_b_width_pruning.ipynb). Per the [tech report](https://arxiv.org/pdf/2408.11796), we can observe that width-pruning generally outperforms depth-pruning so users can choose to perform either [depth-pruning](./03_a_depth_pruning.ipynb) or [width-pruning](./03_b_width_pruning.ipynb) or both methods."
* Update pruning and distillation tutorial notebooks
* Update README
* Update batch size in width pruning script
* Update README

Signed-off-by: Gomathy Venkata Krishnan <gvenkatakris@nvidia.com>
What does this PR do?
Updates the pruning and distillation tutorial notebooks and adds a width-pruning notebook.