
Knowledge distillation tutorial #1698

Merged · 9 commits merged into pytorch:main on Sep 27, 2024
Conversation

@lindawangg (Contributor) commented on Sep 27, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Changelog

What are the changes made in this PR?

  • Adds a knowledge distillation tutorial on how to distill Llama3.1 8B into Llama3.2 1B

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
(image attached)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot bot commented Sep 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1698

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4ab1789 with merge base a899da2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Sep 27, 2024

This guide will teach you about knowledge distillation (KD) and show you how you can use torchtune to distill a Llama3.1 8B model into Llama3.2 1B.
If you already know what knowledge distillation is and want to get straight to running your own distillation in torchtune,
you can jump to knowledge distillation recipe in torchtune, `knowledge_distillation_single_device.py <https://github.com/pytorch/torchtune/blob/main/recipes/knowledge_distillation_single_device.py>`_.
Collaborator:

nit

Suggested change:
- you can jump to knowledge distillation recipe in torchtune, `knowledge_distillation_single_device.py <https://github.com/pytorch/torchtune/blob/main/recipes/knowledge_distillation_single_device.py>`_.
+ you can jump to the knowledge distillation recipe in torchtune, `knowledge_distillation_single_device.py <https://github.com/pytorch/torchtune/blob/main/recipes/knowledge_distillation_single_device.py>`_.

Collaborator:

Actually, it might be better to just reference the latter parts of this tutorial for people who want to jump ahead rather than pointing to the recipe file,

Contributor Author:

Changed to link the tutorial section instead of recipe.


.. image:: /_static/img/kd-simplified.png

The total loss can be configured in many ways. The default KD config in torchtune combines CE loss with
Collaborator:

Suggested change:
- The total loss can be configured in many ways. The default KD config in torchtune combines CE loss with
+ The total loss can be configured in many ways. The default KD config in torchtune combines the cross-entropy (CE) loss with

.. image:: /_static/img/kd-simplified.png

The total loss can be configured in many ways. The default KD config in torchtune combines CE loss with
forward `Kullback-Leibler (KL) divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`_ loss,
Collaborator:

Suggested change:
- forward `Kullback-Leibler (KL) divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`_ loss,
+ the forward `Kullback-Leibler (KL) divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`_ loss,


The total loss can be configured in many ways. The default KD config in torchtune combines CE loss with
forward `Kullback-Leibler (KL) divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`_ loss,
which is used in standard KD approaches.Forward KL divergence aims to minimize the difference by forcing the student
@SalmanMohammadi (Collaborator) commented on Sep 27, 2024:

Suggested change:
- which is used in standard KD approaches.Forward KL divergence aims to minimize the difference by forcing the student
+ which is used in standard KD approaches. Forward KL divergence aims to minimize the difference by forcing the student's

Collaborator:

which difference are you referring to? : )

return -torch.sum(x * mask.view(-1), dim=0) / torch.sum(mask.view(-1), dim=0)

There are some details omitted to simplify the computation, but if you'd like to know more,
you can see the implementation in `ForwardKLLoss <https://github.com/pytorch/torchtune/blob/4234b78b914af23384ce0348f564e2119d107a96/torchtune/modules/loss/kd_losses.py>`_.
Collaborator:

should be able to do

Suggested change:
- you can see the implementation in `ForwardKLLoss <https://github.com/pytorch/torchtune/blob/4234b78b914af23384ce0348f564e2119d107a96/torchtune/modules/loss/kd_losses.py>`_.
+ you can see the implementation in :class:`torchtune.modules.loss.ForwardKLLoss`.
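For context, here is a minimal, self-contained sketch of what a masked forward KL loss can look like, consistent with the return line quoted above. It is a simplified illustration (the teacher-entropy term, which is constant with respect to the student, is dropped), not the exact torchtune implementation:

import torch
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits, mask):
    # Forward KL pushes the student's distribution to cover the teacher's.
    # Shapes: logits are [batch, seq, vocab]; mask is [batch, seq] and is
    # True for tokens that should contribute to the loss.
    teacher_prob = F.softmax(teacher_logits, dim=-1)
    student_logprob = F.log_softmax(student_logits, dim=-1)
    # Per-token cross term: sum over the vocab of p_teacher * log p_student.
    x = torch.sum(teacher_prob * student_logprob, dim=-1).flatten()
    mask = mask.flatten().float()
    # Masked mean over non-padding tokens, mirroring the quoted return line.
    return -torch.sum(x * mask, dim=0) / torch.sum(mask, dim=0)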

With torchtune, we can easily apply knowledge distillation to Llama3, as well as other LLM model families.
Let's take a look at how you could distill a model using torchtune's `KD recipe <https://github.com/pytorch/torchtune/blob/4234b78b914af23384ce0348f564e2119d107a96/recipes/knowledge_distillation_single_device.py>`_.

First, make sure that you have downloaded the Llama3 weights. For this example, we'll use the Llama3.1-8B as teacher and Llama3.2-1B as student.
@SalmanMohammadi (Collaborator) commented on Sep 27, 2024:

nit nit:

Suggested change:
- First, make sure that you have downloaded the Llama3 weights. For this example, we'll use the Llama3.1-8B as teacher and Llama3.2-1B as student.
+ First, make sure that you have downloaded all the model weights. For this example, we'll use the Llama3.1-8B as teacher and Llama3.2-1B as student.


tune download meta-llama/Llama-3.2-1B-Instruct --output-dir /tmp/Llama-3.2-1B-Instruct --ignore-patterns "original/consolidated.00.pth" --hf_token <HF_TOKEN>

Then, we will fine-tune the teacher model with using LoRA. Based on our experiments and previous work,
@SalmanMohammadi (Collaborator) commented on Sep 27, 2024:

nit nit (ignore if you like):

Suggested change:
- Then, we will fine-tune the teacher model with using LoRA. Based on our experiments and previous work,
+ Then, we will fine-tune the teacher model using LoRA. Based on our experiments and previous work,

Comment on lines 120 to 121
and `commonsense_qa <https://github.com/EleutherAI/lm-evaluation-harness/tree/b62b9bd/lm_eval/tasks/commonsense_qa>`_
through `EleutherEval <https://github.com/EleutherAI/lm-evaluation-harness/tree/main>`_.
Collaborator:

Suggested change:
- and `commonsense_qa <https://github.com/EleutherAI/lm-evaluation-harness/tree/b62b9bd/lm_eval/tasks/commonsense_qa>`_
- through `EleutherEval <https://github.com/EleutherAI/lm-evaluation-harness/tree/main>`_.
+ and `commonsense_qa <https://github.com/EleutherAI/lm-evaluation-harness/tree/b62b9bd/lm_eval/tasks/commonsense_qa>`_ tasks
+ through the EleutherAI `LM evaluation harness <https://github.com/EleutherAI/lm-evaluation-harness/tree/main>`_.


tune run knowledge_distillation_single_device --config llama3_2/knowledge_distillation_single_device

Ablation studies
Collaborator:

This whole section is awesome

Hyperparameter tuning: learning rate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the config has the learning rate as 3e-4, same as LoRA configs. For these experiments,
Collaborator:

Suggested change:
- By default, the config has the learning rate as 3e-4, same as LoRA configs. For these experiments,
+ By default, the config has the learning rate as ``3e-4`` which is the same as the LoRA configs. For these experiments,

Comment on lines 3 to 5
============================
Distilling Llama3 8B into 1B
============================
Collaborator:

Suggested change:
- ============================
- Distilling Llama3 8B into 1B
- ============================
+ ==================================================
+ Distilling Llama3.1 8B into Llama3.2 1B using Knowledge Distillation
+ ==================================================

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the config has the learning rate as 3e-4, same as LoRA configs. For these experiments,
we changed the learning rate from as high as 1e-3 to as low as 1e-5. To change the learning rate,
Collaborator:

mega nit: should be able to latexify this using something like :math:`1e^{-1}`

Contributor Author:

Changed all to use math.
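For example (illustrative markup only; the exact RST used in the final tutorial may differ), a rate written as 3e-4 can be expressed as :math:`3\times10^{-4}` so that Sphinx renders it as math.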


.. code-block:: bash

tune run knowledge_distillation_single_device --config llama3_2/knowledge_distillation_single_device optimizer.lr=[LR]
Collaborator:

I think it's clearer to provide a more concrete (vs general) example here

Suggested change:
- tune run knowledge_distillation_single_device --config llama3_2/knowledge_distillation_single_device optimizer.lr=[LR]
+ tune run knowledge_distillation_single_device --config llama3_2/knowledge_distillation_single_device optimizer.lr=1e-3

Hyperparameter tuning: KD ratio
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the config, we have the kd_ratio as 0.5, which gives even weightings to both the class and KD loss. In these experiments,
Collaborator:

Suggested change:
- In the config, we have the kd_ratio as 0.5, which gives even weightings to both the class and KD loss. In these experiments,
+ In the config, we have the ``kd_ratio`` as 0.5, which gives even weightings to both the class and KD loss. In these experiments,


.. code-block:: bash

tune run knowledge_distillation_single_device --config llama3_2/knowledge_distillation_single_device kd_ratio=[KD_RATIO]
Collaborator:

Same comment as above
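For readers skimming, the "even weighting" described above corresponds to a convex combination of the two loss terms. A plausible sketch of that weighting (the function and variable names here are illustrative, not quoted from the recipe code):

def total_kd_loss(class_loss, kd_loss, kd_ratio=0.5):
    # kd_ratio=0.5 weights the hard-label (class) loss and the
    # distillation (KD) loss equally, matching the default config.
    return (1 - kd_ratio) * class_loss + kd_ratio * kd_loss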

Qwen2 1.5B to 0.5B
^^^^^^^^^^^^^^^^^^

The KD recipe can also be applied to different model families as well. Here we look at the effect of KD when the number of
Collaborator:

mega nit:

Suggested change:
- The KD recipe can also be applied to different model families as well. Here we look at the effect of KD when the number of
+ The KD recipe can also be applied to different model families. Here we look at the effect of KD when the number of

^^^^^^^^^^^^^^^^^^

The KD recipe can also be applied to different model families as well. Here we look at the effect of KD when the number of
parameters between the teacher and student models are closer. For this experiment, we used Qwen2 1.5B and Qwen2 0.5B, which can be found in
Collaborator:

nit:

Suggested change:
- parameters between the teacher and student models are closer. For this experiment, we used Qwen2 1.5B and Qwen2 0.5B, which can be found in
+ parameters between the teacher and student models are closer. For this experiment, we used Qwen2 1.5B and Qwen2 0.5B, the configs for which can be found in

The KD recipe can also be applied to different model families as well. Here we look at the effect of KD when the number of
parameters between the teacher and student models are closer. For this experiment, we used Qwen2 1.5B and Qwen2 0.5B, which can be found in
`qwen2/knowledge_distillation_single_device <https://github.com/pytorch/torchtune/blob/4234b78b914af23384ce0348f564e2119d107a96/recipes/configs/qwen2/knowledge_distillation_single_device.yaml>`_
config. Here we see that alpaca_cleaned_dataset only improves truthful_qa performance and drops the metrics for the other evaluation tasks.
Collaborator:

Suggested change:
- config. Here we see that alpaca_cleaned_dataset only improves truthful_qa performance and drops the metrics for the other evaluation tasks.
+ config. Here we see that training on the alpaca cleaned dataset only improves truthful_qa performance and drops the metrics for the other evaluation tasks.

@SalmanMohammadi (Collaborator):

Thanks so much for adding this. I've been loosely following along with your KD work and it's really impressive.

Lots of small nits but this is an awesome tutorial. I especially love all the empirical results.

return -torch.sum(x * mask.view(-1), dim=0) / torch.sum(mask.view(-1), dim=0)

There are some details omitted to simplify the computation, but if you'd like to know more,
you can see the implementation in `ForwardKLLoss <https://github.com/pytorch/torchtune/blob/4234b78b914af23384ce0348f564e2119d107a96/torchtune/modules/loss/kd_losses.py>`_.
Collaborator:

Is it perhaps worth mentioning that we actually use ForwardKLWithChunkedOutputLoss by default?

Contributor Author:

added a note that we used the chunked loss to save memory.

What is Knowledge Distillation?
-------------------------------

`Knowledge Distillation <https://arxiv.org/pdf/1503.02531>`_ is is a widely used compression technique
Contributor:

Suggested change:
- `Knowledge Distillation <https://arxiv.org/pdf/1503.02531>`_ is is a widely used compression technique
+ `Knowledge Distillation <https://arxiv.org/pdf/1503.02531>`_ is a widely used compression technique

How does Knowledge Distillation work?
-------------------------------------

Knowledge is transferred from the teacher to student model by training it on a transfer set and where the
Contributor:

Suggested change:
- Knowledge is transferred from the teacher to student model by training it on a transfer set and where the
+ Knowledge is transferred from the teacher to student model by training it on a transfer set where the
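To make the mechanics concrete, here is a minimal, hypothetical sketch of one distillation forward pass (the names are illustrative, not the recipe's actual code): the frozen teacher produces logits on the same batch, and the student is trained against both the ground-truth labels and the teacher's output distribution.

import torch

def distillation_losses(student, teacher, tokens, labels, ce_loss_fn, kd_loss_fn):
    # The teacher only supplies soft targets, so it runs without gradients.
    with torch.no_grad():
        teacher_logits = teacher(tokens)
    student_logits = student(tokens)
    # Hard-label objective: standard next-token cross-entropy.
    class_loss = ce_loss_fn(student_logits, labels)
    # Soft-label objective: match the teacher's token-level distribution.
    kd_loss = kd_loss_fn(student_logits, teacher_logits, labels)
    # The two terms are then combined with the kd_ratio weighting sketched earlier.
    return class_loss, kd_loss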

Comment on lines 81 to 82
you can see the implementation in :class:`torchtune.modules.loss.ForwardKLLoss`.
By default, the KD configs use :class:`torchtune.modules.loss.ForwardKLWithChunkedOutputLoss` to reduce memory.
Contributor:

I think you need to add these APIs to the rst file here. This way it'll render as an actual pointer to our API ref

Contributor Author:

added. Thanks for the pointer.
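For readers curious about the chunked variant referenced here (:class:`torchtune.modules.loss.ForwardKLWithChunkedOutputLoss`), the snippet below is only a rough, hypothetical illustration of why chunking lowers peak memory, not the torchtune implementation: processing the sequence a slice at a time avoids materializing the softmax over the large vocabulary for every token at once, while producing the same masked mean as the unchunked sketch earlier.

import torch
import torch.nn.functional as F

def chunked_forward_kl(student_logits, teacher_logits, mask, num_chunks=8):
    # Rough sketch of the chunking idea only.
    total = torch.zeros((), device=student_logits.device)
    count = torch.zeros((), device=student_logits.device)
    for s_chunk, t_chunk, m_chunk in zip(
        student_logits.chunk(num_chunks, dim=1),
        teacher_logits.chunk(num_chunks, dim=1),
        mask.chunk(num_chunks, dim=1),
    ):
        # Softmax/log-softmax are computed for one chunk of tokens at a time.
        teacher_prob = F.softmax(t_chunk, dim=-1)
        student_logprob = F.log_softmax(s_chunk, dim=-1)
        per_token = torch.sum(teacher_prob * student_logprob, dim=-1)
        m = m_chunk.float()
        total = total - torch.sum(per_token * m)
        count = count + m.sum()
    # Same masked mean over non-padding tokens as the unchunked version.
    return total / count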

@SalmanMohammadi (Collaborator):

LGTM - please acquire Evan's blessing before merging.

@ebsmothers (Contributor) left a comment:

🙌

@ebsmothers merged commit 7c4c629 into pytorch:main on Sep 27, 2024 · 17 checks passed