Adding quantization support in torchtune #632
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/632
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 4abcdb9 with merge base 45031b3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
recipes/quantize.py
Outdated
# Ensure the cache is setup on the right device
with self._device:
    model.setup_caches(max_batch_size=1, dtype=self._dtype)
We don't need to set up caches for quantization?
torchtune/utils/quantization.py
Outdated
@@ -0,0 +1,47 @@
from typing import Any
Can we add some tests for this file? I'm nervous about adding utilities without any associated tests
We should also add some tests to make sure the model is as expected after quantization. This will help catch any breakages in APIs/behaviors if torchao changes.
@rohan-varma any ideas around testing?
@kartikayk I changed these to just use classes directly and they will be tested in torchao, is that OK?
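For context, a minimal sketch (config values assumed, not taken from this PR) of what "using the classes directly" looks like from the recipe side: the quantizer named in the config is instantiated as-is, so the quantizer implementations and their unit tests can live in torchao rather than torchtune.

```python
from omegaconf import OmegaConf
from torchtune import config

# Hypothetical config fragment mirroring the quantizer block in recipes/configs/quantize.yaml
cfg = OmegaConf.create(
    {
        "quantizer": {
            "_component_": "torchtune.utils.Int4WeightOnlyQuantizer",
            "groupsize": 256,
        }
    }
)

# The recipe just builds whatever class the config points at; there is no extra
# torchtune-side wrapper that would need its own tests.
quantizer = config.instantiate(cfg.quantizer)
```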
recipes/configs/quant_generate.yaml
Outdated
@@ -0,0 +1,29 @@
I don't think we need a separate config here. We should make it clear in our documentation and tutorial how to load a quantized model. It's just a checkpointer change, so I'd remove this.
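As a rough illustration of "just a checkpointer change" (the paths and filenames below are hypothetical), loading a quantized model would only mean pointing the existing checkpointer at the quantized checkpoint and disabling weights-only loading:

```python
from torchtune import utils

# Hypothetical checkpoint locations; only the checkpoint_files entry differs
# from the ordinary generate/eval setup.
checkpointer = utils.FullModelTorchTuneCheckpointer(
    checkpoint_dir="/tmp/llama2",
    checkpoint_files=["meta_model_0-4w.pt"],  # hypothetical quantized checkpoint
    model_type="LLAMA2",
    output_dir="/tmp/llama2",
)

# Quantized tensors are not plain torch.Tensor payloads, hence weights_only=False.
checkpoint_dict = checkpointer.load_checkpoint(weights_only=False)
```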
recipes/eleuther_eval.py
Outdated
@@ -124,20 +124,25 @@ class EleutherEvalRecipe(EvalRecipeInterface):
     def __init__(self, cfg: DictConfig) -> None:
         self._cfg = cfg

-    def load_checkpoint(self, checkpointer_cfg: DictConfig) -> Dict[str, Any]:
+    def load_checkpoint(self, checkpointer_cfg: DictConfig, weights_only: bool = True) -> Dict[str, Any]:
@joecummings can I get a review for this file? Trying to think about the best way to integrate the quantization changes for eval
Noob q: why would we want to quantize the model in eval?
This is mostly to load a quantized model for evaluation. The flow here is something like this (a rough sketch of the quantize step follows this list):
- Finetune + Eval
- Quantize
- Eval with the quantized model to make sure it is still performant
- Run inference with the quantized model to make sure it's not doing something crazy
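A rough sketch of the "Quantize" step in this flow (the helper name, `cfg` fields, and output filename are assumptions for illustration): load the finetuned weights, apply the quantizer named in the config, and save the quantized state dict so the eval and generate recipes can load it afterwards.

```python
import torch
from omegaconf import DictConfig
from torchtune import config


def quantize_checkpoint(cfg: DictConfig, checkpoint_dict: dict, out_path: str) -> None:
    """Hypothetical helper showing the core of the quantize step."""
    model = config.instantiate(cfg.model)          # same model component used for finetuning
    model.load_state_dict(checkpoint_dict)         # finetuned, still-unquantized weights
    quantizer = config.instantiate(cfg.quantizer)  # e.g. Int4WeightOnlyQuantizer
    model = quantizer.quantize(model)
    torch.save(model.state_dict(), out_path)       # e.g. "meta_model_0-4w.pt"
```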
any updates on this? @kartikayk @joecummings
recipes/quantize.py
Outdated
def __init__(self, cfg: DictConfig) -> None:
    self._device = utils.get_device(device=cfg.device)
    self._dtype = utils.get_dtype(dtype=cfg.dtype)
    self._quantization_mode = cfg.quantization_mode
Related to a couple of Kartikay's comments: we should be very explicit about what the supported quantization modes are here.
I added some docs in the quantize.yaml file; please let me know if that's a good place to host them.
# Quantization specific args
quantizer:
  _component_: torchtune.utils.Int4WeightOnlyQuantizer
  groupsize: 256
Thanks for adding this! Do you mind adding some comments explaining what these are here? It's OK to point to the docstring for more info, but add a line or two about what this block is doing.
added these in recipes/configs/quantize.yaml
recipes/eleuther_eval.py
Outdated
@@ -33,6 +33,16 @@
    sys.exit(1)


def _parent_name(target):
Where is this used? I didn't find this in the file below
oh, will remove; this was used to work around issues with the tensor subclass
        return model

    @torch.no_grad()
    def quantize(self, cfg: DictConfig):
So when will the failure be surfaced? At the time the config is parsed?
This looks awesome! A few questions and suggestions.
Mind adding some screenshots for:
a) Eval with quantized model on any one task
b) Screenshot of inference with the prompt including memory consumption and tokens/sec.
Both of these will be helpful in the future and act as a reference.
recipes/README.md
Outdated

`receipes/configs/quantize.yaml`

We'll publish doc pages for different quantizers in torchao a bit later. For int4 weight only gptq quantizer, here is a brief description of what each argument menas:
Suggested change:
- We'll publish doc pages for different quantizers in torchao a bit later. For int4 weight only gptq quantizer, here is a brief description of what each argument menas:
+ We'll publish doc pages for different quantizers in torchao a bit later. For int4 weight only gptq quantizer, here is a brief description of what each argument means:
Also, it seems like this is missing some information, i.e. I assumed there would be a description of what each argument is :)
oh sorry, I moved these to quantize.yaml; I'll reword this
recipes/README.md
Outdated
`recipes/eleuther_eval.py`

# to skip running through GPTQ, change model = quantizer.quantize(model) to:
what does "running through GPTQ" mean?
OK, I'll change this a bit and also add a brief description of GPTQ.
# Args:
# `groupsize` (int): a parameter of int4 weight only quantization,
# it refers to the size of quantization groups which get independent quantization parameters
# e.g. 32, 64, 128, 256, smaller numbers means more fine grained and higher accuracy
I'm guessing smaller numbers also mean more memory? Is that right?
yeah smaller numbers will cost more memory, but it's probably not too significant overall
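A back-of-the-envelope check of that memory trade-off (the layer shape and 2-byte scale/zero-point storage are assumptions for illustration): int4 weight-only quantization keeps one scale and one zero point per group, so smaller groups mean more of these parameters alongside the packed weights.

```python
rows, cols = 4096, 4096                      # a hypothetical 7B-sized linear layer
packed_mb = rows * cols * 0.5 / 1e6          # 4 bits per weight -> ~8.4 MB packed
for groupsize in (32, 64, 128, 256):
    n_groups = rows * cols // groupsize
    overhead_mb = n_groups * 2 * 2 / 1e6     # scale + zero point, ~2 bytes each
    print(f"groupsize={groupsize:3d}: ~{overhead_mb:.2f} MB of scales/zeros "
          f"vs ~{packed_mb:.1f} MB of packed int4 weights")
```

Even at groupsize 32 the extra scales/zeros add roughly 2 MB per ~8 MB of packed weights, which matches the point above that the cost grows but stays modest.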
# multiple of groupsize.
# `percdamp`: GPTQ stablization hyperparameter, recommended to be .01
#
# future note: blocksize and percdamp should not have to be 'known' by users by default.
Since most users don't need it, should we just remove this from the config and add these as defaults to the instantiate call? Or maybe not even expose these at all? WDYT?
Makes sense, maybe we can remove these in a future release? We did the branch cut today. cc @HDCharles
recipes/eleuther_eval.py
Outdated
@@ -150,10 +153,15 @@ def _setup_model(
    model_cfg: DictConfig,
    model_state_dict: Dict[str, Any],
) -> nn.Module:
-   with utils.set_default_dtype(self._dtype), self._device:
+   if self._quantization_mode is not None:
This means that we'll init the model in fp32 instead of say bf16. Is that by design? I only ask because this will double the model memory at init time. Is quantization from bf16 not supported?
oh, it is supported; we could init with bf16. I can change this back to init under self._dtype and the device.
recipes/eleuther_eval.py
Outdated
    model.load_state_dict(model_state_dict)
else:
    with utils.set_default_dtype(self._dtype), self._device:
        model = config.instantiate(model_cfg)
    model.load_state_dict(model_state_dict)
Suggested change:
-    model.load_state_dict(model_state_dict)
-else:
-    with utils.set_default_dtype(self._dtype), self._device:
-        model = config.instantiate(model_cfg)
-    model.load_state_dict(model_state_dict)
+else:
+    with utils.set_default_dtype(self._dtype), self._device:
+        model = config.instantiate(model_cfg)
+model.load_state_dict(model_state_dict)
`model.load_state_dict(model_state_dict)` can just be moved outside the if-else block. If we can init the model in bf16 for quantization, then this if-else block can be further simplified.
sounds good
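A minimal sketch of the simplified `_setup_model` body being discussed (a fragment using the helper and attribute names from the snippets above; the exact final structure is assumed): initialize under the configured dtype and device, quantize only when a quantizer is configured, then load the state dict once outside the if-else.

```python
with utils.set_default_dtype(self._dtype), self._device:
    model = config.instantiate(model_cfg)
if self._quantization_mode is not None:
    # Swap in the quantized module structure first so the quantized
    # checkpoint's tensors match what load_state_dict expects.
    model = self._quantizer.quantize(model)
    model = model.to(device=self._device, dtype=self._dtype)
model.load_state_dict(model_state_dict)
```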
if self._quantization_mode is not None:
    model = self._quantizer.quantize(model)
    model = model.to(device=self._device, dtype=self._dtype)
So it seems like initializing the model in bf16 is fine? Can we do the same for eval too?
yeah, done
@@ -9,4 +9,4 @@ tqdm
omegaconf

# Quantization
-torchao-nightly==2024.3.29
+torchao==0.1
Awesome!!
Thanks for adding this functionality and for patiently addressing all of the comments!
Summary:
Allows the user to specify quantization_mode when generating a model in full_finetune_single_device.py and when running inference with the quantized model in generate.py.
Test Plan:
Tested locally.
May need changes to the corresponding yaml files; see README.md for more info.
Results of generate for int4 weight only quantized model:
Results of eval for int4 weight only quantized model:
Reviewers:
Subscribers:
Tasks:
Tags: