
Add Auto-Round support #581

Merged: 77 commits into pytorch:main on Sep 4, 2024
Conversation

yiliu30 (Contributor) commented Jul 31, 2024

Resolves #533

Description

  • Integrated Auto-Round with the quantize_ API using hooks + MultiTensor.
  • Exported the optimized qweight to AffineQuantizedTensor to leverage the tinygemm and Uintx kernels.
  • Evaluated the accuracy of Llama2/3/3.1 on 5 popular lm-eval tasks (more tests are on the way).
  • Added Auto-Round to the generation benchmarking for Llama2/3 (Llama 3.1 not yet tested, as it landed only a few days ago).
  • Small fix for the Llama model (#769).

Usage

from torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_
from torchao.prototype.autoround.core import apply_auto_round
from torchao.prototype.autoround.multi_tensor import MultiTensor
from torchao.quantization import quantize_

# Step 1. Prepare: attach hooks that will drive Auto-Round on each target block.
prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

# Step 2. Calibrate: wrap all calibration batches into a single MultiTensor
# and run one forward pass over them.
input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(device))

multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

# Step 3. Convert the optimized weights to AffineQuantizedTensor.
quantize_(model, apply_auto_round(), is_target_module)

For end-to-end (E2E) examples, please refer to the README.md.
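For readers trying the snippet above outside the E2E scripts, here is a minimal sketch of the helper names it assumes (is_target_module, device, and dataloader below are hypothetical stand-ins, not part of this PR):

import torch
from torch.utils.data import DataLoader

# Hypothetical predicate selecting the decoder blocks Auto-Round should
# optimize as a unit; the PR's Llama example uses
#   is_target_module = lambda mod, fqn: isinstance(mod, TransformerBlock)
def is_target_module(module: torch.nn.Module, fqn: str) -> bool:
    return module.__class__.__name__ == "LlamaDecoderLayer"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy calibration loader yielding dicts with an "input_ids" key; the default
# collate_fn stacks the per-sample tensors into a batch.
dataloader = DataLoader(
    [{"input_ids": torch.randint(0, 32000, (512,))} for _ in range(8)],
    batch_size=4,
)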

cc @thuang6 @ftian1 @wenhuach21

pytorch-bot commented Jul 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/581

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 96f745d with merge base 05224a9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Jul 31, 2024.
yiliu30 (Contributor, author) commented Jul 31, 2024

Hi @jerryzh168 @msaroufim, I’m reaching out to request a preliminary review of this PR. Although some refactoring is still in progress, I’d like to get your feedback to ensure we’re on the right track before moving forward.

This draft PR includes:

  1. An end-to-end example that quantizes facebook/opt-125m with Auto-Round optimized qweight, scales, and zeros, and performs inference with torchao's AffineQuantizedTensor.
  2. Cleaned-up dependencies of auto-round in the patch-for-ao-2 branch.

Some TODOs:

  3. Reduce the GPU memory consumption.
  4. Support other bits and data types (currently, the weight bits are hardcoded to 4, and activations are not quantized).
  5. Further refactor auto-round.
  6. Rearrange the code structure.

Regarding 3) GPU memory consumption: in the current flow, I use hooks to capture the inputs and outputs of each block during the calibration stage. This approach differs from the original auto-round implementation, which captures only the input of the first decoder block and delays block inference to the quantize stage (similar to AutoAWQ's implementation). The implementation in this PR introduces some limitations: a) GPU memory consumption is quite large when the calibration dataset is large; b) we cannot use the output of a previously quantized block as the input to the following block.
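To make the capture step concrete, here is a minimal, self-contained sketch (toy model and names, not this PR's actual hook code) of recording each target block's inputs with forward hooks:

import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a decoder block."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = nn.Sequential(Block(), Block())
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Keep a CPU copy of the block's input for the later optimization
        # stage; storing every batch is what makes memory usage grow.
        captured.setdefault(name, []).append(inputs[0].detach().cpu())
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, Block)]

with torch.no_grad():
    for _ in range(4):  # stand-in calibration batches
        model(torch.randn(2, 16))

for h in handles:
    h.remove()

print({k: len(v) for k, v in captured.items()})  # each block captured 4 inputs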

This approach is mainly meant to align with the static quantization flow and use the quantize_ API. I wonder if you would be open to refactoring the flow a bit to resolve these limitations, or if you have other suggestions? I think AutoAWQ might also need similar adjustments.

yiliu30 (Contributor, author) commented Aug 1, 2024

Hi @jerryzh168, for 3), I noticed that GPTQ has a similar complication (#577):

"Instead, we want to run the model for each input, but ONLY up to the first linear, then pause, do the algorithm to update the weight, get the outputs for the updated weight and then, unpause and continue on until we hit the next linear… etc."

The main difference is that GPTQ handles a single Linear layer, whereas auto-round works on a whole decoder block (it may also work on a single Linear layer when quantizing the lm-head).

Inspired by HDCharles's proposal, I tried to extend it to auto-round. Building on MultiTensor, the remaining issue was enabling the dispatcher to identify the decoder block, such as OPTDecoderLayer. I resolved this by defining a customized operation called general_decoder and swapping all decoder blocks with it. Then we perform inference with the calibration dataset; when the dispatcher encounters general_decoder, it jumps to auto-round's optimization process with all accumulated inputs and returns the outputs of the optimized (or original) model.

I have prepared a full demo here. Could you please take a look? Thanks a lot!
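For illustration, a self-contained sketch of the custom-op swap described above (the namespace, schema, and wrapper below are assumptions for demonstration, not the demo's exact code):

import torch
import torch.nn as nn

# Register a marker op so it is reachable as
# torch.ops.transformers_ops.general_decoder and checkable by identity.
lib = torch.library.Library("transformers_ops", "DEF")
lib.define("general_decoder(Tensor hidden_states) -> Tensor")

def _general_decoder_eager(hidden_states):
    # Placeholder: in the real flow this is where the original (or freshly
    # optimized) decoder block runs on the accumulated calibration inputs.
    return hidden_states

lib.impl("general_decoder", _general_decoder_eager, "CompositeExplicitAutograd")

class GeneralDecoderWrapper(nn.Module):
    """Swapped in for each decoder block so calls route through the marker op."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, hidden_states):
        return torch.ops.transformers_ops.general_decoder(hidden_states)

x = torch.randn(2, 8)
wrapper = GeneralDecoderWrapper(nn.Identity())
assert torch.equal(wrapper(x), x)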

jerryzh168 (Contributor) commented

@yiliu30 sorry for the late reply. I think using MultiInput from @HDCharles's GPTQ issue makes sense for your use case, since the Auto-Round flow is similar to the GPTQ flow but does not fit into the static quant flow (with observers) very well.

jerryzh168 (Contributor) commented

one small nit for the "general_decoder": we can use

if func is torch.ops.transformers_ops.general_decoder:
    outputs = optimize_decoder(func, grouped_args, spec)

instead of looking at func.__name__

also after this is done, I think we can improve our current utils for operator implementation:

def _implements(cls, aten_ops_or_torch_fns):
    """Use this decorator to implement a function for an aten op in __torch_dispatch__
    (if the user passed in a list of ops)
    or a torch function in __torch_function__ (if the user passed in a single object)

    class MyTensor(torch.Tensor):
        ...
        implements = classmethod(_implements)

    implements = MyTensor.implements

    @implements(torch.nn.functional.linear)
    def _(func, types, args, kwargs):
        ...
    """
    if not hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE"):
        cls._ATEN_OP_OR_TORCH_FN_TABLE = {}

    if not isinstance(aten_ops_or_torch_fns, (list, tuple)):
        aten_ops_or_torch_fns = [aten_ops_or_torch_fns]

    def decorator(func):
        for op in aten_ops_or_torch_fns:
            @functools.wraps(op)
            def wrapper(*args, **kwargs):
                return func(*args, **kwargs)

            cls._ATEN_OP_OR_TORCH_FN_TABLE[op] = wrapper
        return func

    return decorator


def _dispatch__torch_function__(cls, func, types, args=(), kwargs=None):
    """Use this util function for a common `__torch_function__` implementation
    that dispatches to ops/functions registered with `_implements`

    class MyTensor(torch.Tensor):
        ...
        __torch_function__ = classmethod(_dispatch__torch_function__)
    """
    kwargs = {} if kwargs is None else kwargs
    if hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE") and \
            func in cls._ATEN_OP_OR_TORCH_FN_TABLE:
        return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, args, kwargs)

    with torch._C.DisableTorchFunctionSubclass():
        return func(*args, **kwargs)


def _dispatch__torch_dispatch__(cls, func, types, args, kwargs):
    """Use this util function for a common `__torch_dispatch__` implementation
    that dispatches to ops/functions registered with `_implements`

    class MyTensor(torch.Tensor):
        ...
        __torch_dispatch__ = classmethod(_dispatch__torch_dispatch__)
    """
    if hasattr(cls, "_ATEN_OP_OR_TORCH_FN_TABLE") and \
            func in cls._ATEN_OP_OR_TORCH_FN_TABLE:
        return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, args, kwargs)

    raise NotImplementedError(f"{cls.__name__} dispatch: attempting to run unimplemented operator/function: {func}")

and incorporate this use case so you can reduce boilerplate code
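A small runnable sketch of how these utilities wire into a tensor subclass, assuming the functions above are in scope (names are illustrative):

import functools  # required by _implements above
import torch

class MyTensor(torch.Tensor):
    implements = classmethod(_implements)
    __torch_function__ = classmethod(_dispatch__torch_function__)

@MyTensor.implements(torch.nn.functional.linear)
def _(func, types, args, kwargs):
    print("intercepted F.linear")
    # Re-run the original function with subclass dispatch disabled to avoid
    # infinite recursion.
    with torch._C.DisableTorchFunctionSubclass():
        return func(*args, **kwargs)

x = torch.randn(2, 4).as_subclass(MyTensor)
w = torch.randn(3, 4)
y = torch.nn.functional.linear(x, w)  # prints, then computes normally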

jerryzh168 (Contributor) left a review: requested some changes

wenhuach21 commented
I was curious about the compute dtype supported by the AO kernel. If it only supports FP16, I recommend forcing the dtype to FP16 before passing it to AutoRound. However, if BF16 is also supported, it would be preferable to set the scale_type in AutoRound to align with the original model.

Additionally, the accuracy data slightly differs from the results of our recipe, which may not be solely due to changes in hyperparameters. We should investigate this further.

jerryzh168 (Contributor) commented, replying to the dtype question above:

It depends on the kernel; the int4 weight-only path that uses the tinygemm kernel only supports bfloat16, I think.
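For example, a sketch of the bfloat16 requirement with the plain int4 weight-only path (this uses torchao's public quantize_/int4_weight_only; exact names may vary by torchao version, and the tinygemm kernel needs CUDA):

import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

# The tinygemm-backed int4 weight-only kernel expects bfloat16
# weights/activations, so cast the model before quantizing.
model = nn.Sequential(nn.Linear(128, 128)).to(torch.bfloat16).cuda()
quantize_(model, int4_weight_only(group_size=128))

x = torch.randn(4, 128, dtype=torch.bfloat16, device="cuda")
y = model(x)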

Contributor review comment on the README's "## End-to-End Results" section, after the line `quantize_(model, apply_auto_round(), is_target_module)`:

so what about performance results?

jerryzh168 (Contributor) left a review:

code changes look good to me; one comment is just to include performance data (token/s, memory, etc.) in the README as well, similar to https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks

yiliu30 (Contributor, author) commented Aug 28, 2024

The benchmark depends on #769.

yiliu30 mentioned this pull request Sep 3, 2024
Contributor review comment on an example script, on the lines

    else:
        is_target_module = lambda mod, fqn: isinstance(mod, TransformerBlock)
        quantize_model_with_autoround_(

nit: should we just use the same flow everywhere to reduce confusion; the flow in https://github.com/pytorch/ao/pull/581/files#diff-af129d63635a3b5b0a95f1a3831f852fbd7bedfd66b38d41bf4975fb49aad246 would be the recommended one, I think

jerryzh168 (Contributor) commented

Thanks @yiliu30 for addressing all the comments!

jerryzh168 merged commit f5703b0 into pytorch:main on Sep 4, 2024. 17 checks passed.
yiliu30 (Contributor, author) commented Sep 4, 2024

@jerryzh168 Thanks for your patient guidance and detailed examples. This joint effort will allow more users to benefit from AO and auto-round!

jerryzh168 pushed a commit to jerryzh168/ao that referenced this pull request Sep 4, 2024

* initial flow for autoround
* update flow
* use int4 kernel
* remove debug code
* update the forward
* clean code
* e2e example
* refine code
* add requirements for test
* update test
* update the readme
* add readme
* update the filenames
* update the np version
* add demo
* format
* add more docs
* format
* add doc
* use `AffineQuantizedTensor`
* impl ar using multensors
* clean code
* use hook + multensors
* separate mul_tensors into a new file
* fix typos
* rename mul_tensor to multi_tensor
* enable amp
* eval model
* add gen examples
* add warmup to benchmark
* add benchmark
* clean code
* format code
* use tiny kernel
* add more note
* format
* correct typos
* remove hard code
* use intx
* enable offload for multitensor
* update the default config
* refine note
* update the version check
* format
* update
* add ut
* format
* add scripts
* format code
* format
* update
* fix typo
* refine bench code
* Enable `use_optimized_layer_output` and AO's llama (pytorch#12)
* Refine the Doc (pytorch#14)
* add more docstring
* add paper link
* correct some note
* add cmd
* update the scripts
* revert some change
* Add a lightweight configuration for quick benchmarking (pytorch#15)
* update quant method name
* Wrap model's buffers and params to `MultiTensor` & update the results (pytorch#16)