Add GPTQ Quantization #1216
Conversation
Thanks a lot for this work and integration! I left some initial comments ahead of @fxmarty's review, with respect to a potential transformers integration later on.
if not torch.cuda.is_available():
    raise RuntimeError("No GPU found. A GPU is needed to quantize model.")
Do we actually need a GPU to quantize the model? I don't have the details off the top of my head right now, but is it a strong requirement?
(Not considering the actual inference process, which I understand requires a GPU; maybe quantization/calibration can be done on CPU too?)
I don't think it is a strong requirement. I need to test, but you can have a look at the `fasterquant` method of the `GPTQ` class. It looks like it uses CUDA to speed up some operations (`cholesky`, `cholesky_inverse`, `matmul`), and I can see that it calls `torch.cuda.synchronize()`.
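For reference, a minimal sketch (not from the PR) illustrating that the operations mentioned above also run on CPU, so the GPU looks more like a speed-up than a hard requirement:

```python
import torch

# Run the GPTQ-style linear-algebra steps on whichever device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Symmetric positive-definite matrix standing in for the layer Hessian.
H = torch.randn(512, 512, device=device)
H = H @ H.T + 1e-2 * torch.eye(512, device=device)

L = torch.linalg.cholesky(H)        # cholesky
H_inv = torch.cholesky_inverse(L)   # cholesky_inverse
W = torch.randn(256, 512, device=device)
update = W @ H_inv                  # matmul

if device == "cuda":
    torch.cuda.synchronize()        # only meaningful on GPU
```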
It is in good shape, thank you for working on it!
I still think that using the exllama kernel would be very interesting (for reference: https://github.com/fxmarty/q4f16-gemm-gemv-benchmark & huggingface/text-generation-inference#553 (comment); the second one uses Triton, which I see is not used here), but integrating with AutoGPTQ is already a nice step (making sure that `cuda-old` is used when possible).
Is there a hard requirement on accelerate? If so, why?
I think it would be good to add documentation in this PR as well, and CI.
It is also not clear to me what should go in accelerate and optimum, if this PR puts a hard requirement on accelerate. Why not put it in accelerate directly? Why bitsandbytes in accelerate & this in optimum? I think it can be quite confusing to users.
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Looks great! I left a few questions/style comments
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
LGTM, thank you for adding it!
Just left taste comments / answers to the threads above. I believe this breaks with transformers 4.31 now, but maybe that's fine
Can you run
I've changed it so that we install from source. I will change it back after the release of transformers.
Very nice PR @SunMarc 🔥 🚀
Just commenting on the title of the new doc section.
Also, could you add a bullet point for GPTQ in this section please?
What does this PR do?
This PR adds the possibility to perform GPTQ quantization. I tried to be as generic as possible so that it supports any kind of model (whether it is a Transformers model or not). The backend relies on the auto_gptq library, where we use the `GPTQ` and `QuantLinear` classes. With this API you can convert a model, then save and load the quantized weights. We depend on `accelerate` to save the weights and to load the quantized weights efficiently, without allocating more memory than needed.

Quantize model:
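A sketch of what quantization could look like with this API (the `GPTQQuantizer` name, its arguments, and the `quantize_model` method are assumptions based on the description above, so the exact signatures may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer  # assumed import path

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Calibrate and quantize the weights to 4 bits with the GPTQ algorithm.
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)  # assumed arguments
quantized_model = quantizer.quantize_model(model, tokenizer)
```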
Save model
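Saving could then look like this (again, `quantizer.save` is an assumed method name):

```python
save_folder = "opt-125m-gptq"
# Serialize the quantized weights and the quantization config to disk.
quantizer.save(quantized_model, save_folder)
```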
Convert model and load quantized weights
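And loading: the model skeleton is first instantiated on the meta device via `accelerate`, its linear layers are converted to quantized layers, and the saved weights are loaded without allocating extra memory (the `load_quantized_model` helper name is an assumption):

```python
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.gptq import load_quantized_model  # assumed helper

# Create an empty (meta-device) model so no memory is allocated up front.
config = AutoConfig.from_pretrained("facebook/opt-125m")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
empty_model.tie_weights()

# Convert the linear layers to QuantLinear and load the quantized weights.
quantized_model = load_quantized_model(
    empty_model, save_folder="opt-125m-gptq", device_map="auto"
)
```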
PS: In a follow-up PR, I will use this API to integrate GPTQ quantization into transformers, so that GPTQ-quantized models are taken into account, just like what we did for `bitsandbytes`.

TODO: