Add GPTQ Quantization #1216
Conversation
Thanks a lot for this work and integration! I left some initial comments ahead of @fxmarty's review, with respect to a potential transformers integration later on.
if not torch.cuda.is_available():
    raise RuntimeError("No GPU found. A GPU is needed to quantize model.")
Do we actually need a GPU to quantize the model? I don't have the details off the top of my head right now, but is it a strong requirement?
(Not considering the actual inference process, which I understand requires a GPU; maybe quantization/calibration can be done on CPU too?)
I don't think it is a strong requirement. I need to test, but you can have a look at the `fasterquant` method of the `GPTQ` class. It looks like it uses CUDA to speed up some operations (`cholesky`, `cholesky_inverse`, `matmul`), and I can see that it calls `torch.cuda.synchronize()`.
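For reference, a minimal sketch (not from the PR) illustrating that the operations mentioned above also run on CPU, so the GPU looks more like a speed-up than a hard requirement:

```python
import torch

# Run the GPTQ-style linear-algebra steps on whichever device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Symmetric positive-definite matrix standing in for the layer Hessian.
H = torch.randn(512, 512, device=device)
H = H @ H.T + 1e-2 * torch.eye(512, device=device)

L = torch.linalg.cholesky(H)        # cholesky
H_inv = torch.cholesky_inverse(L)   # cholesky_inverse
W = torch.randn(256, 512, device=device)
update = W @ H_inv                  # matmul

if device == "cuda":
    torch.cuda.synchronize()        # only meaningful on GPU
```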
It is in good shape, thank you for working on it!
I still think that using the exllama kernel would be very interesting (for reference: https://github.com/fxmarty/q4f16-gemm-gemv-benchmark & huggingface/text-generation-inference#553 (comment); the second one uses Triton, which I see is not used here), but integrating with AutoGPTQ is already a nice step (making sure that `cuda-old` is used when possible).
Is there a hard requirement on accelerate? If so, why?
I think it would be good to add documentation in this PR as well, and CI.
It is also not clear to me what should go in accelerate and optimum, if this PR puts a hard requirement on accelerate. Why not put it in accelerate directly? Why bitsandbytes in accelerate & this in optimum? I think it can be quite confusing to users.
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Looks great! I left a few questions/style comments
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
LGTM, thank you for adding it!
Just left taste comments / answers to the threads above. I believe this breaks with transformers 4.31 now, but maybe that's fine
Can you run
I've changed it so that we install from source. I will change it back after the release of transformers.
Very nice PR @SunMarc 🔥 🚀
Just commenting on the title of the new doc section.
Also, could you add a bullet point for GPTQ in this section please?
What does this PR do?
This PR adds the possibility to perform GPTQ quantization. I tried to be as generic as possible so that it supports any kind of model (whether it is a Transformers model or not). The backend relies on the auto_gptq library, where we use the `GPTQ` and `QuantLinear` classes. With this API you can convert a model, then save and load the quantized weights. We depend on `accelerate` to save the weights and to load the quantized weights efficiently, without allocating more memory than needed.

Quantize model:
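A sketch of what quantization could look like with this API (the `GPTQQuantizer` name, its arguments, and the `quantize_model` method are assumptions based on the description above, so the exact signatures may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer  # assumed import path

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Calibrate and quantize the weights to 4 bits with the GPTQ algorithm.
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)  # assumed arguments
quantized_model = quantizer.quantize_model(model, tokenizer)
```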
Save model
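Saving could then look like this (again, `quantizer.save` is an assumed method name):

```python
save_folder = "opt-125m-gptq"
# Serialize the quantized weights and the quantization config to disk.
quantizer.save(quantized_model, save_folder)
```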
Convert model and load quantized weights
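And loading: the model skeleton is first instantiated on the meta device via `accelerate`, its linear layers are converted to quantized layers, and the saved weights are loaded without allocating extra memory (the `load_quantized_model` helper name is an assumption):

```python
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.gptq import load_quantized_model  # assumed helper

# Create an empty (meta-device) model so no memory is allocated up front.
config = AutoConfig.from_pretrained("facebook/opt-125m")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
empty_model.tie_weights()

# Convert the linear layers to QuantLinear and load the quantized weights.
quantized_model = load_quantized_model(
    empty_model, save_folder="opt-125m-gptq", device_map="auto"
)
```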
PS: In a follow-up PR, I will use this API to integrate GPTQ quantization into transformers, so that GPTQ-quantized models are taken into account, just like what we did for `bitsandbytes`.

TODO: