Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
There is an outstanding issue where model support is not guaranteed for inference even if the GPTQ QuantLinear selection matches all of the bits, sym, group_size, and desc_act properties. The user has no way of knowing until they run the model: even a correctly quantized model can crash at inference, with little ability to auto-correct.
Bug origin: most of the GPTQ kernels are written/optimized with specific divisibility requirements on in_features, out_features, and group_size. This has been mostly fixed/bypassed in GPTQModel, and a revised solution is needed for transformers/optimum.
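For illustration only, here is a minimal sketch of the kind of shape validation that is currently missing before a QuantLinear kernel is selected. The function name and the exact divisibility rules (a pack factor of 8, group_size dividing in_features) are assumptions for this example, not the actual optimum or GPTQModel API:

```python
# Hypothetical pre-selection check: report why a GPTQ kernel would fail
# for a given layer shape, instead of crashing at inference time.
def validate_gptq_shapes(in_features, out_features, group_size, pack_factor=8):
    """Return a list of human-readable reasons a kernel would fail, if any."""
    problems = []
    if group_size != -1 and in_features % group_size != 0:
        problems.append(f"in_features={in_features} not divisible by group_size={group_size}")
    if in_features % pack_factor != 0:
        problems.append(f"in_features={in_features} not divisible by pack factor {pack_factor}")
    if out_features % pack_factor != 0:
        problems.append(f"out_features={out_features} not divisible by pack factor {pack_factor}")
    return problems

# Example: a layer that quantizes fine but would crash at inference if a
# kernel assumes out_features % 8 == 0.
print(validate_gptq_shapes(in_features=4096, out_features=11004, group_size=128))
```

If a check like this ran at kernel-selection time, the loader could fall back to a more permissive kernel or raise a clear error instead of failing inside the kernel.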
@SunMarc We plan to submit a PR/proposal to fix this after PR #2064, since our proposed changes would further complicate/delay that review process.
Expected behavior
All GPTQ-quantized models on HF should run without error.
In addition to the technical requirements imposed by the kernels (see the first post), there are assumptions in the current GPTQ code in optimum/transformers that may not make sense moving forward:
Single QuantLinear class for entire model
Single quantization config applied to entire model
Neither is actually guaranteed to be singular in the real world. Models like Hymba already demonstrate that layers are no longer identical repeats of the previous layer. Quantization already happens at the sub-module level, for the most part, for each layer, but that control does not exist in optimum/peft/transformers. In GPTQModel we already have a dynamic config inside the quantization config that controls:
Whether a module should be quantized at all
Whether a module inherits the global (base) quantize config or overrides it with different config params
Per-module differential quantization relative to the base config
Different quant_linear kernels per module within the same model, since control is now per module, not per model
So in general, we are going to propose and implement (already implemented in GPTQModel) per-module control of quantization and kernel selection; a rough sketch follows below.
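To make the idea concrete, here is a sketch of what a per-module ("dynamic") override on top of a base quantize config could look like, and how a loader might resolve the effective config for each module. The field names, the regex-keyed override mapping, and the resolve_module_config helper are illustrative assumptions, not the exact GPTQModel or optimum API:

```python
import re

# Base config applies model-wide; "dynamic" holds per-module overrides keyed
# by a module-name pattern. A module can inherit the base config, override
# individual params, or opt out of quantization entirely.
base_quant_config = {
    "bits": 4,
    "group_size": 128,
    "sym": True,
    "desc_act": False,
    "dynamic": {
        r"mlp\.down_proj": {"bits": 8, "group_size": 64},  # override base params
        r"lm_head": False,                                  # skip quantization
    },
}

def resolve_module_config(module_name, config):
    """Return the effective quant config for one module, or None to skip it."""
    base = {k: v for k, v in config.items() if k != "dynamic"}
    for pattern, override in config.get("dynamic", {}).items():
        if re.search(pattern, module_name):
            if override is False:
                return None                   # module is not quantized at all
            return {**base, **override}       # per-module override of the base
    return base                               # inherit the global config

print(resolve_module_config("model.layers.0.mlp.down_proj", base_quant_config))
# -> {'bits': 8, 'group_size': 64, 'sym': True, 'desc_act': False}
print(resolve_module_config("lm_head", base_quant_config))
# -> None (module left unquantized)
```

Because the resolved config is per module, the QuantLinear/kernel selection (including the shape checks described in the first post) can also happen per module rather than once for the whole model.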