
GPTQ kernel inference not compatible with some models #2120

Open
2 of 4 tasks
Qubitium opened this issue Dec 7, 2024 · 1 comment
Labels
bug Something isn't working

Comments


Qubitium commented Dec 7, 2024

System Info

Any

Who can help?

@SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

There is an outstanding issue where inference support is not guaranteed even when the selected GPTQ QuantLinear matches the model's bits, sym, group_size, and desc_act properties. The user has no way of knowing until they run the model; even a correctly quantized model can crash at inference time, with little ability to auto-correct.

Bug origin: most GPTQ kernels are written/optimized with specific divisibility requirements on in_features, out_features, and group_size. This has been mostly fixed/bypassed in GPTQModel, and a revised solution is needed for transformers/optimum.
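To make the failure mode concrete, here is a minimal sketch of the kind of shape validation a kernel implicitly depends on. The pack factor of 32 and the group_size rule are illustrative assumptions, not the requirements of any specific kernel:

```python
# A minimal sketch, assuming hypothetical divisor requirements, of the shape
# validation a GPTQ kernel implicitly depends on. The pack factor of 32 and
# the group_size divisibility rule are illustrative, not taken from any
# specific kernel.

def kernel_supports_shape(in_features: int, out_features: int, group_size: int) -> bool:
    """Return True if a (hypothetical) kernel can run a layer of this shape."""
    # Packed int32 storage typically wants feature dims divisible by the pack factor.
    if in_features % 32 != 0 or out_features % 32 != 0:
        return False
    # With grouped quantization, in_features must split into whole groups.
    if group_size != -1 and in_features % group_size != 0:
        return False
    return True

# A model with an "odd" hidden size can quantize fine but fail this check at
# inference time, which is the class of crash described above.
print(kernel_supports_shape(1000, 4096, 128))  # False
print(kernel_supports_shape(4096, 4096, 128))  # True
```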

@SunMarc We plan to submit a PR/proposal to fix this after PR #2064, as our proposed changes would otherwise further complicate/delay that review.

Expected behavior

All GPTQ-quantized models on HF should run without error.

Qubitium added the bug label Dec 7, 2024

Qubitium commented Dec 10, 2024

In addition to the kernel-imposed technical requirements described in the first post, there are assumptions in the current GPTQ code in optimum/transformers that may not make sense moving forward:

  1. Single QuantLinear class for entire model
  2. Single quantization config applied to entire model

Neither is actually guaranteed to be singular in the real world. Models like Hymba already demonstrate that layers are no longer identical repeats of the previous layer. Quantization already happens, for the most part, at the sub-module level within each layer, but that control does not exist in optimum/peft/transformers. In GPTQModel we already have a dynamic config (a config within the config) that controls:

  • Whether a module should be quantized at all
  • Whether a module uses the global quantize config (inherits the base) or overrides it with different config params
  • Per-module differential quantization relative to the base config
  • Different quant_linear kernels per module, since control is now per module, not per model

So, in general, we are going to propose and implement per-module control of quantization and kernel selection, which is already implemented in GPTQModel.
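A minimal sketch of what such a per-module dynamic config could look like, with assumed names; this does not mirror the actual GPTQModel API, it only illustrates the control described in the list above:

```python
# A per-module override config layered on top of a single base quantize config.
# Names and structure are illustrative assumptions for this issue, not the
# real GPTQModel config schema.
import re

base_config = {"bits": 4, "group_size": 128, "sym": True, "desc_act": False}

# Map module-name patterns to either an override dict or False (skip quantization).
dynamic = {
    r".*\.mlp\.down_proj$": {"bits": 8, "group_size": 64},
    r".*\.lm_head$": False,
}

def resolve_config(module_name: str):
    """Return the effective quant config for a module, or None to leave it unquantized."""
    for pattern, override in dynamic.items():
        if re.fullmatch(pattern, module_name):
            return None if override is False else {**base_config, **override}
    return dict(base_config)

print(resolve_config("model.layers.0.mlp.down_proj"))  # base with bits/group_size overridden
print(resolve_config("model.lm_head"))                 # None -> module is not quantized
```

With per-module resolution like this, kernel selection can also happen per module, so a layer whose shape fails one kernel's requirements can fall back to a compatible kernel instead of crashing the whole model.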
