
GPTQ kernel inference not compatible with some models #2120

Open
2 of 4 tasks
Qubitium opened this issue Dec 7, 2024 · 1 comment
Labels
bug Something isn't working

Comments


Qubitium commented Dec 7, 2024

System Info

Any

Who can help?

@SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

There is an outstanding issue where inference support is not guaranteed even when the selected GPTQ QuantLinear matches the model's bits, sym, group_size, and desc_act properties. The user has no way of knowing until they run the model; even a correctly quantized model can crash at inference time, with little ability to auto-correct.

Bug origin: most GPTQ kernels are written/optimized with specific divisibility requirements on in_features, out_features, and group_size. This has been mostly fixed/bypassed in GPTQModel, and a revised solution is needed for transformers/optimum.
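To make the failure mode concrete, here is a minimal sketch of the kind of shape validation a kernel implicitly depends on. The pack factor of 32 and the group_size rule are illustrative assumptions, not the requirements of any specific kernel:

```python
# A minimal sketch, assuming hypothetical divisor requirements, of the shape
# validation a GPTQ kernel implicitly depends on. The pack factor of 32 and
# the group_size divisibility rule are illustrative, not taken from any
# specific kernel.

def kernel_supports_shape(in_features: int, out_features: int, group_size: int) -> bool:
    """Return True if a (hypothetical) kernel can run a layer of this shape."""
    # Packed int32 storage typically wants feature dims divisible by the pack factor.
    if in_features % 32 != 0 or out_features % 32 != 0:
        return False
    # With grouped quantization, in_features must split into whole groups.
    if group_size != -1 and in_features % group_size != 0:
        return False
    return True

# A model with an "odd" hidden size can quantize fine but fail this check at
# inference time, which is the class of crash described above.
print(kernel_supports_shape(1000, 4096, 128))  # False
print(kernel_supports_shape(4096, 4096, 128))  # True
```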

@SunMarc We plan to submit a PR/proposal to fix this after PR #2064, as our proposed changes would otherwise further complicate/delay that review.

Expected behavior

All GPTQ-quantized models on HF should run without error.

Qubitium added the bug label Dec 7, 2024

Qubitium commented Dec 10, 2024

In addition to the kernel-imposed technical requirements described in the first post, there are assumptions in the current GPTQ code in optimum/transformers that may not make sense moving forward:

  1. Single QuantLinear class for entire model
  2. Single quantization config applied to entire model

Neither is actually guaranteed to be singular in the real world. Models like Hymba already demonstrate that layers are no longer identical repeats of the previous layer. Quantization already happens, for the most part, at the sub-module level within each layer, but that control does not exist in optimum/peft/transformers. In GPTQModel we already have a dynamic config (a config within the config) that controls:

  • Whether a module should be quantized at all
  • Whether a module uses the global quantize config (inherits the base) or overrides it with different config params
  • Per-module differential quantization relative to the base config
  • Different quant_linear kernels per module, since control is now per module, not per model

So, in general, we are going to propose and implement per-module control of quantization and kernel selection, which is already implemented in GPTQModel.
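A minimal sketch of what such a per-module dynamic config could look like, with assumed names; this does not mirror the actual GPTQModel API, it only illustrates the control described in the list above:

```python
# A per-module override config layered on top of a single base quantize config.
# Names and structure are illustrative assumptions for this issue, not the
# real GPTQModel config schema.
import re

base_config = {"bits": 4, "group_size": 128, "sym": True, "desc_act": False}

# Map module-name patterns to either an override dict or False (skip quantization).
dynamic = {
    r".*\.mlp\.down_proj$": {"bits": 8, "group_size": 64},
    r".*\.lm_head$": False,
}

def resolve_config(module_name: str):
    """Return the effective quant config for a module, or None to leave it unquantized."""
    for pattern, override in dynamic.items():
        if re.fullmatch(pattern, module_name):
            return None if override is False else {**base_config, **override}
    return dict(base_config)

print(resolve_config("model.layers.0.mlp.down_proj"))  # base with bits/group_size overridden
print(resolve_config("model.lm_head"))                 # None -> module is not quantized
```

With per-module resolution like this, kernel selection can also happen per module, so a layer whose shape fails one kernel's requirements can fall back to a compatible kernel instead of crashing the whole model.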
