GPTQ support? #26
Comments
Thanks @tigerinus for bringing up this feature request. I think we can enable loading and building engines from TheBloke GPTQ models pretty easily; would that work for you?
That would be perfect. How soon can it happen?
Maybe also support more quantization methods, like AWQ.
@tigerinus @mfuntowicz TheBloke's GPTQ support will be awesome!
Thanks for your comments! Tentatively targeting AWQ/GPTQ from TheBloke on the 🤗 Hub in the next iteration. Stay tuned!
@mfuntowicz it will be awesome! ❤️
Hey-hey @mfuntowicz, is GPTQ still in the plans?
Hello @dimaischenko, I was checking the examples, and if you see this part:

```python
if args.has_quantization_step:
    from optimum.nvidia.quantization import get_default_calibration_dataset

    max_length = min(args.max_prompt_length + args.max_new_tokens, tokenizer.model_max_length)
    calib = get_default_calibration_dataset(args.num_calibration_samples)

    if hasattr(calib, "tokenize"):
        calib.tokenize(tokenizer, max_length=max_length, pad_to_multiple_of=8)

    # Add the quantization step
    builder.with_quantization_profile(args.quantization_config, calib)
```

then it likely supports quantization. So what you can try is this: once the Docker installation is done, run the examples/text-generation/llama.py file, where the model folder should contain the AutoGPTQ files, and hopefully that works. While running, you might also need to pass an additional flag.
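As a quick sanity check before pointing the script at a local folder, you can verify it looks like an AutoGPTQ export. This is only a minimal sketch: it assumes the usual AutoGPTQ/TheBloke layout (a quantize_config.json next to safetensors or .bin weights), and the folder path is a placeholder.

```python
# Sketch: check that a local model folder looks like an AutoGPTQ/TheBloke GPTQ export.
# The path below is a placeholder; adjust it to wherever the checkpoint was downloaded.
from pathlib import Path

model_dir = Path("./Llama-2-13B-GPTQ")  # placeholder path

has_quant_config = (model_dir / "quantize_config.json").exists()
has_weights = any(model_dir.glob("*.safetensors")) or any(model_dir.glob("*.bin"))

print(f"quantize_config.json present: {has_quant_config}")
print(f"weight files present:         {has_weights}")
```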
@Anindyadeep Thank you! I need a little time to figure it out. I'm using the lowest-level generation path of AutoGPTQ:

```python
model = AutoGPTQForCausalLM.from_quantized(model_name, ...)
...
ids = model.prepare_inputs_for_generation(
    batch_input_ids,
    past_key_values=past_key_values,
    attention_mask=attention_mask,
    use_cache=True,
    **model_kwargs,
)
out = model(**ids)
```

I need to find out whether your option is right for me.
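For context, here is a minimal sketch of how those same calls typically sit inside a manual greedy-decoding loop. The checkpoint name, prompt, and token budget are placeholder assumptions for illustration, not taken from this thread.

```python
# Sketch of a manual greedy-decoding loop built around AutoGPTQ's low-level calls.
# Checkpoint name, device, prompt, and max token count are placeholder assumptions.
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Llama-2-13B-GPTQ"  # hypothetical GPTQ checkpoint from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_quantized(model_name, device="cuda:0", use_safetensors=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
past_key_values = None

for _ in range(32):  # generate up to 32 new tokens greedily
    # Prepare inputs for this step (reuses the cached key/values after the first pass).
    ids = model.prepare_inputs_for_generation(
        input_ids,
        past_key_values=past_key_values,
        attention_mask=attention_mask,
        use_cache=True,
    )
    with torch.no_grad():
        out = model(**ids)

    # Greedy pick of the next token, then extend the running sequence and mask.
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    attention_mask = torch.cat(
        [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
    )
    past_key_values = out.past_key_values

    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```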
Awesome, and let me know whether or not it works.
I see this is not closed yet. Is GPTQ still unsupported?
Very new to this library... just need a quick answer on whether GPTQ is going to be supported.
I need to run inference with LLaMA2-13b-GPTQ on an RTX 4060 Ti.