
GPTQ support? #26

Open
tigerinus opened this issue Dec 7, 2023 · 11 comments
Assignees: mfuntowicz
Labels: enhancement (New feature or request)
Milestone: 0.1.0b3

@tigerinus

Very new to this library... just need a quick answer on whether GPTQ is going to be supported.

I need to run inference with LLaMA2-13b-GPTQ on an RTX 4060 Ti.

@mfuntowicz mfuntowicz added the enhancement New feature or request label Dec 7, 2023
@mfuntowicz mfuntowicz self-assigned this Dec 7, 2023
@mfuntowicz
Member

Thanks @tigerinus for bringing up this feature request.

I think we can enable loading and building engines from TheBloke GPTQ models pretty easily, would that work for you?

@tigerinus
Author

> Thanks @tigerinus for bringing up this feature request.
>
> I think we can enable loading and building engines from TheBloke GPTQ models pretty easily, would that work for you?

That would be perfect. How soon can it happen?

@heurainbow

Maybe also support more quantization methods, like AWQ?

@dimaischenko

dimaischenko commented Dec 11, 2023

> I think we can enable loading and building engines from TheBloke GPTQ models pretty easily, would that work for you?

@tigerinus @mfuntowicz TheBloke's GPTQ support will be awesome!

@mfuntowicz
Member

mfuntowicz commented Dec 18, 2023

Thanks for your comments! Tentatively targeting AWQ/GPTQ from TheBloke on the 🤗 Hub in the next iteration (i.e., 0.1.0b3), which should happen around next week.

Stay tuned!

@dimaischenko

> Thanks for your comments! Tentatively targeting AWQ/GPTQ from TheBloke on the 🤗 Hub in the next iteration (i.e., 0.1.0b3), which should happen around next week.
>
> Stay tuned!

@mfuntowicz it will be awesome! ❤️

@mfuntowicz mfuntowicz added this to the 0.1.0b3 milestone Dec 21, 2023
@dimaischenko

Hey-hey @mfuntowicz, is GPTQ still in the plans?

@Anindyadeep

Anindyadeep commented Jan 13, 2024

Hello @dimaischenko, I was checking the examples, and if you see this part:

    if args.has_quantization_step:
        from optimum.nvidia.quantization import get_default_calibration_dataset

        max_length = min(args.max_prompt_length + args.max_new_tokens, tokenizer.model_max_length)
        calib = get_default_calibration_dataset(args.num_calibration_samples)

        if hasattr(calib, "tokenize"):
            calib.tokenize(tokenizer, max_length=max_length, pad_to_multiple_of=8)

        # Add the quantization step
        builder.with_quantization_profile(args.quantization_config, calib)

https://github.com/huggingface/optimum-nvidia/blame/c5301b3a0debe4a852e1a11e460e76b638f59312/examples/text-generation/llama.py#L88-L101

Then it likely supports quantization. So what you can try is: once the Docker installation is done, run the examples/text-generation/llama.py file with a model folder containing the AutoGPTQ files, and hopefully that will work. While running, you may need to pass the --has_quantization_step flag to enable that code path; a rough example invocation is sketched below.
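
For concreteness, here is a rough sketch of what the invocation might look like. Only the script path and the --has_quantization_step flag come from the comment above; the model id is just an example and any other arguments are guesses, so check the script's --help output for the actual interface.

# Hypothetical invocation; only --has_quantization_step is confirmed above.
# The model id is an example; run `python examples/text-generation/llama.py --help`
# to see the real positional and optional arguments.
python examples/text-generation/llama.py \
    TheBloke/Llama-2-13B-GPTQ \
    --has_quantization_step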

@dimaischenko

@Anindyadeep Thank you! I need a little time to figure it out. I'm using the lowest-level generation API of AutoGPTQForCausalLM right now. Something like:

from auto_gptq import AutoGPTQForCausalLM

# Load the GPTQ-quantized checkpoint
model = AutoGPTQForCausalLM.from_quantized(model_name, ...)

...

# Manually build the model inputs, carrying the KV cache between steps
ids = model.prepare_inputs_for_generation(
    batch_input_ids,
    past_key_values=past_key_values,
    attention_mask=attention_mask,
    use_cache=True,
    **model_kwargs)

# Single forward pass; logits and updated past_key_values come back in `out`
out = model(**ids)

I need to find out if your option is right for me.
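
For reference, a minimal sketch of the higher-level auto-gptq path, assuming a TheBloke-style GPTQ checkpoint (the model id below is just an example; substitute your own). It uses the standard generate() API instead of the manual prepare_inputs_for_generation loop:

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-13B-GPTQ"  # example checkpoint; substitute your own

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",        # single-GPU placement
    use_safetensors=True,   # TheBloke checkpoints ship safetensors weights
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

If this fits your use case it may be simpler than driving prepare_inputs_for_generation by hand, at the cost of less control over the KV cache.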

@Anindyadeep

> @Anindyadeep Thank you! I need a little time to figure it out. I'm using the lowest-level generation API of AutoGPTQForCausalLM right now. Something like:
>
> model = AutoGPTQForCausalLM.from_quantized(model_name, ...)
>
> ...
>
> ids = model.prepare_inputs_for_generation(
>     batch_input_ids,
>     past_key_values=past_key_values,
>     attention_mask=attention_mask,
>     use_cache=True,
>     **model_kwargs)
>
> out = model(**ids)
>
> I need to find out if your option is right for me.

Awesome, and please let me know whether it works or not.

@tigerinus
Author

I see this is still open. Is GPTQ still not supported?
