Replace auto_gptq by gptqmodel in HuggingFace/Optimum #536
Version 1.2 with IPEX should be released within the next 24 hours, after I merge some PR changes that will affect and simplify the core API for end users when loading and saving models. v1.2 should be stable enough for us to move forward with the optimum PR. |
Great, I only left some minor fixes for the examples; please merge #540. |
v1.2.1 released. We now need to map out which code and features in optimum and transformers depend on the old auto-gptq so we can create a to-do list and check off each one. |
The core function is here: huggingface/optimum/blob/main/optimum/gptq/quantizer.py. The others are mostly library checks or guidance in the README or code comments. |
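As a starting point for that to-do list, the dependency scan could be sketched roughly as below. This is only an illustration using the Python standard library: the function name `find_imports` and the idea of scanning a local checkout are assumptions, not part of optimum, transformers, or gptqmodel. It finds import statements only, so the library checks, README guidance, and code comments mentioned above would still need a plain text search.

```python
import ast
from pathlib import Path


def find_imports(root, package="auto_gptq"):
    """Return sorted (path, line, module) tuples for imports of `package` under `root`.

    Hypothetical helper for illustration; it only detects `import` and
    `from ... import` statements, not string-based library checks.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            for name in names:
                # Match the package itself or any of its submodules
                if name == package or name.startswith(package + "."):
                    hits.append((str(path), node.lineno, name))
    return sorted(hits)


if __name__ == "__main__":
    # Assumes the current directory is a local checkout of the repo to audit
    for path, line, module in find_imports("."):
        print(f"{path}:{line}: {module}")
```

Running this once against an optimum checkout and once against a transformers checkout would give a first list of call sites to check off.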
@jiqing-feng transformers calls optimum, so we need to PR both at the same time. We have another issue, which is hf gptq loading code in We are looking at this right now and planning which code we need to change first in gptqmodel so it can adapt to any changes in transformers/optimum. |
Actually, for IPEX, we definitely need to rewrite the quantization config so we can use our IPEX API. The IPEX API adapted the original GPTQ weight format even if you quantize the model in the cuda backend. If this is not easy to follow, we can discuss it in a Teams meeting if that is convenient for you; please give me your email and your available time slots. |
We have identified a problem with hf transformer, and our gptqmodel code too, in which the separation of quantization temp For example, I plan to address this part in our PRs.
Can you give me a code example of where IPEX would need to alter the persistent quantized config attributes post-quantization (as it relates to the quantization_config that persists in the json file or config.json)? One example will help a lot to see where IPEX's use case is coming from. Thanks.
You can email me at qubitium@modelcloud.ai and my time is pretty flexible. |
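To make the question about persistent versus temporary attributes concrete, here is a minimal sketch of one way such a separation could look. It is not the transformers, optimum, or gptqmodel implementation: the class `SketchQuantConfig`, the `backend` field, and the `persist` metadata flag are all hypothetical. Only `bits`, `group_size`, and `desc_act` are borrowed from the usual GPTQ config fields.

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class SketchQuantConfig:
    """Hypothetical config that separates persisted from runtime-only attributes."""

    # Persistent attributes: they describe the stored weights and belong in config.json
    bits: int = 4
    group_size: int = 128
    desc_act: bool = False
    # Runtime-only attribute (assumed name): selects a kernel at load time
    # and should not be written back to disk
    backend: str = field(default="auto", metadata={"persist": False})

    def to_persisted_dict(self):
        """Return only the attributes that should be saved with the model."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.metadata.get("persist", True)
        }


if __name__ == "__main__":
    cfg = SketchQuantConfig(backend="ipex")
    # The runtime choice of backend does not leak into the saved config
    print(json.dumps(cfg.to_persisted_dict(), indent=2))
```

Under this kind of split, a CPU backend could be selected at load time without changing what is stored in the json file.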
I will take AWQ as an example because it's already integrated into transformers. Please install transformers and AutoAWQ from the main repo, and run the following script on an Intel Xeon CPU. If you don't have such a device, I will show this case in our meeting, maybe at 2pm Beijing time tomorrow (11/15)?
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AwqConfig
model_id = "PrunaAI/JackFram-llama-68m-AWQ-4bit-smashed"
text = ["I am happy because", "This is"]
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
input_ids = tokenizer(text, return_tensors="pt", padding=True)
quantization_config = AwqConfig(version="ipex")
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu", quantization_config=quantization_config)
model.generation_config.cache_implementation = "static"
model.generate(**input_ids) |
And GPTQModel equivalent, if we change transformer code, would be
We only have a consumer Intel 13th gen, an EPYC 7003 (Zen 3), and a 7950X (Zen 4 desktop); the latter two both have AVX512. Which Intel instructions does IPEX require?
Sure. Please email me your contacts and we can take from there. |
Going to list the issues/diffs that we found here (will update as more are found). REF: the first PR in which AutoGPTQ was partially merged into optimum: huggingface/optimum#1216
Kernels:
AutoGPTQ has: Cuda/Packer, Triton v1/Packer, Triton v2/Packer, Exllama v1/Packer, Exllama v2/(no packer), Marlin/(Marlin packer)
GPTQModel has: Triton v2/Packer, Exllama v2, NM Marlin/(Marlin packer)
We need to retest cuda vs. triton v2 to see which is faster for quantizing and packing, including with torch.compile() in torch 2.5.1, since we need to re-add this kernel for hf/optimum compatibility. Unsure whether they will accept another dependency (Triton).
History:
Cuda kernel: With torch 2.5.1 changes, it may be faster or as fast as Triton v2. Again, we need to test now since optimum relies on cuda kernel by default. |
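The retest described above could be organized with a small timing harness like the following sketch. It uses only the standard library, and the names `benchmark`, `fastest`, and the stand-in workloads are hypothetical; the real candidates would be the cuda and triton v2 quantize-and-pack paths, which are not shown here. For GPU kernels, each callable would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before returning, otherwise only the launch time is measured.

```python
import time


def benchmark(candidates, repeat=5):
    """Return {name: best wall-clock seconds} over `repeat` timed runs of each callable."""
    results = {}
    for name, fn in candidates.items():
        fn()  # warm-up run, excluded from timing (covers lazy init and compilation)
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


def fastest(results):
    """Return the name of the candidate with the lowest best time."""
    return min(results, key=results.get)


if __name__ == "__main__":
    # Stand-in CPU workloads; replace with the real kernel calls being compared
    candidates = {
        "kernel_a": lambda: sum(i * i for i in range(50_000)),
        "kernel_b": lambda: sum(i * i for i in range(100_000)),
    }
    results = benchmark(candidates)
    for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds * 1e3:.3f} ms")
    print("fastest:", fastest(results))
```

Taking the best of several runs after a warm-up keeps one-time costs such as torch.compile() compilation out of the comparison; those costs could be timed separately if they matter for the quantization workflow.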
The cuda kernel has been re-added pending optimum compat. |
@jiqing-feng Instead of optimum calling gptqmodel internal api such as This will introduce separation of optimum hooking into internal apis that we may change in the future while at the same time improving the api stability of these separate projects. |
Discussion merged into #729 |
Hi @Qubitium. Since the CPU path is already in gptqmodel, when do you plan to replace auto_gptq with gptqmodel in HuggingFace/optimum? I think we can open an issue in Optimum to let the maintainers know as early as possible.
Please let me know if there is anything I can do to move this forward. Thx.