
[INTEGRATION] Expose stable kernel/packing/repacking apis #726

Open
wenhuach21 opened this issue Dec 3, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@wenhuach21

When installing via pip, the marlin kernel could not be loaded:

ValueError: Trying to use the marlin backend, but could not import the C++/CUDA dependencies with the following error: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/wenhuach/anaconda3/envs/autoround/lib/python3.10/site-packages/gptqmodel_marlin_cuda_inference.cpython-310-x86_64-linux-gnu.so)

When installing from source:

[screenshot of source build error]

@wenhuach21 wenhuach21 added the bug Something isn't working label Dec 3, 2024
@Qubitium
Collaborator

Qubitium commented Dec 3, 2024

@wenhuach21 It appears there are two issues.

  1. Pip install failed. Can you show the stacktrace for the pip-installed marlin error? It may be caused by our cached prebuilt whl. We also need your Linux OS version, kernel, and libc/glibc versions (see the sketch below for one way to collect these).

  2. Source build error. Can you confirm which commit or release tag you are using for the source install?
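
For reference, a minimal Python sketch (standard library only; collect_env_info is a hypothetical helper, not part of GPTQModel) that gathers the requested details:

```python
import platform

# Hypothetical helper: gather the OS, kernel, and glibc details requested above.
def collect_env_info() -> dict:
    libc_name, libc_version = platform.libc_ver()  # e.g. ("glibc", "2.31")
    return {
        "os": platform.platform(),         # full platform string, incl. distro/kernel
        "kernel": platform.release(),      # e.g. "5.15.0-91-generic"
        "libc": f"{libc_name} {libc_version}",
        "python": platform.python_version(),
    }

if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```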

Thanks. @CSY-ModelCloud

@CSY-ModelCloud
Member

We have renamed gptqmodel_marlin_cuda_inference. Can you pull the latest code and delete the build dir? Then pip install it.

@wenhuach21
Author

wenhuach21 commented Dec 3, 2024

Got it. It would be beneficial for GPTQModel to provide a backward-compatible API for layer packing and repacking, accommodating both the original AutoGPTQ linear layer and your/AutoRound's fixed zero-point layer in future implementations. This would let AutoRound rely seamlessly on your CUDA kernels for Marlin, asymmetric quantization, and other operations.

@Qubitium
Collaborator

Qubitium commented Dec 3, 2024

> Got it. It would be beneficial for GPTQModel to provide a backward-compatible API for layer packing and repacking, accommodating both the original AutoGPTQ linear layer and your/AutoRound's fixed zero-point layer in future implementations. This would let AutoRound rely seamlessly on your CUDA kernels for Marlin, asymmetric quantization, and other operations.

We are adding hf_select_quant_linear as an external API for the HF/optimum repos. Can AutoRound use this? The API will stabilize later today/tonight.

Tracking PR: #713

The code is not ready; we are still finalizing it. The above PR holds links to the hf/optimum PRs that will be submitted upstream.
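
For context, a hedged sketch of how a downstream caller might use it; hf_select_quant_linear is the API named above, but the import path and parameter names shown here are assumptions, not the finalized signature from PR #713:

```python
# Assumed import path and parameters -- verify against PR #713 before relying on this.
from gptqmodel.utils import hf_select_quant_linear

# Select the kernel-backed QuantLinear class for a given quantization config.
QuantLinear = hf_select_quant_linear(
    bits=4,
    group_size=128,
    desc_act=False,
    sym=True,
    device_map="auto",
)
```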

@Qubitium
Collaborator

Qubitium commented Dec 3, 2024

[1-3] https://github.com/ModelCloud/GPTQModel/pull/727/files

We will expose the 3 hf_-prefixed methods as a stable API to hf/optimum. There may still be changes; WIP.

Correction: 4 hf_ methods.

[4] https://github.com/ModelCloud/GPTQModel/pull/728/files

@wenhuach21
Author

Thanks for the info. However, this may not help on our side; we need layer-wise packing and repacking, as AutoRound may support mixed bits or mixed group sizes.

@Qubitium
Collaborator

Qubitium commented Dec 4, 2024

@wenhuach21 We are currently refactoring and making sure gptqmodel is correctly integrated into transformers/optimum/peft.

Can you list the exact APIs you want? Feel free to free-form and imagine any/all APIs you desire so that AutoRound can work with our kernels.

API stability can be enforced by locking the package dependency to a specific release, as we can't promise internal APIs will always be stable.

Let me know in detail, preferably with pseudo code to illustrate the usage, so I can visualize actual usage scenarios. Be as detailed as possible.

@Qubitium Qubitium changed the title [Question] install issue [Feature] Expose stable kernel/packing/repacking apis Dec 4, 2024
@Qubitium Qubitium changed the title [Feature] Expose stable kernel/packing/repacking apis [INTEGRATION] Expose stable kernel/packing/repacking apis Dec 5, 2024
@Qubitium
Collaborator

@wenhuach21 Our refactor is complete and we are preparing for the transformers/optimum/peft upstream PRs to be merged and integrated.

Now is a good time to review exactly what you and the intel/auto-round team need from us explicitly at code level. Please provide us with detailed examples (pseudo code is okay) showing which APIs we need to expose.

@wenhuach21
Author

wenhuach21 commented Dec 10, 2024

> @wenhuach21 Our refactor is complete and we are preparing for the transformers/optimum/peft upstream PRs to be merged and integrated.
>
> Now is a good time to review exactly what you and the intel/auto-round team need from us explicitly at code level. Please provide us with detailed examples (pseudo code is okay) showing which APIs we need to expose.

Sorry for the delayed response. At the moment, the following come to mind, since we want to support mixed-bits quantization later (see the sketch after this list):

Symmetric Quantization

layer.pack(xxx, backend="marlin") ## Packs the layer using the specified format. Actually, WrapperLinear is OK if there is no big change in the future.
check_packing_feasibility(xxx, backend) ## Checks whether the layer and its quantization config can be packed with the specified backend.
check_best_packing_format(xxx, target_device="cuda") ## Returns the best-performing format in your repository for the specified bit-width and group size.

Asymmetric Quantization

Since there are differences in the zero-point (zp) settings, the API should include additional arguments to reflect these variations appropriately.
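
To make the ask concrete, a hedged sketch of these proposed APIs; every name below is hypothetical (nothing here exists in GPTQModel yet) and only illustrates layer-wise usage for mixed-bits models:

```python
import torch.nn as nn

def check_packing_feasibility(layer: nn.Linear, bits: int, group_size: int,
                              sym: bool, backend: str = "marlin") -> bool:
    """Can this layer + quant config be packed with the given backend?"""
    # Illustrative constraint only: e.g. Marlin-style kernels want 4-bit,
    # symmetric quantization with group_size in {-1, 128}.
    return backend == "marlin" and bits == 4 and sym and group_size in (-1, 128)

def check_best_packing_format(bits: int, group_size: int,
                              target_device: str = "cuda") -> str:
    """Return the best-performing packing format for this config."""
    if target_device == "cuda" and bits == 4:
        return "marlin"
    return "torch"  # generic fallback format

def pack_model_layerwise(model: nn.Module, per_layer_config: dict) -> None:
    """Mixed-bits packing: each layer may carry its own bits/group_size."""
    for name, layer in model.named_modules():
        cfg = per_layer_config.get(name)
        if cfg is None or not isinstance(layer, nn.Linear):
            continue
        if check_packing_feasibility(layer, cfg["bits"], cfg["group_size"], cfg["sym"]):
            fmt = check_best_packing_format(cfg["bits"], cfg["group_size"])
            # layer.pack(...) is the proposed per-layer API from the list above.
            layer.pack(cfg["qweight"], cfg["scales"], cfg["zeros"], backend=fmt)
```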

@wenhuach21
Author

Sorry, I forgot the repacking API and the post_init API.

@Qubitium
Collaborator

@wenhuach21 Feel free to open a WIP PR and make core changes as you see fit. I can monitor, and we can also connect on Teams to smooth out ideas. The only things I would require are below:

  1. If the API is exposed externally, add an hf_ prefix to the API name. We are doing this for the stable APIs exposed to transformers/optimum/peft and would follow the same principle here for the layer/packing/repacking changes. So internally it can be def pack, but if auto_round wants to call it, we would expose def hf_pack, which may well be just a wrapper around pack. I would still require the wrapper so we can stay backward compatible and keep the API stable (see the sketch after this list).

  2. Add unit tests for the new hf_ external APIs.
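
A minimal sketch of that wrapper convention; hf_pack is hypothetical here and only mirrors the naming rule described above:

```python
import torch.nn as nn

class QuantLinear(nn.Module):
    def pack(self, linear: nn.Linear, scales, zeros, g_idx=None):
        # Internal implementation: free to change between releases.
        ...

    def hf_pack(self, linear: nn.Linear, scales, zeros, g_idx=None):
        """Stable external entry point: a thin wrapper around the internal
        pack() so downstream callers (e.g. AutoRound) keep working even if
        internals change."""
        return self.pack(linear, scales, zeros, g_idx=g_idx)
```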
