
Question about dequantization overhead #23

Open
DD-DuDa opened this issue Jul 6, 2024 · 3 comments
DD-DuDa commented Jul 6, 2024

Thanks for your great work. I'd like to learn how the dequantization overhead is calculated, as in Figure 18, given that the dequantization process happens within a single kernel.

[image: Figure 18 from the paper]
ys-2020 (Contributor) commented Jul 13, 2024

Hi @DD-DuDa , thank you very much for your interest in QServe. We measured the dequantization overheads of the above kernels directly: we compared the actual throughput of the GEMM kernels with dequantization against variants in which the dequantization ops are skipped. The throughput difference between the two versions of each kernel is taken as the dequantization overhead.
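For anyone who wants to reproduce this kind of ablation, below is a minimal sketch of the methodology, not QServe's actual kernel: the same toy W4 kernel is compiled in two variants via a template flag, one with the unpack-and-dequantize arithmetic and one with it compiled out, and the throughput gap is attributed to dequantization. The kernel body, the group size of 128, the zero point of 8, and the buffer sizes are all illustrative assumptions.

```cuda
// dequant_ablation.cu -- a sketch of measuring dequant overhead by ablation.
// Not QServe's kernel: the toy workload, group size 128, and zero point 8
// are assumptions chosen only to make the example self-contained.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// DO_DEQUANT toggles the dequantization arithmetic at compile time, so both
// variants issue identical memory traffic and differ only in the extra math.
template <bool DO_DEQUANT>
__global__ void w4_kernel(const uint8_t* packed_w, const uint8_t* scales,
                          const int8_t* act, int32_t* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint8_t byte = packed_w[i >> 1];               // two uint4 weights per byte
    int w = (i & 1) ? (byte >> 4) : (byte & 0xF);
    if (DO_DEQUANT)
        w = (w - 8) * scales[i >> 7];              // assumed zp = 8, group = 128
    out[i] = w * act[i];
}

template <bool DO_DEQUANT>
float time_variant(const uint8_t* w, const uint8_t* s, const int8_t* a,
                   int32_t* o, int n, int reps = 100) {
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    cudaEventRecord(beg);
    for (int r = 0; r < reps; ++r)
        w4_kernel<DO_DEQUANT><<<(n + 255) / 256, 256>>>(w, s, a, o, n);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, beg, end);
    return ms;
}

int main() {
    const int n = 1 << 24;
    uint8_t *w, *s;  int8_t *a;  int32_t *o;
    cudaMalloc(&w, n / 2);                // packed uint4 weights
    cudaMalloc(&s, n / 128);              // one uint8 scale per group of 128
    cudaMalloc(&a, n);                    // int8 activations
    cudaMalloc(&o, n * sizeof(int32_t));  // contents irrelevant: timing only
    float with_dq = time_variant<true>(w, s, a, o, n);
    float no_dq   = time_variant<false>(w, s, a, o, n);
    printf("with dequant: %.3f ms  without: %.3f ms  overhead: %.1f%%\n",
           with_dq, no_dq, 100.f * (with_dq - no_dq) / with_dq);
    cudaFree(w); cudaFree(s); cudaFree(a); cudaFree(o);
    return 0;
}
```

Compiling the flag away (rather than branching at runtime) matters here: it keeps the two variants' memory traffic and launch configuration identical, so the timing delta isolates the dequantization arithmetic itself.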

DD-DuDa (Author) commented Jul 14, 2024

Got it! Thank you for your response!


brisker commented Jul 29, 2024

@ys-2020
[screenshot: formula (5) from the paper]
In formula (5) in your paper, why is the per-group scale uint8? How can (uint4 - uint4) multiplied by uint8 still be sint8?
Is that a typo? This is quite confusing. (In my understanding, the per-group scale should also be 4-bit to produce an sint8 weight.)
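As a quick sanity check of the ranges as literally written (this only restates the arithmetic behind the question; it says nothing about how the paper actually constrains the scale):

```cuda
// range_check.cu -- host-only arithmetic on the bit widths in formula (5).
#include <cstdio>

int main() {
    const int uint4_max = 15;                    // uint4 spans [0, 15]
    const int uint8_max = 255;                   // uint8 spans [0, 255]
    const int diff_max  = uint4_max - 0;         // (w_uint4 - z_uint4) spans [-15, 15]
    const int prod_max  = diff_max * uint8_max;  // 15 * 255 = 3825
    printf("max |(w - z) * s| = %d, but sint8 tops out at 127\n", prod_max);
    return 0;
}
```

So with a full-range uint8 scale the product can reach 3825, far outside sint8's [-128, 127], which is exactly why the widths look inconsistent at face value.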
