
The embedding_4bit implementation has a strong assumption about weight_quant_min and weight_quant_max #5038

Open
junpeiz opened this issue Sep 3, 2024 · 1 comment

Comments


junpeiz commented Sep 3, 2024

🐛 Describe the bug

The embedding_4bit implementation here assumes the quantized data has `quant_min=-8`, so it shifts the data with `weight = weight.view(torch.int8).add(-8)`.

However, the shift should instead be based on the passed-in `weight_quant_min`, i.e. `weight = weight.view(torch.int8).add(weight_quant_min)`.

Fixing this also requires that the op construction pass the correct min and max. The llama example here passes 0 for both `weight_quant_min` and `weight_quant_max`, which is incorrect. The pipeline currently works only because the weight is quantized to the range [-8, 7] and embedding_4bit assumes `quant_min=-8`; it will fail if the weight is instead quantized to [0, 15].
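To illustrate the fix, here is a minimal sketch of 4-bit dequantization that shifts by the passed-in `weight_quant_min` instead of a hardcoded -8. This is not the actual ExecuTorch kernel; the nibble packing order and per-row scales are assumptions for the example.

```python
import torch

def dequantize_4bit_embedding_weight(packed, scales, weight_quant_min, weight_quant_max):
    """Unpack a uint8 tensor holding two 4-bit values per byte and dequantize.

    The packing layout and per-row scales here are illustrative, not the
    actual ExecuTorch packing.
    """
    # Split each byte into its two nibbles; both land in [0, 15].
    high = (packed >> 4) & 0x0F
    low = packed & 0x0F
    weight = torch.stack([high, low], dim=-1).reshape(packed.shape[0], -1)

    # Buggy: weight.view(torch.int8).add(-8) hardcodes quant_min = -8.
    # Fixed: shift by the actual quant_min, so a weight quantized to
    # [0, 15] (weight_quant_min == 0) dequantizes correctly as well.
    weight = weight.view(torch.int8).add(weight_quant_min)

    # Range check against the passed-in bounds rather than assumed ones.
    assert int(weight.min()) >= weight_quant_min
    assert int(weight.max()) <= weight_quant_max

    # Per-row scale (assumed): broadcast across each row.
    return weight.to(scales.dtype) * scales.unsqueeze(-1)
```

With `weight_quant_min=0` this leaves the unpacked codes unshifted, whereas the hardcoded -8 would silently corrupt any weight quantized to [0, 15].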

Versions

Main branch

@JacobSzwejbka (Contributor) commented

@jerryzh168 @kimishpatel
