
The embedding_4bit implementation has a strong assumption about weight_quant_min and weight_quant_max #5038

Open
junpeiz opened this issue Sep 3, 2024 · 1 comment

Comments


junpeiz commented Sep 3, 2024

🐛 Describe the bug

The embedding_4bit implementation here assumes the quantized data has `quant_min=-8`, so it shifts the data with `weight = weight.view(torch.int8).add(-8)`.

However, the shift should instead be based on the passed-in `weight_quant_min`, i.e. `weight = weight.view(torch.int8).add(weight_quant_min)`.

Fixing this also requires that the op construction pass the correct min and max. The llama example here passes 0 for both `weight_quant_min` and `weight_quant_max`, which is incorrect. The pipeline currently works only because the weight is quantized to the range [-8, 7] and embedding_4bit assumes `quant_min=-8`; it will fail if the weight is instead quantized to [0, 15].
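To illustrate the fix, here is a minimal sketch of 4-bit dequantization that shifts by the passed-in `weight_quant_min` instead of a hardcoded -8. This is not the actual ExecuTorch kernel; the nibble packing order and per-row scales are assumptions for the example.

```python
import torch

def dequantize_4bit_embedding_weight(packed, scales, weight_quant_min, weight_quant_max):
    """Unpack a uint8 tensor holding two 4-bit values per byte and dequantize.

    The packing layout and per-row scales here are illustrative, not the
    actual ExecuTorch packing.
    """
    # Split each byte into its two nibbles; both land in [0, 15].
    high = (packed >> 4) & 0x0F
    low = packed & 0x0F
    weight = torch.stack([high, low], dim=-1).reshape(packed.shape[0], -1)

    # Buggy: weight.view(torch.int8).add(-8) hardcodes quant_min = -8.
    # Fixed: shift by the actual quant_min, so a weight quantized to
    # [0, 15] (weight_quant_min == 0) dequantizes correctly as well.
    weight = weight.view(torch.int8).add(weight_quant_min)

    # Range check against the passed-in bounds rather than assumed ones.
    assert int(weight.min()) >= weight_quant_min
    assert int(weight.max()) <= weight_quant_max

    # Per-row scale (assumed): broadcast across each row.
    return weight.to(scales.dtype) * scales.unsqueeze(-1)
```

With `weight_quant_min=0` this leaves the unpacked codes unshifted, whereas the hardcoded -8 would silently corrupt any weight quantized to [0, 15].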

Versions

Main branch

@JacobSzwejbka (Contributor) commented

@jerryzh168 @kimishpatel
