The embedding_4bit implementation has a strong assumption about weight_quant_min and weight_quant_max #5038
🐛 Describe the bug
In the `embedding_4bit` implementation here, the code assumes the quantized data has `quant_min=-8`, so it uses `weight = weight.view(torch.int8).add(-8)` to shift the data. However, the shift should be based on `weight_quant_min`, i.e. `weight = weight.view(torch.int8).add(weight_quant_min)`.

Fixing this also requires that the op construction passes the correct min and max. In the llama example here, it passes `0` for both `weight_quant_min` and `weight_quant_max`, which is incorrect. The whole pipeline currently works only because the weight is quantized to `[-8, 7]` and `embedding_4bit` assumes `quant_min=-8`. It will fail if the weight is actually quantized to `[0, 15]`.
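For illustration, here is a minimal sketch of dequantization that shifts by `weight_quant_min` instead of a hardcoded `-8`. This is not the actual op implementation; the function name, nibble packing order, and per-row scale layout are assumptions made for the example.

```python
import torch

def dequantize_4bit(weight_packed, scales, weight_quant_min, weight_quant_max):
    """Sketch only: unpack two 4-bit values per byte (high nibble first) and
    shift by weight_quant_min. weight_quant_max is kept to mirror the op
    signature but is unused here."""
    w = weight_packed.view(torch.uint8)
    high = (w >> 4).to(torch.int8)
    low = (w & 0x0F).to(torch.int8)
    # Interleave the nibbles back into the original column order.
    unpacked = torch.stack([high, low], dim=-1).reshape(w.shape[0], -1)
    # Proposed fix: shift by weight_quant_min so both [-8, 7] and [0, 15]
    # quantization ranges are handled, instead of a hardcoded .add(-8).
    unpacked = unpacked.add(weight_quant_min)
    return unpacked.to(scales.dtype) * scales

# Weight quantized to [0, 15]: the current .add(-8) would corrupt these values.
packed = torch.tensor([[0x0F, 0x23]], dtype=torch.uint8)  # nibbles 0, 15, 2, 3
scales = torch.tensor([[0.5]])
print(dequantize_4bit(packed, scales, weight_quant_min=0, weight_quant_max=15))
# tensor([[0.0000, 7.5000, 1.0000, 1.5000]])
```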
Versions
Main branch