-
An automated per-model strategy for distributing the bits would be great to have. I am not sure of the best way to achieve it. At some point I was thinking about a tool that compares the activations per layer and applies some optimization strategy to improve the distribution of bits (#2783). We now have the tools (e.g. …
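A minimal sketch of what the back end of such a tool could look like, under stated assumptions: `layer_error` is a hypothetical per-layer sensitivity metric (e.g. activation RMSE between fp16 and quantized runs on a calibration set), and extra bits are handed out greedily to the most sensitive layers under a fixed budget. This is an illustration of the idea, not llama.cpp code:

```cpp
// Sketch: greedy per-layer bit allocation from a per-layer error metric.
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>

int main() {
    // Hypothetical per-layer sensitivities, e.g. obtained by comparing
    // fp16 vs quantized activations on a calibration set.
    std::vector<double> layer_error = {0.9, 0.7, 0.3, 0.2, 0.2, 0.3, 0.6};
    const int n_layers   = (int) layer_error.size();
    const int budget     = 3;                 // how many layers may be upgraded
    std::vector<int> bits(n_layers, 4);       // every layer starts at Q4

    // Max-heap of (error, layer): upgrade the most sensitive layers first.
    std::priority_queue<std::pair<double,int>> pq;
    for (int i = 0; i < n_layers; ++i) pq.push({layer_error[i], i});
    for (int k = 0; k < budget && !pq.empty(); ++k) {
        bits[pq.top().second] = 6;            // bump the worst offender to Q6
        pq.pop();
    }
    for (int i = 0; i < n_layers; ++i)
        printf("layer %d -> Q%d_K (err %.2f)\n", i, bits[i], layer_error[i]);
    return 0;
}
```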
-
I wonder whether the `use_more_bit` strategy of giving different layers different bit widths, inspired by Llama-v1, is still the better policy for Llama-v3. For Llama-3-8B, my experiment compares three different plans, with the token embedding at Q4_K and the LM head at Q6_K:
Observations:
1. The first few layers are important for generation quality.
2. The last few layers, and the selection of a few layers at a fixed jump, may not be as important for generation quality as the `use_more_bit` strategy originally assumed.

Maybe we need mixed-precision insights tailored to each LLM family rather than `use_more_bit` alone, or some toolkits for quantization sensitivity analysis.
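For context, the heuristic being questioned is, as far as I recall, shaped like the following (a paraphrase of llama.cpp's `use_more_bits` helper; consult the current source for the authoritative version). It gives extra bits to the first n/8 layers, the last n/8 layers, and every third layer in between, which is exactly the "first few / last few / jump" pattern in the observations above:

```cpp
#include <cstdio>

// Paraphrase (from memory) of llama.cpp's use_more_bits() heuristic:
// more bits for the first n/8 layers, the last n/8 layers, and every
// third layer in the middle stretch.
static bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers/8 || i_layer >= 7*n_layers/8 ||
           (i_layer - n_layers/8) % 3 == 2;
}

int main() {
    const int n_layers = 32;  // Llama-3-8B block count
    for (int i = 0; i < n_layers; ++i)
        if (use_more_bits(i, n_layers)) printf("layer %2d gets more bits\n", i);
    return 0;
}
```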