-
An automated per-model strategy for distributing the bits would be great to have. I am not sure of the best way to achieve it. At some point I was thinking about a tool that compares the activations per layer and applies some optimization strategy to improve the distribution of bits (#2783). We now have the tools (e.g. …
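A minimal sketch of what the back end of such a tool could look like, under stated assumptions: `layer_error` is a hypothetical per-layer sensitivity metric (e.g. activation RMSE between fp16 and quantized runs on a calibration set), and extra bits are handed out greedily to the most sensitive layers under a fixed budget. This is an illustration of the idea, not llama.cpp code:

```cpp
// Sketch: greedy per-layer bit allocation from a per-layer error metric.
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>

int main() {
    // Hypothetical per-layer sensitivities, e.g. obtained by comparing
    // fp16 vs quantized activations on a calibration set.
    std::vector<double> layer_error = {0.9, 0.7, 0.3, 0.2, 0.2, 0.3, 0.6};
    const int n_layers   = (int) layer_error.size();
    const int budget     = 3;                 // how many layers may be upgraded
    std::vector<int> bits(n_layers, 4);       // every layer starts at Q4

    // Max-heap of (error, layer): upgrade the most sensitive layers first.
    std::priority_queue<std::pair<double,int>> pq;
    for (int i = 0; i < n_layers; ++i) pq.push({layer_error[i], i});
    for (int k = 0; k < budget && !pq.empty(); ++k) {
        bits[pq.top().second] = 6;            // bump the worst offender to Q6
        pq.pop();
    }
    for (int i = 0; i < n_layers; ++i)
        printf("layer %d -> Q%d_K (err %.2f)\n", i, bits[i], layer_error[i]);
    return 0;
}
```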
-
I wonder whether the `use_more_bit` strategy of giving different layers different bit widths, inspired by Llama-v1, is still the better policy for Llama-v3. For Llama-3-8B, my experiment compares three different plans, with the token embedding at Q4_K and the LM head at Q6_K:
Observations:
1. The first few layers are important for generation quality.
2. The last few layers, and the selection of a few layers at a fixed jump, may not be as important for generation quality as the `use_more_bit` strategy originally assumed.

Maybe we need mixed-precision insights tailored to each LLM family rather than `use_more_bit` alone, or some toolkits for quantization sensitivity analysis.
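For context, the heuristic being questioned is, as far as I recall, shaped like the following (a paraphrase of llama.cpp's `use_more_bits` helper; consult the current source for the authoritative version). It gives extra bits to the first n/8 layers, the last n/8 layers, and every third layer in between, which is exactly the "first few / last few / jump" pattern in the observations above:

```cpp
#include <cstdio>

// Paraphrase (from memory) of llama.cpp's use_more_bits() heuristic:
// more bits for the first n/8 layers, the last n/8 layers, and every
// third layer in the middle stretch.
static bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers/8 || i_layer >= 7*n_layers/8 ||
           (i_layer - n_layers/8) % 3 == 2;
}

int main() {
    const int n_layers = 32;  // Llama-3-8B block count
    for (int i = 0; i < n_layers; ++i)
        if (use_more_bits(i, n_layers)) printf("layer %2d gets more bits\n", i);
    return 0;
}
```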