[FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 model with kv_scale #1835
Motivation
Reuse FP8-quantized Mixtral models. To avoid a `KeyError`, just skip/ignore `kv_scale` for now. E.g., amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV has `kv_scale` embedded; run the following command to ignore them for now:

```bash
python -m sglang.bench_latency --model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV --tp 8 --batch-size 32 --input 1024 --output 256 --quant fp8
```

A similar command works for neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8.
Modifications
Skip loading named parameters that end with `.kv_scale` (sketched below). Non-scaled FP8 KV cache still works; just add `--kv-cache-dtype fp8_e5m2`. Later, we will revisit this as part of the e4m3, etc., KV scaling format design.
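
For illustration, a minimal sketch of the skip logic in a weight-loading loop (the `load_weights` helper and the `params_dict`/`weights` names are placeholders, not the exact code touched by this PR):

```python
# Sketch only: ignore pre-quantized kv_scale entries during weight loading
# so checkpoints such as amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV no longer
# trigger a KeyError for parameters the model does not define.
def load_weights(params_dict, weights):
    for name, loaded_weight in weights:
        # Pre-quantized FP8 checkpoints may ship extra kv_scale tensors
        # (e.g. "...self_attn.kv_scale"); skip them for now.
        if name.endswith(".kv_scale") and name not in params_dict:
            continue
        param = params_dict[name]  # this lookup raised KeyError before
        param.data.copy_(loaded_weight)
```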
Checklist