[FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 model with kv_scale #1835
Motivation
Reuse FP8-quantized Mixtral models. To avoid a `KeyError`, just skip/ignore `kv_scale` for now. E.g., amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV has `kv_scale` embedded; run the following command to ignore them for now:

```bash
python -m sglang.bench_latency --model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV --tp 8 --batch-size 32 --input 1024 --output 256 --quant fp8
```

A similar command works for neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8.
Modifications
Skip loading named parameters that end with `.kv_scale` (sketched below). Non-scaled FP8 KV cache still works; just add `--kv-cache-dtype fp8_e5m2`. Later, we will revisit this as part of the e4m3, etc., KV scaling format design.
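
For illustration, a minimal sketch of the skip logic in a weight-loading loop (the `load_weights` helper and the `params_dict`/`weights` names are placeholders, not the exact code touched by this PR):

```python
# Sketch only: ignore pre-quantized kv_scale entries during weight loading
# so checkpoints such as amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV no longer
# trigger a KeyError for parameters the model does not define.
def load_weights(params_dict, weights):
    for name, loaded_weight in weights:
        # Pre-quantized FP8 checkpoints may ship extra kv_scale tensors
        # (e.g. "...self_attn.kv_scale"); skip them for now.
        if name.endswith(".kv_scale") and name not in params_dict:
            continue
        param = params_dict[name]  # this lookup raised KeyError before
        param.data.copy_(loaded_weight)
```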
Checklist