
[FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 model with kv_scale #1835

Merged
merged 1 commit into from
Oct 29, 2024

Conversation

HaiShaw (Collaborator) commented Oct 29, 2024

Motivation

Reuse FP8-quantized Mixtral models; to avoid a KeyError, just skip/ignore kv_scale for now.
E.g., amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV has kv_scale embedded; run the following command to ignore it for now:
python -m sglang.bench_latency --model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV --tp 8 --batch-size 32 --input 1024 --output 256 --quant fp8

A similar command works for neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8.

Modifications

Skip loading any named parameter that ends with .kv_scale (see the sketch below).
Unscaled FP8 KV cache still works; just add --kv-cache-dtype fp8_e5m2.
Later, we will revisit this as part of the e4m3 (etc.) KV-scaling format design.
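
A minimal sketch of the idea, assuming a generic name-based weight-loading loop (the load_weights helper and its signature here are illustrative, not the exact sglang code):

```python
import torch
from torch import nn


def load_weights(model: nn.Module, weights):
    """Copy (name, tensor) pairs into the model, ignoring KV-cache scales for now."""
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in weights:
        # Pre-quantized FP8 checkpoints (e.g. amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV)
        # carry ".kv_scale" entries with no matching model parameter; skip them
        # instead of letting params_dict[name] raise a KeyError.
        if name.endswith(".kv_scale"):
            continue
        param = params_dict[name]
        with torch.no_grad():
            param.copy_(loaded_weight)
```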

Checklist

  • [+] Format your code according to the Contributor Guide.
  • [+] Add unit tests as outlined in the Contributor Guide.
  • [+] Update documentation as needed, including docstrings or example tutorials.

merrymercy merged commit 54dd3ea into sgl-project:main on Oct 29, 2024
11 of 13 checks passed
