[Fused MoE] Add tuned fused MoE configs for Qwen2 57B and Mixtral 8x7B #2167
Conversation
The FP8 monkey patch hasn't succeeded yet. How does this config work for FP8? (ref)
cc @ispobock
I have not tested FP8 fused MoE in sglang yet, but I did test it in vllm.

Server:
python3 -m vllm.entrypoints.openai.api_server --model /mnt/bbuf/Qwen2-57B-A14B-Instruct-FP8 -tp 4 --trust-remote-code --max-model-len 4096 --gpu-memory-utilization 0.9 --disable-log-requests --swap-space 16 --kv-cache-dtype fp8 --distributed-executor-backend ray

Benchmark:
python3 benchmarks/benchmark_serving.py --model /mnt/bbuf/Qwen2-57B-A14B-Instruct-FP8 --num-prompts 500 --host 127.0.0.1 --save-result --result-dir result --sharegpt-output-len 256 --request-rate 32 --dataset-name sharegpt --dataset-path /mnt/bbuf/upstream-vllm/hongan-data/ShareGPT_V3_unfiltered_cleaned_split.json

Without tuning: [benchmark results screenshot not preserved]
With tuning: [benchmark results screenshot not preserved]

We can use this optimized config on the RTX 4090 once sglang's FP8 fused_moe is ready.
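For readers unfamiliar with these tuning configs: vllm-style fused MoE configs are JSON files mapping a batch size (number of tokens) to Triton kernel launch parameters, and at runtime the kernel uses the entry whose batch size is closest to the actual token count. The sketch below illustrates that scheme; the file name, helper names, and block-size numbers are my assumptions for illustration, not the tuned values added in this PR.

```python
import json

# Hypothetical file name; real configs encode expert count (E), shard size (N),
# and GPU model, e.g. "E=64,N=640,device_name=NVIDIA_GeForce_RTX_4090.json".
CONFIG_PATH = "E=64,N=640,device_name=NVIDIA_GeForce_RTX_4090.json"

def load_moe_configs(path: str) -> dict[int, dict]:
    """Load a tuned config file: batch size -> Triton launch parameters."""
    with open(path) as f:
        return {int(m): cfg for m, cfg in json.load(f).items()}

def pick_config(configs: dict[int, dict], num_tokens: int) -> dict:
    # Use the tuned entry whose batch size is nearest the actual token count,
    # mirroring the nearest-key heuristic used for these config files.
    best_m = min(configs, key=lambda m: abs(m - num_tokens))
    return configs[best_m]

# Illustrative values only -- not the numbers contributed in this PR.
example = {
    1:  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32,  "BLOCK_SIZE_K": 64,
         "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 4},
    64: {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
         "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}
print(pick_config(example, num_tokens=48))  # falls back to the 64-token entry
```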
Using FP8 quantization for inference on the RTX 4090 can significantly improve the performance of both Qwen2-57B and Mixtral 8x7B: FP8 halves weight memory relative to FP16, and the 4090's Ada Lovelace tensor cores support FP8 natively.
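As a rough sanity check (my arithmetic, not figures from this PR), the memory math shows why FP8 matters on 24 GB cards: even though these are MoE models with a small active parameter count per token, every expert must still reside in GPU memory.

```python
# Back-of-the-envelope weight-memory estimate for FP16 vs. FP8.
# Parameter counts are the published totals for each model.
GiB = 1024**3

models = {
    "Qwen2-57B-A14B": 57e9,    # ~57B total parameters (14B active per token)
    "Mixtral-8x7B":   46.7e9,  # ~47B total parameters (~13B active per token)
}

for name, n_params in models.items():
    fp16 = n_params * 2 / GiB  # 2 bytes per weight
    fp8 = n_params * 1 / GiB   # 1 byte per weight
    print(f"{name}: FP16 ~{fp16:.0f} GiB, FP8 ~{fp8:.0f} GiB "
          f"(4x RTX 4090 tp4 = 96 GiB, before KV cache and activations)")
```

At FP16, Qwen2-57B's weights alone (~106 GiB) exceed the 96 GiB available across four 4090s, while at FP8 both models fit with headroom for the KV cache.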