
[Fused moe] add tuning fused configs for qwen2 57b and mixtral 8x7b #2167

Merged
15 commits merged into sgl-project:main on Nov 25, 2024

Conversation

@BBuf (Contributor) commented Nov 25, 2024

Using FP8 quantization for inference on RTX 4090 can significantly improve the performance of both Qwen2-57B and Mixtral 8x7B models.
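For context, a tuned fused MoE config of the kind added here is, in the usual vLLM/SGLang layout, a JSON file keyed by batch size that maps to Triton launch parameters for the fused MoE kernel, stored in a file named after the expert count, intermediate size, device name, and dtype. The sketch below is illustrative only, assuming that convention; the parameter values are placeholders, not the numbers added in this PR.

```python
# Minimal sketch of a tuned fused-MoE kernel config (illustrative values only,
# not the numbers added in this PR). In the usual vLLM/SGLang layout the config
# is a JSON file, named roughly like
# "E=<experts>,N=<intermediate>,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8.json",
# that maps a batch size (token count) to Triton launch parameters.
example_config = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 8,  "num_warps": 8, "num_stages": 4},
}

def pick_config(num_tokens: int, config: dict) -> dict:
    """Pick the tuned parameters whose batch-size key is closest to the
    current token count -- the usual lookup strategy at runtime."""
    best_key = min(config, key=lambda k: abs(int(k) - num_tokens))
    return config[best_key]

print(pick_config(48, example_config))  # -> the entry tuned for batch size 64
```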

@zhyncs (Member) left a comment

The FP8 monkey patch hasn't succeeded yet. How does this config work for FP8?

@zhyncs (Member) commented Nov 25, 2024

ref

@zhyncs (Member) commented Nov 25, 2024

cc @ispobock

@BBuf (Contributor, Author) commented Nov 25, 2024

> The FP8 monkey patch hasn't succeeded yet. How does this config work for FP8?

I have not tested FP8 fused MoE in SGLang yet, but I have tested it in vLLM:

Benchmark:

python3 -m vllm.entrypoints.openai.api_server --model /mnt/bbuf/Qwen2-57B-A14B-Instruct-FP8 -tp 4 --trust-remote-code --max-model-len 4096 --gpu-memory-utilization 0.9 --disable-log-requests --swap-space 16 --kv-cache-dtype fp8 --distributed-executor-backend ray 


python3 benchmarks/benchmark_serving.py --model /mnt/bbuf/Qwen2-57B-A14B-Instruct-FP8 --num-prompts 500 --host 127.0.0.1  --save-result --result-dir result --sharegpt-output-len 256 --request-rate 32 --dataset-name sharegpt --dataset-path /mnt/bbuf/upstream-vllm/hongan-data/ShareGPT_V3_unfiltered_cleaned_split.json

Without tuning:

[screenshot: serving benchmark results without the tuned config]

With tuning:

[screenshots: serving benchmark results with the tuned config]
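For reference, since --save-result was passed, each run writes a JSON summary into the --result-dir directory, so the baseline and tuned runs can also be compared programmatically. The helper below is a hypothetical sketch; the file names and the field names it reads (request_throughput, mean_ttft_ms, mean_tpot_ms) are assumptions about the benchmark_serving.py output format and may differ between vLLM versions.

```python
# Hypothetical helper for comparing the two --save-result JSON files
# (baseline vs. tuned run). The field names below are assumptions about the
# benchmark_serving.py output and may differ between vLLM versions.
import json

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def compare(baseline_path: str, tuned_path: str) -> None:
    base, tuned = load_metrics(baseline_path), load_metrics(tuned_path)
    for key in ("request_throughput", "mean_ttft_ms", "mean_tpot_ms"):
        if key in base and key in tuned:
            delta = (tuned[key] - base[key]) / base[key] * 100
            print(f"{key}: {base[key]:.2f} -> {tuned[key]:.2f} ({delta:+.1f}%)")

# compare("result/baseline.json", "result/tuned.json")  # hypothetical file names
```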

We can use this optimized config on the RTX 4090 once SGLang's FP8 fused_moe is ready.

@zhyncs merged commit dd44173 into sgl-project:main on Nov 25, 2024
1 of 13 checks passed