[Fused MoE] Add tuned fused MoE configs for Qwen2 57B and Mixtral 8x7B #2167
Conversation
The FP8 monkey patch hasn't succeeded yet. How does this config work for FP8? (ref)
cc @ispobock
I have not tested FP8 fused MoE in sglang yet, but I did test it in vllm.

Server:
python3 -m vllm.entrypoints.openai.api_server --model /mnt/bbuf/Qwen2-57B-A14B-Instruct-FP8 -tp 4 --trust-remote-code --max-model-len 4096 --gpu-memory-utilization 0.9 --disable-log-requests --swap-space 16 --kv-cache-dtype fp8 --distributed-executor-backend ray

Benchmark:
python3 benchmarks/benchmark_serving.py --model /mnt/bbuf/Qwen2-57B-A14B-Instruct-FP8 --num-prompts 500 --host 127.0.0.1 --save-result --result-dir result --sharegpt-output-len 256 --request-rate 32 --dataset-name sharegpt --dataset-path /mnt/bbuf/upstream-vllm/hongan-data/ShareGPT_V3_unfiltered_cleaned_split.json

Without tuning: [benchmark results screenshot not preserved]
With tuning: [benchmark results screenshot not preserved]

We can use this optimized config on the RTX 4090 once sglang's FP8 fused_moe is ready.
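For readers unfamiliar with these tuning configs: vllm-style fused MoE configs are JSON files mapping a batch size (number of tokens) to Triton kernel launch parameters, and at runtime the kernel uses the entry whose batch size is closest to the actual token count. The sketch below illustrates that scheme; the file name, helper names, and block-size numbers are my assumptions for illustration, not the tuned values added in this PR.

```python
import json

# Hypothetical file name; real configs encode expert count (E), shard size (N),
# and GPU model, e.g. "E=64,N=640,device_name=NVIDIA_GeForce_RTX_4090.json".
CONFIG_PATH = "E=64,N=640,device_name=NVIDIA_GeForce_RTX_4090.json"

def load_moe_configs(path: str) -> dict[int, dict]:
    """Load a tuned config file: batch size -> Triton launch parameters."""
    with open(path) as f:
        return {int(m): cfg for m, cfg in json.load(f).items()}

def pick_config(configs: dict[int, dict], num_tokens: int) -> dict:
    # Use the tuned entry whose batch size is nearest the actual token count,
    # mirroring the nearest-key heuristic used for these config files.
    best_m = min(configs, key=lambda m: abs(m - num_tokens))
    return configs[best_m]

# Illustrative values only -- not the numbers contributed in this PR.
example = {
    1:  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32,  "BLOCK_SIZE_K": 64,
         "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 4},
    64: {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
         "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}
print(pick_config(example, num_tokens=48))  # falls back to the 64-token entry
```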
Using FP8 quantization for inference on the RTX 4090 can significantly improve the performance of both Qwen2-57B and Mixtral 8x7B: FP8 halves weight memory relative to FP16, and the 4090's Ada Lovelace tensor cores support FP8 natively.
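As a rough sanity check (my arithmetic, not figures from this PR), the memory math shows why FP8 matters on 24 GB cards: even though these are MoE models with a small active parameter count per token, every expert must still reside in GPU memory.

```python
# Back-of-the-envelope weight-memory estimate for FP16 vs. FP8.
# Parameter counts are the published totals for each model.
GiB = 1024**3

models = {
    "Qwen2-57B-A14B": 57e9,    # ~57B total parameters (14B active per token)
    "Mixtral-8x7B":   46.7e9,  # ~47B total parameters (~13B active per token)
}

for name, n_params in models.items():
    fp16 = n_params * 2 / GiB  # 2 bytes per weight
    fp8 = n_params * 1 / GiB   # 1 byte per weight
    print(f"{name}: FP16 ~{fp16:.0f} GiB, FP8 ~{fp8:.0f} GiB "
          f"(4x RTX 4090 tp4 = 96 GiB, before KV cache and activations)")
```

At FP16, Qwen2-57B's weights alone (~106 GiB) exceed the 96 GiB available across four 4090s, while at FP8 both models fit with headroom for the KV cache.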