[Feature] Support fp8 e5m2 kv cache with flashinfer #1204

ispobock · 2024-08-25T08:22:49Z

Motivation

Support fp8 e5m2 kv cache with flashinfer.

Usage

Add --kv-cache-dtype fp8_e5m2 to enable this feature. Currently it only works when flashinfer is not disabled.

Performance & Accuracy

Tested with llama2-13b-chat on A100, the throughput increased by 17.8% without accuracy degradation.

Enable fp8_e5m2 kv cache	Throughput	MMLU (nsub=10) Avg Accuracy	gsm8k Accuracy
N	7.04 req/s	0.487	0.340
Y	8.30 req/s	0.488	0.334

Reproduce

# w/o fp8_e5m2 kv cache
python3 -m sglang.launch_server --model-path meta-llama/Llama-2-13b-chat-hf --port 30000 --trust-remote-code --disable-radix-cache --tp=1
# w/ fp8_e5m2 kv cache
python3 -m sglang.launch_server --model-path meta-llama/Llama-2-13b-chat-hf --port 30000 --trust-remote-code --disable-radix-cache --tp=1 --kv-cache-dtype fp8_e5m2

# benchmark
python3 -m sglang.bench_serving --backend sglang --tokenizer meta-llama/Llama-2-13b-chat-hf  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128

# evaluation
python3 benchmark/mmlu/bench_sglang.py --nsub 10
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319

The performance boost is model dependent. llama3-8b was also tested, but the performance was not improved.

zhyncs · 2024-08-25T08:25:16Z

Nice work! I'll review it asap. May we also support FP8 E4M3?

ispobock · 2024-08-25T08:28:05Z

May we also support FP8 E4M3?

FP8 E4M3 needs scale factor and calibration. We may add it in the future.

python/sglang/srt/mem_cache/memory_pool.py

zhyncs · 2024-08-25T17:30:34Z

python/sglang/srt/model_executor/model_runner.py

+        if self.server_args.kv_cache_dtype == "auto":
+            self.kv_cache_dtype = self.dtype
+        elif self.server_args.kv_cache_dtype == "fp8_e5m2":
+            if self.server_args.disable_flashinfer or self.server_args.enable_mla:


Currently, only FlashInfer is supported and not Triton, due to the issue of insufficient smem. This needs to be fixed in another PR.

zhyncs · 2024-08-25T17:31:55Z

python/sglang/srt/mem_cache/memory_pool.py

+        if cache_v.dtype != self.dtype:
+            cache_v = cache_v.to(self.dtype)
+        if self.store_dtype != self.dtype:
+            self.k_buffer[layer_id][loc] = cache_k.view(self.store_dtype)


workaround for float8_e5m2

Store as torch.uint8 because Tensor index_put is not implemented for torch.float8_e5m2

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

python/sglang/srt/mem_cache/memory_pool.py

qeternity · 2024-09-30T23:24:59Z

Sorry to dig this up but - are we suggesting that fp8 kv cache increased accuracy in both mmlu and gsm8k? Are we sure we don't have those values in the table reversed?

ispobock · 2024-10-12T15:56:01Z

@qeternity In the previous evaluation, I tested gsm8k with 200 questions (default setting in the benchmark script), so the result may not be reliable enough. I just test all the datasets and update the result in the table.

flashinfer fp8 e5m2 kv cache

ada0151

zhyncs requested review from Ying1123, merrymercy, zhyncs and hnyls2002 August 25, 2024 08:23

zhyncs self-assigned this Aug 25, 2024

zhyncs requested a review from yzh119 August 25, 2024 08:25

Merge branch 'main' into fp8_kv_cache

05c96f8

zhyncs added the feature label Aug 25, 2024

zhyncs mentioned this pull request Aug 25, 2024

Development Roadmap (2024 Q3) #634

Closed

29 tasks

fix ut

831e2ab

zhyncs approved these changes Aug 25, 2024

View reviewed changes

zhyncs added 2 commits August 26, 2024 03:29

Merge branch 'main' into fp8_kv_cache

a403831

Merge branch 'main' into fp8_kv_cache

c104cb5

merrymercy approved these changes Aug 25, 2024

View reviewed changes

python/sglang/srt/mem_cache/memory_pool.py Outdated Show resolved Hide resolved

zhyncs reviewed Aug 25, 2024

View reviewed changes

fix comment

47e0d2b

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

zhyncs enabled auto-merge (squash) August 25, 2024 17:37

merrymercy changed the title ~~Support fp8 e5m2 kv cache with flashinfer~~ [Feature] Support fp8 e5m2 kv cache with flashinfer Aug 25, 2024

merrymercy reviewed Aug 25, 2024

View reviewed changes

python/sglang/srt/mem_cache/memory_pool.py Outdated Show resolved Hide resolved

merrymercy reviewed Aug 25, 2024

View reviewed changes

python/sglang/srt/mem_cache/memory_pool.py Outdated Show resolved Hide resolved

merrymercy reviewed Aug 25, 2024

View reviewed changes

python/sglang/srt/mem_cache/memory_pool.py Outdated Show resolved Hide resolved

Apply suggestions from code review

065442a

merrymercy disabled auto-merge August 26, 2024 00:38

merrymercy merged commit 2c615d1 into sgl-project:main Aug 26, 2024
4 of 5 checks passed

zhyncs mentioned this pull request Aug 30, 2024

[Feature] support W8A8(FP8) and KV Cache FP8 for DeepSeek V2 #1156

Closed

2 tasks

ispobock mentioned this pull request Sep 1, 2024

Support Triton fp8 e5m2 kv cache #1286

Merged

3 tasks

zhyncs mentioned this pull request Sep 4, 2024

fp8 support for old gpus flashinfer-ai/flashinfer#489

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Support fp8 e5m2 kv cache with flashinfer #1204

[Feature] Support fp8 e5m2 kv cache with flashinfer #1204

ispobock commented Aug 25, 2024 •

edited

Loading

zhyncs commented Aug 25, 2024

ispobock commented Aug 25, 2024

zhyncs Aug 25, 2024

zhyncs Aug 25, 2024

qeternity commented Sep 30, 2024

ispobock commented Oct 12, 2024 •

edited

Loading

[Feature] Support fp8 e5m2 kv cache with flashinfer #1204

[Feature] Support fp8 e5m2 kv cache with flashinfer #1204

Conversation

ispobock commented Aug 25, 2024 • edited Loading

Motivation

Usage

Performance & Accuracy

zhyncs commented Aug 25, 2024

ispobock commented Aug 25, 2024

zhyncs Aug 25, 2024

Choose a reason for hiding this comment

zhyncs Aug 25, 2024

Choose a reason for hiding this comment

qeternity commented Sep 30, 2024

ispobock commented Oct 12, 2024 • edited Loading

ispobock commented Aug 25, 2024 •

edited

Loading

ispobock commented Oct 12, 2024 •

edited

Loading