Llama3.1: Median decode latency is high with batch size 128 on the Triton backend #1935
-
@Jackycheng0808 Nice analysis!
-
Hi @Jackycheng0808 Could you test with …
-
We might find the problem; you can also add …
-
@Jackycheng0808 Thanks for reporting the issue and providing helpful data for debugging! It will be fixed in #2134.
-
Docker Image: sglang 0.3.4.post2
Hardware: H200
Command: python3 -m sglang.bench_latency --batch-size 128 --input 128 --output 128 --model "amd/Meta-Llama-3.1-8B-Instruct-FP8-KV" --quantization fp8 --tp 1
Hi, I am testing Llama 3.1 on different attention backends (Triton and FlashInfer) and have found some unusual behavior on the Triton backend. The median decode latency and total latency increase significantly from batch size 64 to batch size 128, then drop back down at batch size 256. At batch size 128, the Triton backend is about 5x slower than the FlashInfer backend. After profiling, I found that the _fwd_grouped_kernel_stage1 kernel takes ~90% of the execution time, while the BatchDecode kernel of the FlashInfer engine takes only 24%. I am wondering if there might be an issue in the _fwd_grouped_kernel_stage1 Triton kernel implementation?
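Not part of the original report, but for anyone reproducing the profiling step: below is a minimal, hedged sketch of how a kernel-level time breakdown like the one quoted above can be collected with torch.profiler. The nn.Linear workload is purely a stand-in; the actual measurement should wrap the decode loop driven by sglang.bench_latency (or attach a profiler to the running server) instead.
```python
# Minimal sketch (not the author's method): collect a per-kernel CUDA time
# breakdown with torch.profiler. The Linear layer below is only a placeholder
# for one decode step of the real model.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder workload; the report's numbers come from sglang.bench_latency, not this toy.
model = torch.nn.Linear(4096, 4096, bias=False).to(device)
batch = torch.randn(128, 4096, device=device)  # batch size 128, as in the report

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad(), profile(activities=activities) as prof:
    for _ in range(20):  # a few decode-like iterations
        model(batch)
    if device == "cuda":
        torch.cuda.synchronize()

# Sort by total device time to see which kernel dominates
# (e.g. _fwd_grouped_kernel_stage1 on the Triton backend).
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
```
In the resulting table, the per-kernel share of total CUDA time is what the ~90% (Triton) vs. 24% (FlashInfer) figures above refer to.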