# Performance of TensorRT-LLM

This document summarizes performance measurements of TensorRT-LLM on H100 (Hopper), L40S (Ada) and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be treated as the peak performance that TensorRT-LLM can deliver.

## Methodology

The performance numbers below were collected using the methodology described in the benchmarks folder.

## High Throughput

The tables below provide reference data at large batch sizes, representative of high-throughput tasks. A sketch of how the throughput metric can be derived is given after the tables in this section.

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 |
| LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 |
| LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 843 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 |
| Falcon 180B | 96 | 8 | 128 | 128 | 2,686 |
| Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 465 |

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 616 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 581 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 |

### A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,679 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,558 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 526 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 650 |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,486 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,459 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 529 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 592 |
| LLaMA 70B | 64 | 4 | 128 | 128 | 1,237 |
| LLaMA 70B | 64 | 4 | 128 | 2048 | 1,181 |
| LLaMA 70B | 64 | 4 | 2048 | 128 | 272 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 738 |
| Falcon 180B | 64 | 8 | 128 | 128 | 929 |
| Falcon 180B | 64 | 8 | 128 | 2048 | 923 |
| Falcon 180B | 64 | 8 | 2048 | 128 | 202 |

(1) TP stands for Tensor Parallelism.
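
The following minimal Python sketch shows one plausible way to derive the "Throughput (out tok/s)" metric reported above. It assumes throughput counts generated (output) tokens across the whole batch divided by the wall-clock time of the generation call; the function and variable names are illustrative and are not part of the TensorRT-LLM API.

```python
# Minimal sketch: deriving an output-token throughput number from a timed run.
# Assumption: throughput = (batch_size * output_length) / end-to-end generation time.
import time

def output_throughput(generate_fn, batch_size: int, output_length: int) -> float:
    """Return output tokens per second for one batched generation call."""
    start = time.perf_counter()
    generate_fn()  # runs the batched generation: batch_size sequences, output_length new tokens each
    elapsed = time.perf_counter() - start
    return (batch_size * output_length) / elapsed

# Example: 64 sequences generating 128 tokens each in roughly 0.75 s would
# report about 10,900 out tok/s, in line with the first GPT-J 6B row above.
```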

## Low Latency

The tables below provide reference data at batch size 1 for first-token latency, representative of the latency perceived by end users in online streaming tasks.

### H100 GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 7 |
| GPT-J 6B | 1 | 1 | 2048 | 29 |
| LLaMA 7B | 1 | 1 | 128 | 7 |
| LLaMA 7B | 1 | 1 | 2048 | 36 |
| LLaMA 70B | 1 | 4 | 128 | 26 |
| LLaMA 70B | 1 | 4 | 2048 | 109 |
| Falcon 180B | 1 | 8 | 128 | 27 |
| Falcon 180B | 1 | 8 | 2048 | 205 |

### L40S GPUs (FP8)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 71 |
| LLaMA 7B | 1 | 1 | 128 | 14 |
| LLaMA 7B | 1 | 1 | 2048 | 73 |

### A100 GPUs (FP16)

| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--- | ---: | ---: | ---: | ---: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 129 |
| LLaMA 7B | 1 | 1 | 128 | 16 |
| LLaMA 7B | 1 | 1 | 2048 | 133 |
| LLaMA 70B | 1 | 4 | 128 | 47 |
| LLaMA 70B | 1 | 4 | 2048 | 377 |
| Falcon 180B | 1 | 8 | 128 | 61 |
| Falcon 180B | 1 | 8 | 2048 | 509 |

(1) TP stands for Tensor Parallelism.

## Known Issues

The following issues are being addressed to improve the efficiency of TensorRT-LLM.

### Fused Matmul + Gated-SiLU (LLaMA)

There are several possible implementations of a Matmul followed by a Gated-SiLU activation. The simplest one uses two Matmul operations and combines their results in a separate CUDA kernel; that is the current implementation in TensorRT-LLM. The next release will include a more efficient implementation that runs a single Matmul, as illustrated conceptually below.
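
The following PyTorch sketch is only a conceptual illustration of the two approaches, not TensorRT-LLM code; the tensor names and sizes are illustrative. It shows that concatenating the gate and up projection weights lets a single Matmul replace the two separate ones, with the elementwise Gated-SiLU combination applied afterwards in both cases.

```python
# Conceptual sketch (not TensorRT-LLM code) contrasting the two implementations.
import torch
import torch.nn.functional as F

hidden, inter = 4096, 11008           # illustrative LLaMA-7B-like sizes
x = torch.randn(8, hidden)            # a small batch of hidden states
w_gate = torch.randn(hidden, inter)   # "gate" projection weights
w_up = torch.randn(hidden, inter)     # "up" projection weights

# Current approach: two Matmuls, then an elementwise kernel combines the results.
gate = x @ w_gate
up = x @ w_up
out_two_matmuls = F.silu(gate) * up

# Fused approach: a single Matmul against the concatenated weights,
# followed by a split and the same elementwise Gated-SiLU combination.
w_fused = torch.cat([w_gate, w_up], dim=1)
gate_f, up_f = (x @ w_fused).split(inter, dim=1)
out_single_matmul = F.silu(gate_f) * up_f

# Both paths compute the same result, up to floating-point rounding.
assert torch.allclose(out_two_matmuls, out_single_matmul, rtol=1e-4, atol=1e-3)
```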