qserve with tensorrt-llm is slower than awq int4 for llama2-7b #46

anaivebird opened this issue Nov 28, 2024 · 0 comments

qserve result:

| Metric | Batch Size 64 | Batch Size 128 |
| --- | --- | --- |
| Successful Request | 359 | 208 |
| Request_Gen_Token_Len | 1024 | 1024 |
| Avg_Input_Token_Len | 1737.53 | 1802.95 |
| Avg_Gen_Token_Len | 1000.3 | 994.21 |
| Elapse_Time (s) | 226.188 | 135.085 |
| Time_to_First_Token_AVG (s) | 9.957 | 36.664 |
| Time_to_First_Token_P99 (s) | 30.965 | 62.527 |
| Time_per_Output_Token_AVG (s) | 0.029 | 0.028 |
| Time_per_Output_Token_P99 (s) | 0.03 | 0.045 |
| Latency_P90 (s) | 57.549 | 88.988 |
| Latency_P95 (s) | 58.187 | 90.888 |
| Latency_P99 (s) | 61.007 | 92.339 |
| Latency_AVG (s) | 34.043 | 33.051 |
| Token QPS (token/s) | 1587.65 | 1530.85 |
| Service QPS (req/s) | 1.59 | 1.54 |

awq result:

| Metric | Batch Size 64 | Batch Size 128 |
| --- | --- | --- |
| Successful Request | 369 | 177 |
| Request_Gen_Token_Len | 1024 | 1024 |
| Avg_Input_Token_Len | 1726.56 | 1804.7 |
| Avg_Gen_Token_Len | 952.3 | 931.08 |
| Elapse_Time (s) | 212.125 | 105.276 |
| Time_to_First_Token_AVG (s) | 8.244 | 30.793 |
| Time_to_First_Token_P99 (s) | 29.357 | 59.689 |
| Time_per_Output_Token_AVG (s) | 0.029 | 0.028 |
| Time_per_Output_Token_P99 (s) | 0.062 | 0.072 |
| Latency_P90 (s) | 53.352 | 72.126 |
| Latency_P95 (s) | 55.721 | 86.212 |
| Latency_P99 (s) | 58.419 | 88.854 |
| Latency_AVG (s) | 31.806 | 24.425 |
| Token QPS (token/s) | 1656.56 | 1565.43 |
| Service QPS (req/s) | 1.74 | 1.68 |
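
The reported Token QPS can be cross-checked from the other columns: in every run it equals Successful Request × Avg_Gen_Token_Len / Elapse_Time to within rounding. A minimal sketch of that check, with the values copied from the summaries above:

```python
# Cross-check: Token QPS ~= successful_requests * avg_gen_token_len / elapse_time
# Values are copied from the performance summaries above.
runs = {
    "qserve bs64":  (359, 1000.30, 226.188, 1587.65),
    "qserve bs128": (208,  994.21, 135.085, 1530.85),
    "awq bs64":     (369,  952.30, 212.125, 1656.56),
    "awq bs128":    (177,  931.08, 105.276, 1565.43),
}
for name, (reqs, avg_gen, elapsed, reported) in runs.items():
    computed = reqs * avg_gen / elapsed
    print(f"{name:12s} computed {computed:7.2f} tok/s  reported {reported:7.2f} tok/s")
```

By that measure qserve trails awq int4 by roughly 4% at batch size 64 (1587.65 vs 1656.56 token/s) and roughly 2% at batch size 128 (1530.85 vs 1565.43 token/s).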

build commands:

#qserve engine build

cd /app/tensorrt_llm/examples/llama
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b  \
                             --dtype float16  \
                             --quant_ckpt_path  /root/quant-llama2-7b \
                             --use_qserve  \
                             --per_group  \
                             --tp_size 1

trtllm-build --checkpoint_dir /root/trtllm-llama2-7b \
             --output_dir /root/engine-llama2-7b \
             --gemm_plugin auto


#awq int4 engine build

convert_script=../llama/convert_checkpoint.py
quantize_script=../quantization/quantize.py
model_dir=/root/llama2-7b
output_dir=/root/awq-llama2-7b
tp=1
python3 ../quantization/quantize.py --model_dir ${model_dir} \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir $output_dir/llama-checkpoint-awq-int4-${tp}gpu/ \
                                   --calib_size 128 \
                                   --batch_size 1 \
                                   --calib_max_seq_length 2048

trtllm-build --checkpoint_dir $output_dir/llama-checkpoint-awq-int4-${tp}gpu/ \
             --output_dir $output_dir/llama-trt-engine-awq-int4-${tp}gpu/ \
             --gemm_plugin float16 \
             --use_paged_context_fmha enable \
             --max_num_tokens 13120 \
             --max_seq_len 4096 \
             --max_batch_size 128
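
Note that the two trtllm-build invocations are not symmetric: the awq build pins --max_num_tokens, --max_seq_len and --max_batch_size and enables paged context FMHA, while the qserve build leaves those at their defaults, so part of the throughput gap may come from build settings rather than the quantization scheme itself. One quick way to see what each engine was actually built with is to read the config.json in each engine directory; a small sketch, assuming a recent TensorRT-LLM layout where config.json carries build_config and pretrained_config.quantization:

```python
import json
from pathlib import Path

# Engine directories taken from the build commands above.
engines = {
    "qserve": "/root/engine-llama2-7b",
    "awq":    "/root/awq-llama2-7b/llama-trt-engine-awq-int4-1gpu",
}

for name, path in engines.items():
    cfg = json.loads((Path(path) / "config.json").read_text())
    build = cfg.get("build_config", {})
    quant = cfg.get("pretrained_config", {}).get("quantization", {})
    print(f"{name}: quant_algo={quant.get('quant_algo')} "
          f"max_batch_size={build.get('max_batch_size')} "
          f"max_num_tokens={build.get('max_num_tokens')} "
          f"max_seq_len={build.get('max_seq_len')}")
```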
