qserve with tensorrt-llm is slower than awq int4 for llama2-7b #46

anaivebird opened this issue Nov 28, 2024 · 0 comments

qserve result:

| Metric | Batch Size 64 | Batch Size 128 |
| --- | --- | --- |
| Successful Request | 359 | 208 |
| Request_Gen_Token_Len | 1024 | 1024 |
| Avg_Input_Token_Len | 1737.53 | 1802.95 |
| Avg_Gen_Token_Len | 1000.3 | 994.21 |
| Elapse_Time (s) | 226.188 | 135.085 |
| Time_to_First_Token_AVG (s) | 9.957 | 36.664 |
| Time_to_First_Token_P99 (s) | 30.965 | 62.527 |
| Time_per_Output_Token_AVG (s) | 0.029 | 0.028 |
| Time_per_Output_Token_P99 (s) | 0.03 | 0.045 |
| Latency_P90 (s) | 57.549 | 88.988 |
| Latency_P95 (s) | 58.187 | 90.888 |
| Latency_P99 (s) | 61.007 | 92.339 |
| Latency_AVG (s) | 34.043 | 33.051 |
| Token QPS (token/s) | 1587.65 | 1530.85 |
| Service QPS (req/s) | 1.59 | 1.54 |

awq result:

| Metric | Batch Size 64 | Batch Size 128 |
| --- | --- | --- |
| Successful Request | 369 | 177 |
| Request_Gen_Token_Len | 1024 | 1024 |
| Avg_Input_Token_Len | 1726.56 | 1804.7 |
| Avg_Gen_Token_Len | 952.3 | 931.08 |
| Elapse_Time (s) | 212.125 | 105.276 |
| Time_to_First_Token_AVG (s) | 8.244 | 30.793 |
| Time_to_First_Token_P99 (s) | 29.357 | 59.689 |
| Time_per_Output_Token_AVG (s) | 0.029 | 0.028 |
| Time_per_Output_Token_P99 (s) | 0.062 | 0.072 |
| Latency_P90 (s) | 53.352 | 72.126 |
| Latency_P95 (s) | 55.721 | 86.212 |
| Latency_P99 (s) | 58.419 | 88.854 |
| Latency_AVG (s) | 31.806 | 24.425 |
| Token QPS (token/s) | 1656.56 | 1565.43 |
| Service QPS (req/s) | 1.74 | 1.68 |
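
The reported Token QPS can be cross-checked from the other columns: in every run it equals Successful Request × Avg_Gen_Token_Len / Elapse_Time to within rounding. A minimal sketch of that check, with the values copied from the summaries above:

```python
# Cross-check: Token QPS ~= successful_requests * avg_gen_token_len / elapse_time
# Values are copied from the performance summaries above.
runs = {
    "qserve bs64":  (359, 1000.30, 226.188, 1587.65),
    "qserve bs128": (208,  994.21, 135.085, 1530.85),
    "awq bs64":     (369,  952.30, 212.125, 1656.56),
    "awq bs128":    (177,  931.08, 105.276, 1565.43),
}
for name, (reqs, avg_gen, elapsed, reported) in runs.items():
    computed = reqs * avg_gen / elapsed
    print(f"{name:12s} computed {computed:7.2f} tok/s  reported {reported:7.2f} tok/s")
```

By that measure qserve trails awq int4 by roughly 4% at batch size 64 (1587.65 vs 1656.56 token/s) and roughly 2% at batch size 128 (1530.85 vs 1565.43 token/s).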

build commands:

#qserve engine build

cd /app/tensorrt_llm/examples/llama
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b  \
                             --dtype float16  \
                             --quant_ckpt_path  /root/quant-llama2-7b \
                             --use_qserve  \
                             --per_group  \
                             --tp_size 1

trtllm-build --checkpoint_dir /root/trtllm-llama2-7b \
             --output_dir /root/engine-llama2-7b \
             --gemm_plugin auto


#awq int4 engine build

convert_script=../llama/convert_checkpoint.py
quantize_script=../quantization/quantize.py
model_dir=/root/llama2-7b
output_dir=/root/awq-llama2-7b
tp=1
python3 ../quantization/quantize.py --model_dir ${model_dir} \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir $output_dir/llama-checkpoint-awq-int4-${tp}gpu/ \
                                   --calib_size 128 \
                                   --batch_size 1 \
                                   --calib_max_seq_length 2048

trtllm-build --checkpoint_dir $output_dir/llama-checkpoint-awq-int4-${tp}gpu/ \
             --output_dir $output_dir/llama-trt-engine-awq-int4-${tp}gpu/ \
             --gemm_plugin float16 \
             --use_paged_context_fmha enable \
             --max_num_tokens 13120 \
             --max_seq_len 4096 \
             --max_batch_size 128
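
Note that the two trtllm-build invocations are not symmetric: the awq build pins --max_num_tokens, --max_seq_len and --max_batch_size and enables paged context FMHA, while the qserve build leaves those at their defaults, so part of the throughput gap may come from build settings rather than the quantization scheme itself. One quick way to see what each engine was actually built with is to read the config.json in each engine directory; a small sketch, assuming a recent TensorRT-LLM layout where config.json carries build_config and pretrained_config.quantization:

```python
import json
from pathlib import Path

# Engine directories taken from the build commands above.
engines = {
    "qserve": "/root/engine-llama2-7b",
    "awq":    "/root/awq-llama2-7b/llama-trt-engine-awq-int4-1gpu",
}

for name, path in engines.items():
    cfg = json.loads((Path(path) / "config.json").read_text())
    build = cfg.get("build_config", {})
    quant = cfg.get("pretrained_config", {}).get("quantization", {})
    print(f"{name}: quant_algo={quant.get('quant_algo')} "
          f"max_batch_size={build.get('max_batch_size')} "
          f"max_num_tokens={build.get('max_num_tokens')} "
          f"max_seq_len={build.get('max_seq_len')}")
```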
