- This post uses binary sentiment classification as an example and deploys the model with ONNX + TensorRT.
- Note: pay attention to the versions of PyTorch, CUDA, TensorRT, etc. Some problems are version-related; for example, errors when converting to TRT may be caused by the TensorRT version.
- First, run the sentiment classification task and save the PyTorch weights.
- PyTorch's built-in torch.onnx.export() is used for the conversion; the conversion script is given in ONNX转换bert权重, and a minimal sketch is shown below.
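The following is only a rough sketch of the export step, assuming `model` is the trained BERT classifier built in the training script; the checkpoint path, opset version, and dynamic-axes choice are illustrative, while the input/output names match the trtexec command used later.

```python
# Hedged sketch of the ONNX export step. `model` is assumed to be the trained
# BERT sentiment classifier from the training script; paths and opset are illustrative.
import torch

model.load_state_dict(torch.load('bert_cls.pt', map_location='cpu'))  # weights saved after training
model.eval()

# Dummy inputs of shape (batch_size, seq_len); at runtime seq_len is always padded to 512
input_ids = torch.zeros(1, 512, dtype=torch.long)
segment_ids = torch.zeros(1, 512, dtype=torch.long)

torch.onnx.export(
    model,
    (input_ids, segment_ids),
    'bert_cls.onnx',
    opset_version=12,
    input_names=['input_ids', 'segment_ids'],
    output_names=['logits'],
    # only the batch dimension is exported as dynamic, matching the note further down
    dynamic_axes={'input_ids': {0: 'batch_size'},
                  'segment_ids': {0: 'batch_size'},
                  'logits': {0: 'batch_size'}},
)
```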
The semi-automatic installation procedure below follows the article "TensorRT 8.2.1.8 安装笔记(超全超详细)|Docker 快速搭建 TensorRT 环境"; you can also read that source document directly.
- Pull the matching image from the official registry (choose according to your CUDA version)
docker pull nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
- Run the image / create the container
docker run -it --name trt_test --gpus all -v /home/libo/tensorrt:/tensorrt nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 /bin/bash
- Download the TensorRT package (this step requires registering an NVIDIA account). I downloaded:
TensorRT-8.4.1.5.Linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz
- Back in the container, install TensorRT (cd to the tensorrt path inside the container and extract the tar package downloaded above)
tar -zxvf TensorRT-8.4.1.5.Linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz
- Add the environment variable
# install vim
apt-get update
apt-get install vim
vim ~/.bashrc
export LD_LIBRARY_PATH=/tensorrt/TensorRT-8.4.1.5/lib:$LD_LIBRARY_PATH
source ~/.bashrc
- Install Python (after installing, run python to check the installed version; it is needed in the next step)
apt-get install -y --no-install-recommends \
python3 \
python3-pip \
python3-dev \
python3-wheel &&\
cd /usr/local/bin &&\
ln -s /usr/bin/python3 python &&\
ln -s /usr/bin/pip3 pip;
- Install the TensorRT Python library with pip. Make sure to pip-install locally the whl bundled in the tar package that matches your Python version
cd TensorRT-8.4.1.5/python/
pip3 install tensorrt-8.4.1.5-cp38-none-linux_x86_64.whl  # cp38 matches the Python 3.8 installed on Ubuntu 20.04
- Test the TensorRT Python interface
import tensorrt
print(tensorrt.__version__)
- Conversion command
cd TensorRT-8.4.1.5/bin
./trtexec --onnx=/tensorrt/bert_cls.onnx --saveEngine=/tensorrt/bert_cls.trt --minShapes=input_ids:1x512,segment_ids:1x512 --optShapes=input_ids:1x512,segment_ids:1x512 --maxShapes=input_ids:20x512,segment_ids:20x512 --device=0
- Notes: 1) In testing, making both the batch_size and seq_len dimensions dynamic was very slow (100ms+), so only the batch_size dimension is kept dynamic and seq_len is always padded to 512; 2) reference material
- Inference script (onnx_tensorrt.py):
import numpy as np
from bert4torch.tokenizers import Tokenizer
import tensorrt as trt
import common
import time
from tqdm import tqdm
"""
a、获取 engine,建立上下文
"""
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
def get_engine(engine_file_path):
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
        return engine
engine_model_path = "bert_cls.trt"
# Build a TensorRT engine.
engine = get_engine(engine_model_path)
# Contexts are used to perform inference.
context = engine.create_execution_context()
"""
b、从engine中获取inputs, outputs, bindings, stream 的格式以及分配缓存
"""
def to_numpy(tensor):
    # pad every sequence to length 512 and convert to an int32 ndarray
    for i, item in enumerate(tensor):
        tensor[i] = item + [0] * (512 - len(item))
    return np.array(tensor, np.int32)
dict_path = '/tensorrt/vocab.txt'
tokenizer = Tokenizer(dict_path, do_lower_case=True)
sentences = ['你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门。']
input_ids, segment_ids = tokenizer.encode(sentences)
tokens_id = to_numpy(input_ids)
segment_ids = to_numpy(segment_ids)
context.active_optimization_profile = 0
origin_inputshape = context.get_binding_shape(0) # (1,-1)
origin_inputshape[0],origin_inputshape[1] = tokens_id.shape # (batch_size, max_sequence_length)
context.set_binding_shape(0, (origin_inputshape))
context.set_binding_shape(1, (origin_inputshape))
"""
c、输入数据填充
"""
inputs, outputs, bindings, stream = common.allocate_buffers_v2(engine, context)
inputs[0].host = tokens_id
inputs[1].host = segment_ids
"""
d、tensorrt推理
"""
trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
preds = np.argmax(trt_outputs, axis=1)
print("====preds====:",preds)
"""
e、测试耗时(不含构造数据)
"""
steps = 100
start = time.time()
for i in tqdm(range(steps)):
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    preds = np.argmax(trt_outputs, axis=1)
print('onnx+tensorrt (excluding input construction): ', (time.time()-start)*1000/steps, ' ms')
"""
f、测试耗时(含构造数据)
"""
steps = 100
start = time.time()
for i in tqdm(range(steps)):
    input_ids, segment_ids = tokenizer.encode(sentences)
    tokens_id = to_numpy(input_ids)
    segment_ids = to_numpy(segment_ids)
    inputs, outputs, bindings, stream = common.allocate_buffers_v2(engine, context)
    inputs[0].host = tokens_id
    inputs[1].host = segment_ids
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    preds = np.argmax(trt_outputs, axis=1)
print('onnx+tensorrt (including input construction): ', (time.time()-start)*1000/steps, ' ms')
- The required common.py (the script above only uses its allocate_buffers_v2 and do_inference_v2 helpers; a sketch follows below)
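The original common.py is not reproduced here. As a reference, the two helpers it must provide could look roughly like the following, adapted from the allocate_buffers / do_inference_v2 utilities in the official TensorRT Python samples; the actual common.py may differ in detail.

```python
# Possible sketch of common.py, adapted from the TensorRT sample utilities.
import pycuda.autoinit            # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt


class HostDeviceMem:
    """Pairs a pagelocked host buffer with its device buffer."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem


def allocate_buffers_v2(engine, context):
    """Allocate buffers using the context's (already set) dynamic binding shapes."""
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for i in range(engine.num_bindings):
        # use the concrete shape set via context.set_binding_shape, not the -1 engine shape
        size = trt.volume(context.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(i):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream


def do_inference_v2(context, bindings, inputs, outputs, stream):
    """Copy inputs to the device, run the engine asynchronously, copy outputs back."""
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    stream.synchronize()
    return [out.host for out in outputs]
```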
- Run results
Reading engine from file bert_cls.trt
onnx_tensorrt.py:44: DeprecationWarning: Use set_optimization_profile_async instead.
context.active_optimization_profile = 0
====preds====: [1]
100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 78.87it/s]
onnx+tensorrt (excluding input construction): 12.6955246925354 ms
100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 71.50it/s]
onnx+tensorrt (including input construction): 13.990252017974854 ms
- Test setup: batch_size=1, seq_len=202 (for TensorRT, both seq_len=202 and 512 were tested), iterations=100
Method | CPU | GPU
---|---|---
pytorch | 144ms | 29ms
onnx | 66ms | N/A
onnx+tensorrt | N/A | 7ms (len=202), 12ms (len=512)
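For reference, the onnx (CPU) row could be measured roughly as follows with onnxruntime. This is an assumed sketch (the original benchmark script is not shown in this post); the input names follow the trtexec command above, and the input dtype may need to be int32 depending on how the model was exported.

```python
# Assumed sketch of the onnx (CPU) baseline timing with onnxruntime.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('bert_cls.onnx', providers=['CPUExecutionProvider'])

batch_size, seq_len = 1, 512
feed = {
    'input_ids': np.zeros((batch_size, seq_len), dtype=np.int64),
    'segment_ids': np.zeros((batch_size, seq_len), dtype=np.int64),
}

steps = 100
start = time.time()
for _ in range(steps):
    logits = sess.run(None, feed)[0]
print('onnx (cpu): ', (time.time() - start) * 1000 / steps, ' ms')
```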
- Directory layout (the /tensorrt mount):
tensorrt
├─common.py
├─onnx_tensorrt.py
├─bert_cls.onnx
├─bert_cls.trt
└─TensorRT-8.4.1.5
- Docker image: 1) build it yourself following the steps above, or 2) pull the image uploaded by the author directly
docker pull tongjilibo/tensorrt:11.3.0-cudnn8-devel-ubuntu20.04-tensorrt8.4.1.5
docker run -it --name trt_torch --gpus all -v /home/libo/tensorrt:/tensorrt tongjilibo/tensorrt:11.3.0-cudnn8-devel-ubuntu20.04-tensorrt8.4.1.5 /bin/bash