ONNX+TensorRT

  • This doc uses binary sentiment classification as an example and deploys the model with ONNX + TensorRT.
  • Note: pay attention to the versions of pytorch, cuda, tensorrt, etc.; some problems are version-related, e.g. errors when converting to trt may be caused by the tensorrt version.

1. Convert the pytorch weights to onnx

  1. First run the sentiment classification task and save the pytorch weights.

  2. The conversion uses pytorch's built-in torch.onnx.export(); the conversion script is in the ONNX转换bert权重 (convert bert weights to ONNX) example. A minimal sketch follows this list.
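The sketch below is only illustrative: DummyClassifier is a placeholder standing in for the trained sentiment model from step 1 (load your real model instead), and only the batch dimension is marked dynamic so that seq_len stays fixed at 512, matching the trtexec shapes used later.

```python
import torch
import torch.nn as nn

class DummyClassifier(nn.Module):
    """Placeholder for the trained BERT sentiment classifier saved in step 1
    (inputs: input_ids, segment_ids; output: 2-class logits)."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(21128, 16)   # 21128 = bert-base-chinese vocab size
        self.seg_emb = nn.Embedding(2, 16)
        self.fc = nn.Linear(16, 2)

    def forward(self, input_ids, segment_ids):
        h = self.tok_emb(input_ids) + self.seg_emb(segment_ids)
        return self.fc(h.mean(dim=1))            # (batch_size, 2)

model = DummyClassifier().eval()
dummy_input_ids = torch.zeros(1, 512, dtype=torch.long)
dummy_segment_ids = torch.zeros(1, 512, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_input_ids, dummy_segment_ids),
    "bert_cls.onnx",
    input_names=["input_ids", "segment_ids"],
    output_names=["output"],
    # keep only batch_size dynamic; seq_len is padded to 512 downstream
    dynamic_axes={"input_ids": {0: "batch_size"},
                  "segment_ids": {0: "batch_size"},
                  "output": {0: "batch_size"}},
    opset_version=12,
)
```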

2. TensorRT environment setup

Follow the semi-automatic installation procedure from 《TensorRT 8.2.1.8 安装笔记(超全超详细)|Docker 快速搭建 TensorRT 环境》 (a TensorRT installation note); you can also read that source article directly.

  1. Download the matching image from the official registry (choose according to your specific cuda version)

```shell
docker pull nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
```

  2. Run the image / create the container

```shell
docker run -it --name trt_test --gpus all -v /home/libo/tensorrt:/tensorrt nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 /bin/bash
```

  3. Download the TensorRT package (this step requires registering an NVIDIA account); I downloaded TensorRT-8.4.1.5.Linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz

  4. Go back into the container and install TensorRT (cd to the tensorrt path inside the container and extract the tar downloaded above)

```shell
tar -zxvf TensorRT-8.4.1.5.Linux.x86_64-gnu.cuda-11.6.cudnn8.4.tar.gz
```

  5. Add the environment variable

```shell
# install vim
apt-get update
apt-get install vim

vim ~/.bashrc
# append the following line to ~/.bashrc, then reload it
export LD_LIBRARY_PATH=/tensorrt/TensorRT-8.4.1.5/lib:$LD_LIBRARY_PATH
source ~/.bashrc
```
  6. Install python (after installation, run python to check the installed version; it is needed in the next step)

```shell
apt-get install -y --no-install-recommends \
python3 \
python3-pip \
python3-dev \
python3-wheel &&\
cd /usr/local/bin &&\
ln -s /usr/bin/python3 python &&\
ln -s /usr/bin/pip3 pip;
```

  7. pip install the matching TensorRT library. Note: be sure to use pip to locally install the whl bundled in the tar that matches your python version.

```shell
cd TensorRT-8.4.1.5/python/
# pick the whl that matches your python version, e.g. cp38 for python3.8 on ubuntu20.04
pip3 install tensorrt-8.4.1.5-cp38-none-linux_x86_64.whl
```

  8. Test the TensorRT python API

```python
import tensorrt
print(tensorrt.__version__)
```

3. Convert onnx to a trt engine

  • Conversion command

```shell
cd TensorRT-8.4.1.5/bin
./trtexec --onnx=/tensorrt/bert_cls.onnx --saveEngine=/tensorrt/bert_cls.trt --minShapes=input_ids:1x512,segment_ids:1x512 --optShapes=input_ids:1x512,segment_ids:1x512 --maxShapes=input_ids:20x512,segment_ids:20x512 --device=0
```

  • Notes: 1) in testing, making both the batch_size and seq_len dimensions dynamic was very slow (100ms+), so only the batch_size dimension is kept dynamic and seq_len is always padded to 512; 2) see the reference material. The same engine can also be built through the TensorRT python API, as sketched below.
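The following sketch builds an equivalent engine from python with the TensorRT 8.x builder API instead of trtexec. It is not the script used in this doc, but the paths and optimization-profile shapes mirror the trtexec command above.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

# parse the exported onnx model
with open("/tensorrt/bert_cls.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse onnx model")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB workspace

# dynamic batch_size (1..20), fixed seq_len=512, same as the trtexec shapes
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 512), (1, 512), (20, 512))
profile.set_shape("segment_ids", (1, 512), (1, 512), (20, 512))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("/tensorrt/bert_cls.trt", "wb") as f:
    f.write(serialized_engine)
```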

4. Load the trt engine and run inference with tensorrt

```python
import numpy as np
from bert4torch.tokenizers import Tokenizer
import tensorrt as trt
import common  # local helper module (common.py, see section 6)
import time
from tqdm import tqdm

"""
a. Read the engine and create the execution context
"""
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def get_engine(engine_file_path):
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
        return engine

engine_model_path = "bert_cls.trt"
# Deserialize the TensorRT engine from file.
engine = get_engine(engine_model_path)
# Contexts are used to perform inference.
context = engine.create_execution_context()


"""
b、从engine中获取inputs, outputs, bindings, stream 的格式以及分配缓存
"""
def to_numpy(tensor):
    for i, item in enumerate(tensor):
        tensor[i] = item + [0] * (512-len(item))
    return np.array(tensor, np.int32)

dict_path = '/tensorrt/vocab.txt'
tokenizer = Tokenizer(dict_path, do_lower_case=True)
sentences = ['你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门去转转呢。你在干嘛呢?这几天外面的天气真不错啊,万里无云,阳光明媚的,我的心情也特别的好,我特别想出门。']
input_ids, segment_ids = tokenizer.encode(sentences)
tokens_id = to_numpy(input_ids)
segment_ids = to_numpy(segment_ids)

context.active_optimization_profile = 0                          # deprecated in newer TensorRT versions (see the DeprecationWarning in the output below)
origin_inputshape = context.get_binding_shape(0)                # (1,-1) 
origin_inputshape[0],origin_inputshape[1] = tokens_id.shape     # (batch_size, max_sequence_length)
context.set_binding_shape(0, (origin_inputshape))               
context.set_binding_shape(1, (origin_inputshape))

"""
c、输入数据填充
"""
inputs, outputs, bindings, stream = common.allocate_buffers_v2(engine, context)
inputs[0].host = tokens_id
inputs[1].host = segment_ids

"""
d、tensorrt推理
"""
trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
preds = np.argmax(trt_outputs, axis=1)
print("====preds====:",preds)

"""
e、测试耗时(不含构造数据)
"""
steps = 100
start = time.time()
for i in tqdm(range(steps)):
    common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    preds = np.argmax(trt_outputs, axis=1)
print('onnx+tensorrt(不含构造数据): ',  (time.time()-start)*1000/steps, ' ms')

"""
f、测试耗时(含构造数据)
"""
steps = 100
start = time.time()
for i in tqdm(range(steps)):
    input_ids, segment_ids = tokenizer.encode(sentences)
    tokens_id = to_numpy(input_ids)
    segment_ids = to_numpy(segment_ids)
    inputs, outputs, bindings, stream = common.allocate_buffers_v2(engine, context)
    inputs[0].host = tokens_id
    inputs[1].host = segment_ids

    common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    preds = np.argmax(trt_outputs, axis=1)
print('onnx+tensorrt(含构造数据): ',  (time.time()-start)*1000/steps, ' ms')
Output of a sample run:

```
Reading engine from file bert_cls.trt
onnx_tensorrt.py:44: DeprecationWarning: Use set_optimization_profile_async instead.
  context.active_optimization_profile = 0
====preds====: [1]
100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 78.87it/s]
onnx+tensorrt (excluding data preparation):  12.6955246925354  ms
100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 71.50it/s]
onnx+tensorrt (including data preparation):  13.990252017974854  ms
```
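The allocate_buffers_v2/do_inference_v2 helpers come from the local common.py, which appears to follow the pattern of NVIDIA's TensorRT python sample utilities. Below is a rough, illustrative re-implementation (an assumption, not the exact common.py of this repo): allocate pagelocked host and device buffers per binding, then copy host→device, run execute_async_v2, copy device→host and synchronize.

```python
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
import tensorrt as trt

class HostDeviceMem:
    """Pairs a pagelocked host buffer with its device buffer for one binding."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

def allocate_buffers_v2(engine, context):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in range(engine.num_bindings):
        # use the context shapes so dynamic dimensions are already resolved
        shape = context.get_binding_shape(binding)
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def do_inference_v2(context, bindings, inputs, outputs, stream):
    # host -> device, async execution, device -> host, then synchronize
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, np.ascontiguousarray(inp.host), stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    stream.synchronize()
    return [out.host for out in outputs]
```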

5. Speed comparison

  • Test setup: btz=1, seq_len=202 (for tensorrt both seq_len=202 and 512 were tested), iterations=100

| Scheme        | cpu   | gpu                           |
| ------------- | ----- | ----------------------------- |
| pytorch       | 144ms | 29ms                          |
| onnx          | 66ms  | ——                            |
| onnx+tensorrt | ——    | 7ms (len=202), 12ms (len=512) |
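For context, the sketch below shows one way the onnx (cpu) baseline could be timed with onnxruntime. It is an assumption about the benchmark setup rather than the script that produced the table above; the input names match the export, and the dtype must match what the exported graph expects (int64 for torch long inputs).

```python
import time
import numpy as np
import onnxruntime as ort

# illustrative CPU timing of the exported onnx model
sess = ort.InferenceSession("bert_cls.onnx", providers=["CPUExecutionProvider"])
input_ids = np.zeros((1, 512), dtype=np.int64)    # padded to the fixed length of 512
segment_ids = np.zeros((1, 512), dtype=np.int64)

steps = 100
start = time.time()
for _ in range(steps):
    sess.run(None, {"input_ids": input_ids, "segment_ids": segment_ids})
print("onnx (cpu):", (time.time() - start) * 1000 / steps, "ms")
```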

6. Experiment files

```
tensorrt
├─common.py
├─onnx_tensorrt.py
├─bert_cls.onnx
├─bert_cls.trt
└─TensorRT-8.4.1.5
```

  • docker image: 1) build it yourself as described above, or 2) directly pull the image uploaded by the author

```shell
docker pull tongjilibo/tensorrt:11.3.0-cudnn8-devel-ubuntu20.04-tensorrt8.4.1.5

docker run -it --name trt_torch --gpus all -v /home/libo/tensorrt:/tensorrt tongjilibo/tensorrt:11.3.0-cudnn8-devel-ubuntu20.04-tensorrt8.4.1.5 /bin/bash
```