diff --git a/applications/text_classification/README.md b/applications/text_classification/README.md index 90e062f88a44..90612866d494 100644 --- a/applications/text_classification/README.md +++ b/applications/text_classification/README.md @@ -17,7 +17,7 @@ 文本分类简单来说就是对给定的一个句子或一段文本使用分类模型分类。虽然文本分类在金融、医疗、法律、工业等领域都有广泛的成功实践应用,但如何选择合适的方案和预训练模型、数据标注质量差、效果调优困难、AI入门成本高、如何高效训练部署等问题使部分开发者望而却步。针对文本分类领域的痛点和难点,PaddleNLP文本分类应用提出了多种前沿解决方案,助力开发者简单高效实现文本分类数据标注、训练、调优、上线,降低文本分类落地技术门槛。
- 文本分类落地难点 + 文本分类落地难点
**文本分类应用技术特色:** @@ -36,7 +36,7 @@ ### 2.1 文本分类方案全覆盖
- image + image
#### 2.1.1 分类场景齐全 @@ -66,7 +66,7 @@
@@ -79,18 +79,18 @@ 【方案选择】提示学习(Prompt Learning)适用于**标注成本高、标注样本较少的文本分类场景**。在小样本场景中,相比于预训练模型微调学习,提示学习能取得更好的效果。对于标注样本充足、标注成本较低的场景,我们仍旧推荐使用充足的标注样本进行文本分类[预训练模型微调](#预训练模型微调)。 -【方案介绍】**提示学习的主要思想是将文本分类任务转换为构造提示中掩码 `[MASK]` 的分类预测任务**,也即在掩码 `[MASK]`向量后接入线性层分类器预测掩码位置可能的字或词。提示学习使用待预测字的预训练向量来初始化分类器参数(如果待预测的是词,则为词中所有字的预训练向量平均值),充分利用预训练语言模型学习到的特征和标签文本,从而降低样本需求。提示学习同时提供[ R-Drop](https://arxiv.org/abs/2106.14448) 和 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略,帮助提示模型效果。 +【方案介绍】**提示学习的主要思想是将文本分类任务转换为构造提示中掩码 `[MASK]` 的分类预测任务**,也即在掩码 `[MASK]`向量后接入线性层分类器预测掩码位置可能的字或词。提示学习使用待预测字的预训练向量来初始化分类器参数(如果待预测的是词,则为词中所有字的预训练向量平均值),充分利用预训练语言模型学习到的特征和标签文本,从而降低样本需求。提示学习同时提供[ R-Drop](https://arxiv.org/abs/2106.14448) 和 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略,帮助提升模型效果。 我们以下图情感二分类任务为例来具体介绍提示学习流程,分类任务标签分为 `0:负向` 和 `1:正向` 。在文本加入构造提示 `我[MASK]喜欢。` ,将情感分类任务转化为预测掩码 `[MASK]` 的待预测字是 `不` 还是 `很`。具体实现方法是在掩码`[MASK]`的输出向量后接入线性分类器(二分类),然后用`不`和`很`的预训练向量来初始化分类器进行训练,分类器预测分类为 `0:不` 或 `1:很` 对应原始标签 `0:负向` 或 `1:正向`。而预训练模型微调则是在预训练模型`[CLS]`向量接入随机初始化线性分类器进行训练,分类器直接预测分类为 `0:负向` 或 `1:正向`。
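为帮助理解上述流程,下面给出一个极简的 numpy 示意(并非 PaddleNLP 提示学习 API 的实现,emb、h_mask 等均为随机占位的假设数据),只演示"用标签词的预训练向量初始化线性分类器、再对 [MASK] 位置的输出向量做分类"这一核心思想:

```python
import numpy as np

# 极简示意:演示"用标签词的预训练向量初始化线性分类器"这一核心思想。
# emb 与 h_mask 均为随机占位(假设),实际应取自掩码语言模型的词向量表和 [MASK] 位置输出。
np.random.seed(0)
hidden_size = 8
label_words = {0: "不", 1: "很"}                       # 0:负向 / 1:正向 对应的标签词
emb = np.random.randn(len(label_words), hidden_size)   # 占位:标签词"不""很"的预训练向量

W, b = emb.copy(), np.zeros(len(label_words))          # 分类器参数用标签词向量初始化,而非随机初始化

def classify(h_mask):
    """由 [MASK] 位置的输出向量预测类别:0:负向 / 1:正向。"""
    return int(np.argmax(W @ h_mask + b))

text = "这家店的服务太好了。"
prompt_text = "我[MASK]喜欢。" + text                   # 构造提示后整体送入掩码语言模型
h_mask = np.random.randn(hidden_size)                   # 占位:模型在 [MASK] 处的输出向量
pred = classify(h_mask)
print(pred, label_words[pred])
```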
【方案效果】我们比较预训练模型微调与提示学习在多分类、多标签、层次分类小样本场景的模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值),可以看到在样本较少的情况下,提示学习比预训练模型微调有明显优势。
- 文本分类落地难点 + 文本分类落地难点
@@ -108,6 +108,10 @@ 【方案介绍】语义索引目标是从海量候选召回集中快速、准确地召回一批与输入文本语义相关的文本。基于语义索引的文本分类方法具体来说是将标签集作为召回目标集,召回与输入文本语义相似的标签作为文本的标签类别。 +
+ +
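为帮助理解,下面给出一个极简示意(其中 encode 为假设的占位函数,实际应替换为语义索引模型产出的句向量,与 retrieval_based 目录中的实现无关):把标签集编码成向量,召回与输入文本相似度最高的标签作为类别:

```python
import numpy as np

# 极简示意:把标签集当作召回目标集,召回与输入文本语义最相近的标签作为类别。
# encode() 是假设的占位实现,实际应替换为语义索引模型产出的句向量。
def encode(text, dim=16):
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

label_set = ["教育", "体育", "财经", "科技"]
label_vecs = np.stack([encode(label) for label in label_set])

def classify(text, top_k=1):
    q = encode(text)
    scores = label_vecs @ q                      # 向量已归一化,内积即余弦相似度
    return [label_set[i] for i in np.argsort(-scores)[:top_k]]

print(classify("央行宣布下调存款准备金率"))
```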
+ 【快速开始】 - 快速开启多分类任务参见 👉 [语义索引-多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/retrieval_based#readme) - 快速开启多标签分类任务参见 👉 [语义索引-多标签分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/retrieval_based#readme) @@ -136,24 +140,35 @@ 有这么一句话在业界广泛流传,"数据决定了机器学习的上限,而模型和算法只是逼近这个上限",可见数据质量的重要性。文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)开源了模型分析模块,针对标注数据质量不高、训练数据覆盖不足、样本数量少等文本分类常见数据痛点,提供稀疏数据筛选、脏数据清洗、数据增强三种数据优化方案,解决训练数据缺陷问题,用低成本方式获得大幅度的效果提升。 -- **稀疏数据筛选**基于特征相似度的实例级证据分析方法挖掘待预测数据中缺乏证据支持的数据(也即稀疏数据),并进行有选择的训练集数据增强或针对性筛选未标注数据进行标注来解决稀疏数据问题,有效提升模型表现。我们采用在多分类、多标签、层次分类场景中评测稀疏数据-数据增强策略和稀疏数据-数据标注策略,下图表明稀疏数据筛选方案在各场景能够有效提高模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值)。 +- **稀疏数据筛选**基于特征相似度的实例级证据分析方法挖掘待预测数据中缺乏证据支持的数据(也即稀疏数据),并进行有选择的训练集数据增强或针对性筛选未标注数据进行标注来解决稀疏数据问题,有效提升模型表现。 +
+ 文本分类落地难点 +
+我们在多分类、多标签、层次分类场景中评测了稀疏数据-数据增强策略和稀疏数据-数据标注策略,下图表明稀疏数据筛选方案在各场景能够有效提高模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值)。
- 文本分类落地难点 + 文本分类落地难点
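稀疏数据筛选的核心步骤可以用下面的示意代码理解(特征向量为随机占位的假设数据,并非 TrustAI 的实际实现):待预测样本与训练集特征的最大相似度越低,说明证据支持越少,越应优先进行数据增强或补充标注:

```python
import numpy as np

# 极简示意:用特征相似度近似"证据支持度"。
# 特征向量此处随机生成(假设),实际应取自训练好的分类模型。
np.random.seed(0)
train_feats = np.random.randn(100, 32)   # 训练集特征
pred_feats = np.random.randn(20, 32)     # 待预测数据特征

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

train_feats, pred_feats = l2_normalize(train_feats), l2_normalize(pred_feats)
support = (pred_feats @ train_feats.T).max(axis=1)   # 每条待预测数据与训练集的最大相似度
sparse_idx = np.argsort(support)[:5]                  # 支持度最低的样本视为稀疏数据
print("待优先增强或标注的稀疏数据下标:", sparse_idx.tolist())
```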
-- **脏数据清洗**基于表示点方法的实例级证据分析方法,计算训练数据对模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为脏数据(标注错误样本)。脏数据清洗方案通过高效识别训练集中脏数据(也即标注质量差的数据),有效降低人力检查成本。我们采用在多分类、多标签、层次分类场景中评测脏数据清洗方案,实验表明方案能够高效筛选出训练集中脏数据,提高模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值)。 +- **脏数据清洗**基于表示点方法的实例级证据分析方法,计算训练数据对模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为脏数据(标注错误样本)。脏数据清洗方案通过高效识别训练集中脏数据(也即标注质量差的数据),有效降低人力检查成本。
- 文本分类落地难点 + 文本分类落地难点 +
+我们在多分类、多标签、层次分类场景中评测了脏数据清洗方案,实验表明方案能够高效筛选出训练集中脏数据,提高模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值)。 + +
+ 文本分类落地难点
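脏数据清洗的使用方式可以用下面的示意代码理解(influence_scores 为随机占位的假设数据,实际分数应由 TrustAI 的表示点方法等实例级证据分析工具计算得到):按影响分数从高到低排序,只人工复核最可疑的一小部分训练数据:

```python
import numpy as np

# 极简示意:拿到每条训练数据的"影响分数"后,只复核分数最高的若干条。
# influence_scores 为随机占位(假设),实际应由实例级证据分析工具计算得到。
np.random.seed(0)
train_texts = [f"样本{i}" for i in range(50)]
influence_scores = np.random.rand(50)         # 占位:分数越高,越可能是标注错误样本

check_ratio = 0.1                              # 只复核打分最高的 10%,降低人力检查成本
num_check = max(1, int(len(train_texts) * check_ratio))
suspects = np.argsort(-influence_scores)[:num_check]
for i in suspects:
    print(f"疑似脏数据: {train_texts[i]} (score={influence_scores[i]:.3f})")
```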
- **数据增强**在数据量较少的情况下能够通过增加数据集多样性,提升模型效果。PaddleNLP内置[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),支持词替换、词删除、词插入、词置换、基于上下文生成词(MLM预测)、TF-IDF等多种数据增强策略。数据增强方案提供一行命令,快速完成数据集增强。以CAIL2019—婚姻家庭要素提取数据子集(500条)为例,我们在数据集应用多种数据增强策略,策略效果如下表。
- 文本分类落地难点 + 文本分类落地难点
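下面用纯 Python 给出"词删除"和"词置换"两种策略的极简示意(仅说明思路,并非上文数据增强 API 的实现;按字切分是为演示所做的假设简化):

```python
import random

# 极简示意:演示"词删除"与"词置换"两种增强策略的思路,非数据增强 API 的实现。
random.seed(0)

def word_delete(tokens, ratio=0.1):
    """按比例随机删除部分 token,至少保留原句。"""
    kept = [t for t in tokens if random.random() > ratio]
    return kept if kept else tokens

def word_swap(tokens, n=1):
    """随机交换 n 对 token 的位置。"""
    tokens = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

text = list("二婚后对方拒绝抚养婚生子")   # 示例句按字切分(假设简化)
print("".join(word_delete(text)))
print("".join(word_swap(text)))
```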
@@ -172,7 +187,7 @@ 文本分类应用提供了简单易用的数据标注-模型训练-模型调优-模型压缩-预测部署全流程方案,我们将以预训练模型微调方案为例介绍文本分类应用的全流程:
- image + image
@@ -198,7 +213,11 @@ **3.模型部署** -- 现实部署场景需要同时考虑模型的精度和性能表现。基于压缩API的模型裁剪能够进一步压缩模型体积,此外模型裁剪去掉了部分冗余参数的扰动,增加了模型的泛化能力,在部分任务预测精度得到提高。 +- 现实部署场景需要同时考虑模型的精度和性能表现,文本分类应用接入 PaddleNLP 模型压缩 API,采用 DynaBERT 中的宽度自适应裁剪策略,对预训练模型多头注意力机制中的头(Head)进行重要性排序,保证更重要的头(Head)不容易被裁掉,然后用原模型作为蒸馏过程中的教师模型,宽度更小的模型作为学生模型,蒸馏得到的学生模型就是我们裁剪得到的模型。实验表明模型裁剪能够有效缩小模型体积、减少内存占用、提升推理速度。模型裁剪去掉了部分冗余参数的扰动,增加了模型的泛化能力,在部分任务中预测精度得到提高。 + +
+ image +
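其中"按重要性排序并以保留比例裁剪注意力头"这一步可以用下面的示意代码理解(head_importance 为随机占位的假设数据,实际的重要性计算与蒸馏均由模型压缩 API 完成):

```python
import numpy as np

# 极简示意:只演示按保留比例 width_mult 保留重要性最高的注意力头这一步。
# head_importance 为随机占位(假设),实际由模型压缩 API 在裁剪过程中计算。
np.random.seed(0)
num_heads = 12                      # ERNIE 3.0 Medium 每层 12 个注意力头
width_mult = 0.75                   # 保留比例,12 * 0.75 = 9 个头
head_importance = np.random.rand(num_heads)

num_keep = int(num_heads * width_mult)
kept_heads = np.argsort(-head_importance)[:num_keep]   # 重要性最高的头被保留
print("保留的注意力头下标:", sorted(kept_heads.tolist()))
```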
- 模型部署需要将保存的最佳模型参数(动态图参数)导出成静态图参数,用于后续的推理部署。p.s.模型裁剪之后会默认导出静态图模型 diff --git a/applications/text_classification/hierarchical/README.md b/applications/text_classification/hierarchical/README.md index c13a704b3a7e..2585783e6d44 100644 --- a/applications/text_classification/hierarchical/README.md +++ b/applications/text_classification/hierarchical/README.md @@ -65,7 +65,7 @@ rm baidu_extract_2020.tar.gz - python >= 3.6 - paddlepaddle >= 2.3 -- paddlenlp >= 2.3.4 +- paddlenlp >= 2.4 - scikit-learn >= 1.0.2 **安装PaddlePaddle:** @@ -77,7 +77,7 @@ rm baidu_extract_2020.tar.gz 安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 ```shell -python3 -m pip install paddlenlp==2.3.4 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple ``` @@ -188,8 +188,6 @@ data/ ### 2.4 模型训练 - - #### 2.4.1 预训练模型微调 使用CPU/GPU训练,默认为GPU训练,使用CPU训练只需将设备参数配置改为`--device "cpu"`: @@ -200,19 +198,20 @@ python train.py \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 100 ``` - 如果在CPU环境下训练,可以指定`nproc_per_node`参数进行多核训练: ```shell python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" train.py \ --dataset_dir "data" \ - --device "gpu" \ + --device "cpu" \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 100 ``` 如果在GPU环境中使用,可以指定`gpus`参数进行单卡/多卡训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况。 @@ -225,7 +224,8 @@ python -m paddle.distributed.launch --gpus "0" train.py \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 100 ``` @@ -235,7 +235,7 @@ python -m paddle.distributed.launch --gpus "0" train.py \ * `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt,dev.txt和label.txt文件;默认为None。 * `save_dir`:保存训练模型的目录;默认保存在当前目录checkpoint文件夹下。 * `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 -* `model_name`:选择预训练模型,可选"ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-1.0-large-zh-cw";默认为"ernie-3.0-medium-zh"。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据任务复杂度和硬件条件进行选择。 * `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 * `learning_rate`:训练最大学习率;默认为3e-5。 * `epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 @@ -263,8 +263,8 @@ checkpoint/ **NOTE:** * 如需恢复模型训练,则可以设置 `--init_from_ckpt checkpoint/model_state.pdparams` 。 -* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en",更多可选模型可参考[Transformer预训练模型](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer)。 -* 英文和中文以外文本分类任务建议使用多语言预训练模型"ernie-m-base","ernie-m-large", 多语言模型暂不支持文本分类模型部署,相关功能正在加速开发中。 +* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 #### 2.4.2 训练评估与模型优化 
训练后的模型我们可以使用 [模型分析模块](./analysis) 对每个类别分别进行评估,并输出预测错误样本(bad case),默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: @@ -337,8 +337,13 @@ python predict.py --device "gpu" --max_seq_length 128 --batch_size 32 --dataset_ python export_model.py --params_path ./checkpoint/ --output_path ./export ``` -可支持配置的参数: +如果使用ERNIE M作为预训练模型,运行方式: +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export --multilingual +``` +可支持配置的参数: +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 * `params_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 * `output_path`:静态图图保存的参数路径;默认为"./export"。 @@ -397,9 +402,9 @@ python prune.py \ ```text prune/ ├── width_mult_0.75 -│   ├── float32.pdiparams -│   ├── float32.pdiparams.info -│   ├── float32.pdmodel +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel │   ├── model_state.pdparams │   └── model_config.json └── ... @@ -413,6 +418,7 @@ prune/ 3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 +4. **压缩API暂不支持多语言预训练模型ERNIE-M**,相关功能正在加紧开发中。 #### 2.5.3 部署方案 @@ -454,11 +460,13 @@ prune/ | | 模型结构 |Micro F1(%) | Macro F1(%) | latency(ms) | | -------------------------- | ------------ | ------------ | ------------ |------------ | -|ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|95.68|93.39| 4.63 | -|ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|95.26|93.22| 2.42| -|ERNIE 3.0 Mini|6-layer, 384-hidden, 12-heads|94.72|93.03| 0.93| -|ERNIE 3.0 Micro | 4-layer, 384-hidden, 12-heads|94.24|93.08| 0.70| -|ERNIE 3.0 Nano |4-layer, 312-hidden, 12-heads|93.98|91.25|0.54| +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|96.24|94.24 |5.59 | +|ERNIE 3.0 Xbase |20-layer, 1024-hidden, 16-heads|96.21|94.13| 5.51 | +|ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|95.68|93.39| 2.01 | +|ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|95.26|93.22| 1.01| +|ERNIE 3.0 Mini|6-layer, 384-hidden, 12-heads|94.72|93.03| 0.36| +|ERNIE 3.0 Micro | 4-layer, 384-hidden, 12-heads|94.24|93.08| 0.24| +|ERNIE 3.0 Nano |4-layer, 312-hidden, 12-heads|93.98|91.25|0.19| | ERNIE 3.0 Medium + 裁剪(保留比例3/4)|6-layer, 768-hidden, 9-heads| 95.45|93.40| 0.81 | | ERNIE 3.0 Medium + 裁剪(保留比例2/3)|6-layer, 768-hidden, 8-heads| 95.23|93.27 | 0.74 | | ERNIE 3.0 Medium + 裁剪(保留比例1/2)|6-layer, 768-hidden, 6-heads| 94.92 | 92.70| 0.61 | diff --git a/applications/text_classification/hierarchical/analysis/evaluate.py b/applications/text_classification/hierarchical/analysis/evaluate.py index 8e7241939409..f0db5a5d62cd 100644 --- a/applications/text_classification/hierarchical/analysis/evaluate.py +++ b/applications/text_classification/hierarchical/analysis/evaluate.py @@ -65,8 +65,17 @@ def read_local_dataset(path, label_list): """ with open(path, 'r', encoding='utf-8') as f: for line in f: - sentence, label = line.strip().split('\t') - labels = [label_list[l] for l in label.split(',')] + items = line.strip().split('\t') + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + label = '' + else: + sentence = ''.join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(',')] yield {"text": sentence, 'label': labels, 'label_n': label} diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/README.md b/applications/text_classification/hierarchical/deploy/paddle_serving/README.md index 8b798ecc7b34..c47bb17df6a6 100644 --- a/applications/text_classification/hierarchical/deploy/paddle_serving/README.md 
+++ b/applications/text_classification/hierarchical/deploy/paddle_serving/README.md @@ -1,6 +1,6 @@ # 基于Paddle Serving的服务化部署 -本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署基于ERNIE 2.0的层次分类部署pipeline在线服务。 +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具搭建层次分类在线服务部署。 ## 目录 - [环境准备](#环境准备) @@ -8,8 +8,24 @@ - [部署模型](#部署模型) ## 环境准备 -需要[准备PaddleNLP的运行环境]()和Paddle Serving的运行环境。 +需要准备PaddleNLP的运行环境和Paddle Serving的运行环境。 +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 + +### 安装PaddlePaddle + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +### 安装PaddleNLP + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` ### 安装Paddle Serving 安装client和serving app,用于向服务发送请求: @@ -49,15 +65,17 @@ pip install faster_tokenizer 使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 -用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址--dirname根据实际填写即可。 +用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址`dirname`,模型文件和参数名`model_filename`,`params_filename`根据实际填写即可。 ```shell python -m paddle_serving_client.convert --dirname ../../export --model_filename float32.pdmodel --params_filename float32.pdiparams ``` + 可以通过命令查参数含义: ```shell python -m paddle_serving_client.convert --help ``` + 转换成功后的目录如下: ``` paddle_serving/ @@ -94,25 +112,31 @@ serving/ # 修改模型目录为下载的模型目录或自己的模型目录: model_config: serving_server => model_config: erine-3.0-tiny/serving_server -# 修改rpc端口号为9998 -rpc_port: 9998 => rpc_port: 9998 +# 修改rpc端口号 +rpc_port: 10231 => rpc_port: 9998 # 修改使用GPU推理为使用CPU推理: device_type: 1 => device_type: 0 +#开启MKLDNN加速 +#use_mkldnn: False => use_mkldnn: True + #Fetch结果列表,以serving_client/serving_client_conf.prototxt中fetch_var的alias_name为准 fetch_list: ["linear_147.tmp_1"] => fetch_list: ["linear_75.tmp_1"] - -#开启MKLDNN加速 -#use_mkldnn: True => use_mkldnn: True ``` + ### 分类任务 #### 启动服务 修改好配置文件后,执行下面命令启动服务: ```shell -python service.py +python service.py --max_seq_length 128 --model_name "ernie-3.0-medium-zh" ``` + +可支持配置的参数: +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 + 输出打印如下: ``` [DAG] Succ init diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml b/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml index a44f9a68c33b..3133fa7c284d 100644 --- a/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml @@ -1,8 +1,8 @@ #rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 -rpc_port: 18090 +rpc_port: 7688 
#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port -http_port: 9999 +http_port: 9998 #worker_num, 最大并发数。 #当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py b/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py index 25f7cf5613ba..4ae6a8fd1d0e 100644 --- a/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py @@ -37,7 +37,7 @@ def Run(self, data): if __name__ == "__main__": - server_url = "127.0.0.1:18090" + server_url = "127.0.0.1:7688" runner = Runner(server_url) texts = [ "消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了?", "卡车超载致使跨桥侧翻,没那么简单", diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/service.py b/applications/text_classification/hierarchical/deploy/paddle_serving/service.py index e841f60fb578..f7ead1b9ddb6 100644 --- a/applications/text_classification/hierarchical/deploy/paddle_serving/service.py +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/service.py @@ -12,26 +12,48 @@ # See the License for the specific language governing permissions and # limitations under the License. -from paddle_serving_server.web_service import WebService, Op - -from numpy import array - +import argparse import logging import numpy as np +from numpy import array +from paddle_serving_server.web_service import WebService, Op + +from paddlenlp.transformers import AutoTokenizer _LOGGER = logging.getLogger() +FETCH_NAME_MAP = { + "ernie-1.0-large-zh-cw": "linear_291.tmp_1", + "ernie-3.0-xbase-zh": "linear_243.tmp_1", + "ernie-3.0-base-zh": "linear_147.tmp_1", + "ernie-3.0-medium-zh": "linear_75.tmp_1", + "ernie-3.0-mini-zh": "linear_75.tmp_1", + "ernie-3.0-micro-zh": "linear_51.tmp_1", + "ernie-3.0-nano-zh": "linear_51.tmp_1", + "ernie-2.0-base-en": "linear_147.tmp_1", + "ernie-2.0-large-en": "linear_291.tmp_1", + "ernie-m-base": "linear_147.tmp_1", + "ernie-m-large": "linear_291.tmp_1", +} + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) +args = parser.parse_args() +# yapf: enable + class Op(Op): def init_op(self): - from paddlenlp.transformers import AutoTokenizer - self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_faster=True) # Output nodes may differ from model to model # You can see the output node name in the conf.prototxt file of serving_server self.fetch_names = [ - "linear_75.tmp_1", + FETCH_NAME_MAP[args.model_name], ] def preprocess(self, input_dicts, data_id, log_id): @@ -46,16 +68,16 @@ def preprocess(self, input_dicts, data_id, log_id): # tokenizer + pad data = self.tokenizer(data, - max_length=512, + max_length=args.max_seq_length, padding=True, - truncation=True) - input_ids = data["input_ids"] - token_type_ids = data["token_type_ids"] - - return { - "input_ids": np.array(input_ids, dtype="int64"), - "token_type_ids": np.array(token_type_ids, dtype="int64") - }, False, None, "" + truncation=True, + return_position_ids=False, + return_attention_mask=False) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], + dtype="int64") + return tokenized_data, False, None, "" def postprocess(self, input_dicts, fetch_dict, data_id, log_id): diff --git a/applications/text_classification/hierarchical/deploy/predictor/README.md b/applications/text_classification/hierarchical/deploy/predictor/README.md index c1904e9a43ac..caff6498386e 100644 --- a/applications/text_classification/hierarchical/deploy/predictor/README.md +++ b/applications/text_classification/hierarchical/deploy/predictor/README.md @@ -19,6 +19,12 @@ python -m pip install onnxruntime-gpu onnx onnxconverter-common psutil python -m pip install onnxruntime psutil ``` +安装FasterTokenizer文本处理加速库(可选) +推荐安装faster_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install faster_tokenizer +``` + ## 基于GPU部署推理样例 请使用如下命令进行部署 ``` @@ -34,7 +40,7 @@ python infer.py \ 可支持配置的参数: * `model_path_prefix`:必须,待推理模型路径前缀。 -* `model_name_or_path`:选择预训练模型;默认为"ernie-3.0-medium-zh"。 +* `model_name_or_path`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 * `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 * `use_fp16`:选择是否开启FP16进行加速;默认为False。 * `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 @@ -159,8 +165,8 @@ python infer.py \ | | Micro F1(%) | Macro F1(%) | latency(ms) | | -------------------------- | ------------ | ------------- |------------- | -| ERNIE 3.0 Medium+FP32+GPU | 95.26|93.22| 2.42| -| ERNIE 3.0 Medium+FP16+GPU | 95.26|93.22| 0.79| +| ERNIE 3.0 Medium+FP32+GPU | 95.26|93.22| 1.01| +| ERNIE 3.0 Medium+FP16+GPU | 95.26|93.22| 0.38| | ERNIE 3.0 Medium+FP32+CPU | 95.26|93.22| 18.93 | | ERNIE 3.0 Medium+INT8+CPU | 95.03 | 92.87| 12.14 | diff --git a/applications/text_classification/hierarchical/deploy/predictor/infer.py 
b/applications/text_classification/hierarchical/deploy/predictor/infer.py index 776e038d82c7..303b946a2d8b 100644 --- a/applications/text_classification/hierarchical/deploy/predictor/infer.py +++ b/applications/text_classification/hierarchical/deploy/predictor/infer.py @@ -25,7 +25,8 @@ # yapf: disable parser = argparse.ArgumentParser() parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") -parser.add_argument("--model_name_or_path", default="ernie-3.0-medium-zh", type=str, help="The directory or name of model.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") parser.add_argument("--use_quantize", action='store_true', help="Whether to use quantization for acceleration, only takes effect when deploying on cpu.") @@ -41,12 +42,19 @@ def read_local_dataset(path, label_list): - """Read dataset""" label_list_dict = {label_list[i]: i for i in range(len(label_list))} with open(path, 'r', encoding='utf-8') as f: for line in f: - sentence, label = line.strip().split('\t') - labels = [label_list_dict[l] for l in label.split(',')] + items = line.strip().split('\t') + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = ''.join(items[:-1]) + label = items[-1] + labels = [label_list_dict[l] for l in label.split(',')] yield {'sentence': sentence, 'label': labels} diff --git a/applications/text_classification/hierarchical/deploy/predictor/predictor.py b/applications/text_classification/hierarchical/deploy/predictor/predictor.py index 0805014f5926..b9dd5f6a7d3f 100644 --- a/applications/text_classification/hierarchical/deploy/predictor/predictor.py +++ b/applications/text_classification/hierarchical/deploy/predictor/predictor.py @@ -101,10 +101,6 @@ def __init__(self, onnx_model, sess_options=sess_options, providers=['CPUExecutionProvider']) - input_name1 = self.predictor.get_inputs()[1].name - input_name2 = self.predictor.get_inputs()[0].name - self.input_handles = [input_name1, input_name2] - logger.info(">>> [InferBackend] Engine Created ...") def dynamic_quantize(self, input_float_model, dynamic_quantized_model): @@ -143,12 +139,15 @@ def preprocess(self, input_data: list): data = self.tokenizer(input_data, max_length=self.max_seq_length, padding=True, - truncation=True) + truncation=True, + return_position_ids=False, + return_attention_mask=False) + tokenized_data = {} + for tokenizer_key in data: - return { - "input_ids": np.array(data["input_ids"], dtype="int64"), - "token_type_ids": np.array(data["token_type_ids"], dtype="int64") - } + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], + dtype="int64") + return tokenized_data def postprocess(self, infer_data): threshold = 0.5 @@ -178,17 +177,13 @@ def infer_batch(self, preprocess_result): infer_result = None for i in range(0, sample_num, 
self.batch_size): batch_size = min(self.batch_size, sample_num - i) - input_ids = [ - preprocess_result["input_ids"][i + j] for j in range(batch_size) - ] - token_type_ids = [ - preprocess_result["token_type_ids"][i + j] - for j in range(batch_size) - ] - preprocess_result_batch = { - "input_ids": input_ids, - "token_type_ids": token_type_ids - } + preprocess_result_batch = {} + for tokenizer_key in preprocess_result: + preprocess_result_batch[tokenizer_key] = [ + preprocess_result[tokenizer_key][i + j] + for j in range(batch_size) + ] + result = self.infer(preprocess_result_batch) if infer_result is None: infer_result = result diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt index 89e7c54bb2ea..0fb1417cba37 100755 --- a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt @@ -16,7 +16,7 @@ output [ { name: "linear_75.tmp_1" data_type: TYPE_FP32 - dims: [ 141 ] + dims: [ 74 ] } ] diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt index a7a17d8f0121..fbeda7129f92 100644 --- a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt @@ -6,7 +6,7 @@ input [ { name: "POST_INPUT" data_type: TYPE_FP32 - dims: [ 141 ] + dims: [ 74 ] } ] diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py index 896a8e75fa1d..2ec5d430f270 100644 --- a/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py @@ -33,7 +33,7 @@ def initialize(self, args): * model_version: Model version * model_name: Model name """ - self.tokenizer = AutoTokenizer.from_pretrained("ernie-2.0-base-en", + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_faster=True) # You must parse model_config. 
JSON string is not parsed here self.model_config = json.loads(args['model_config']) @@ -72,7 +72,6 @@ def execute(self, requests): be the same as `requests` """ responses = [] - # print("num:", len(requests), flush=True) for request in requests: data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) @@ -86,9 +85,6 @@ def execute(self, requests): token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) - # print("input_ids:", input_ids) - # print("token_type_ids:", token_type_ids) - out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) inference_response = pb_utils.InferenceResponse( diff --git a/applications/text_classification/hierarchical/export_model.py b/applications/text_classification/hierarchical/export_model.py index d05a8aa937c7..ea7a94febba5 100644 --- a/applications/text_classification/hierarchical/export_model.py +++ b/applications/text_classification/hierarchical/export_model.py @@ -20,6 +20,7 @@ # yapf: disable parser = argparse.ArgumentParser() +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') parser.add_argument("--params_path", type=str, default='./checkpoint/', help="The path to model parameters to be loaded.") parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") args = parser.parse_args() @@ -29,16 +30,23 @@ model = AutoModelForSequenceClassification.from_pretrained(args.params_path) model.eval() - - # Convert to static graph with specific input description - model = paddle.jit.to_static( - model, - input_spec=[ + if args.multilingual: + input_spec = [ paddle.static.InputSpec(shape=[None, None], - dtype="int64"), # input_ids + dtype="int64", + name='input_ids') + ] + else: + input_spec = [ paddle.static.InputSpec(shape=[None, None], - dtype="int64") # segment_ids - ]) + dtype="int64", + name='input_ids'), + paddle.static.InputSpec(shape=[None, None], + dtype="int64", + name='token_type_ids') + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) # Save in static graph model. save_path = os.path.join(args.output_path, "float32") diff --git a/applications/text_classification/hierarchical/train.py b/applications/text_classification/hierarchical/train.py index 0e3a7e48ef91..b0c83b45b4f8 100644 --- a/applications/text_classification/hierarchical/train.py +++ b/applications/text_classification/hierarchical/train.py @@ -40,10 +40,10 @@ parser.add_argument("--save_dir", default="./checkpoint", type=str, help="The output directory where the model checkpoints will be written.") parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", - choices=["ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) + choices=["ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") -parser.add_argument("--epochs", default=100, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") parser.add_argument('--early_stop', action='store_true', help='Epoch before early stop.') parser.add_argument('--early_stop_nums', type=int, default=3, help='Number of epoch before early stop.') parser.add_argument("--logging_steps", default=5, type=int, help="The interval steps to logging.") diff --git a/applications/text_classification/hierarchical/utils.py b/applications/text_classification/hierarchical/utils.py index b61406c55cf2..2e2c54657e49 100644 --- a/applications/text_classification/hierarchical/utils.py +++ b/applications/text_classification/hierarchical/utils.py @@ -91,7 +91,13 @@ def read_local_dataset(path, label_list=None, is_test=False): yield {'sentence': sentence} else: items = line.strip().split('\t') - sentence = ''.join(items[:-1]) - label = items[-1] - labels = [label_list[l] for l in label.split(',')] + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = ''.join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(',')] yield {'sentence': sentence, 'label': labels} diff --git a/applications/text_classification/multi_class/README.md b/applications/text_classification/multi_class/README.md index 57895c9c95a2..e4a45a80760a 100644 --- a/applications/text_classification/multi_class/README.md +++ b/applications/text_classification/multi_class/README.md @@ -68,7 +68,7 @@ rm KUAKE_QIC.tar.gz - python >= 3.6 - paddlepaddle >= 2.3 -- paddlenlp >= 2.3.4 +- paddlenlp >= 2.4 - scikit-learn >= 1.0.2 **安装PaddlePaddle:** @@ -80,7 +80,7 @@ rm KUAKE_QIC.tar.gz 安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 ```shell -python3 -m pip install paddlenlp==2.3.4 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple ``` @@ -203,21 +203,23 @@ python train.py \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 100 ``` 如果在CPU环境下训练,可以指定`nproc_per_node`参数进行多核训练: ```shell python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" train.py \ --dataset_dir "data" \ - --device "gpu" \ + --device "cpu" \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 
100 ``` -如果在GPU环境中使用,可以指定`gpus`参数进行单卡/多卡训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况。 +如果在GPU环境中使用,可以指定`gpus`参数进行单卡/多卡训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况: ```shell unset CUDA_VISIBLE_DEVICES @@ -227,17 +229,17 @@ python -m paddle.distributed.launch --gpus "0" train.py \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 100 ``` - 可支持配置的参数: * `device`: 选用什么设备进行训练,选择cpu、gpu、xpu、npu。如使用gpu训练,可使用参数--gpus指定GPU卡号;默认为"gpu"。 * `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt,dev.txt和label.txt文件;默认为None。 * `save_dir`:保存训练模型的目录;默认保存在当前目录checkpoint文件夹下。 * `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 -* `model_name`:选择预训练模型,可选"ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-1.0-large-zh-cw","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh"。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh"。 * `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 * `learning_rate`:训练最大学习率;默认为3e-5。 * `epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 @@ -266,8 +268,9 @@ checkpoint/ **NOTE:** * 如需恢复模型训练,则可以设置 `init_from_ckpt` , 如 `init_from_ckpt=checkpoint/model_state.pdparams` 。 -* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en",更多可选模型可参考[Transformer预训练模型](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer)。 -* 英文和中文以外文本分类任务建议使用多语言预训练模型"ernie-m-base","ernie-m-large", 多语言模型暂不支持文本分类模型部署,相关功能正在加速开发中。 +* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 + #### 2.4.2 训练评估与模型优化 训练后的模型我们可以使用 [模型分析模块](./analysis) 对每个类别分别进行评估,并输出预测错误样本(bad case),默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: @@ -335,8 +338,13 @@ python predict.py --device "gpu" --max_seq_length 128 --batch_size 32 --dataset_ python export_model.py --params_path ./checkpoint/ --output_path ./export ``` -可支持配置的参数: +如果使用ERNIE M作为预训练模型,运行方式: +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export --multilingual +``` +可支持配置的参数: +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 * `params_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 * `output_path`:静态图图保存的参数路径;默认为"./export"。 @@ -397,9 +405,9 @@ python prune.py \ ```text prune/ ├── width_mult_0.75 -│   ├── float32.pdiparams -│   ├── float32.pdiparams.info -│   ├── float32.pdmodel +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel │   ├── model_state.pdparams │   └── model_config.json └── ... @@ -413,7 +421,7 @@ prune/ 3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 - +4. 
**压缩API暂不支持多语言预训练模型ERNIE-M**,相关功能正在加紧开发中。 #### 2.5.3 部署方案 @@ -456,6 +464,7 @@ PaddleNLP提供ERNIE 3.0 全系列轻量化模型,对于中文训练任务可 | model_name | 模型结构 |Accuracy(%) | latency(ms) | | -------------------------- | ------------ | ------------ | ------------ | +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|82.30| 5.62 | |ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|82.25| 2.07 | |ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|81.79| 1.07| |ERNIE 3.0 Mini |6-layer, 384-hidden, 12-heads|79.80| 0.38| diff --git a/applications/text_classification/multi_class/deploy/paddle_serving/README.md b/applications/text_classification/multi_class/deploy/paddle_serving/README.md index cb99994b6a71..3413181ef73d 100644 --- a/applications/text_classification/multi_class/deploy/paddle_serving/README.md +++ b/applications/text_classification/multi_class/deploy/paddle_serving/README.md @@ -1,6 +1,6 @@ # 基于Paddle Serving的服务化部署 -本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署基于ERNIE 3.0的多分类部署pipeline在线服务。 +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具搭建多分类在线服务部署。 ## 目录 - [环境准备](#环境准备) @@ -8,8 +8,24 @@ - [部署模型](#部署模型) ## 环境准备 -需要Paddle Serving的运行环境。 +需要准备PaddleNLP的运行环境和Paddle Serving的运行环境。 +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 + +### 安装PaddlePaddle + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +### 安装PaddleNLP + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` ### 安装Paddle Serving 安装client和serving app,用于向服务发送请求: ```shell @@ -47,7 +63,7 @@ pip install faster_tokenizer 使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 -用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址--dirname根据实际填写即可。 +用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址`dirname`,模型文件和参数名`model_filename`,`params_filename`根据实际填写即可。 ```shell python -m paddle_serving_client.convert --dirname ../../export --model_filename float32.pdmodel --params_filename float32.pdiparams @@ -92,25 +108,30 @@ serving/ # 修改模型目录为下载的模型目录或自己的模型目录: model_config: serving_server => model_config: erine-3.0-tiny/serving_server -# 修改rpc端口号为9998 -rpc_port: 9998 => rpc_port: 9998 +# 修改rpc端口号 +rpc_port: 10231 => rpc_port: 9998 # 修改使用GPU推理为使用CPU推理: device_type: 1 => device_type: 0 -#Fetch结果列表,以serving_client/serving_client_conf.prototxt中fetch_var的alias_name为准 -fetch_list: ["linear_75.tmp_1"] => fetch_list: ["linear_147.tmp_1"] - #开启MKLDNN加速 -#use_mkldnn: True => use_mkldnn: True +#use_mkldnn: False => use_mkldnn: True + +#Fetch结果列表,以serving_client/serving_client_conf.prototxt中fetch_var的alias_name为准 +fetch_list: ["linear_147.tmp_1"] => fetch_list: ["linear_75.tmp_1"] ``` ### 分类任务 #### 启动服务 修改好配置文件后,执行下面命令启动服务: ```shell -python service.py +python service.py --max_seq_length 128 --model_name "ernie-3.0-medium-zh" ``` + +可支持配置的参数: +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* 
`model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 + 输出打印如下: ``` [DAG] Succ init diff --git a/applications/text_classification/multi_class/deploy/paddle_serving/config.yml b/applications/text_classification/multi_class/deploy/paddle_serving/config.yml index a44f9a68c33b..62a1a3056b82 100644 --- a/applications/text_classification/multi_class/deploy/paddle_serving/config.yml +++ b/applications/text_classification/multi_class/deploy/paddle_serving/config.yml @@ -2,7 +2,7 @@ rpc_port: 18090 #http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port -http_port: 9999 +http_port: 9878 #worker_num, 最大并发数。 #当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG diff --git a/applications/text_classification/multi_class/deploy/paddle_serving/service.py b/applications/text_classification/multi_class/deploy/paddle_serving/service.py index caa949ac538c..ca889858c720 100644 --- a/applications/text_classification/multi_class/deploy/paddle_serving/service.py +++ b/applications/text_classification/multi_class/deploy/paddle_serving/service.py @@ -12,26 +12,48 @@ # See the License for the specific language governing permissions and # limitations under the License. -from paddle_serving_server.web_service import WebService, Op - -from numpy import array - +import argparse import logging import numpy as np +from numpy import array +from paddle_serving_server.web_service import WebService, Op + +from paddlenlp.transformers import AutoTokenizer _LOGGER = logging.getLogger() +FETCH_NAME_MAP = { + "ernie-1.0-large-zh-cw": "linear_291.tmp_1", + "ernie-3.0-xbase-zh": "linear_243.tmp_1", + "ernie-3.0-base-zh": "linear_147.tmp_1", + "ernie-3.0-medium-zh": "linear_75.tmp_1", + "ernie-3.0-mini-zh": "linear_75.tmp_1", + "ernie-3.0-micro-zh": "linear_51.tmp_1", + "ernie-3.0-nano-zh": "linear_51.tmp_1", + "ernie-2.0-base-en": "linear_147.tmp_1", + "ernie-2.0-large-en": "linear_291.tmp_1", + "ernie-m-base": "linear_147.tmp_1", + "ernie-m-large": "linear_291.tmp_1", +} + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) +args = parser.parse_args() +# yapf: enable + class Op(Op): def init_op(self): - from paddlenlp.transformers import AutoTokenizer - self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_faster=True) # Output nodes may differ from model to model # You can see the output node name in the conf.prototxt file of serving_server self.fetch_names = [ - "linear_75.tmp_1", + FETCH_NAME_MAP[args.model_name], ] def preprocess(self, input_dicts, data_id, log_id): @@ -46,16 +68,17 @@ def preprocess(self, input_dicts, data_id, log_id): # tokenizer + pad data = self.tokenizer(data, - max_length=128, + max_length=args.max_seq_length, padding=True, - truncation=True) - input_ids = data["input_ids"] - token_type_ids = data["token_type_ids"] - - return { - "input_ids": np.array(input_ids, dtype="int64"), - "token_type_ids": np.array(token_type_ids, dtype="int64") - }, False, None, "" + truncation=True, + return_position_ids=False, + return_attention_mask=False) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], + dtype="int64") + + return tokenized_data, False, None, "" def postprocess(self, input_dicts, fetch_dict, data_id, log_id): diff --git a/applications/text_classification/multi_class/deploy/predictor/README.md b/applications/text_classification/multi_class/deploy/predictor/README.md index 6b4c31b656d7..8959571cb6ab 100644 --- a/applications/text_classification/multi_class/deploy/predictor/README.md +++ b/applications/text_classification/multi_class/deploy/predictor/README.md @@ -20,7 +20,11 @@ python -m pip install onnxruntime-gpu onnx onnxconverter-common python -m pip install onnxruntime ``` - +安装FasterTokenizer文本处理加速库(可选) +推荐安装faster_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install faster_tokenizer +``` ## 基于GPU部署推理样例 请使用如下命令进行部署 @@ -37,7 +41,7 @@ python infer.py \ 可支持配置的参数: * `model_path_prefix`:必须,待推理模型路径前缀。 -* `model_name_or_path`:选择预训练模型;默认为"ernie-3.0-medium-zh"。 +* `model_name_or_path`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 * `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 * `use_fp16`:选择是否开启FP16进行加速;默认为False。 * `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 diff --git a/applications/text_classification/multi_class/deploy/predictor/infer.py b/applications/text_classification/multi_class/deploy/predictor/infer.py index 8ed68ac6897c..591bfd8254a4 100644 --- a/applications/text_classification/multi_class/deploy/predictor/infer.py +++ b/applications/text_classification/multi_class/deploy/predictor/infer.py @@ -25,7 +25,8 @@ # yapf: disable parser = argparse.ArgumentParser() parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") 
-parser.add_argument("--model_name_or_path", default="ernie-3.0-medium-zh", type=str, help="The directory or name of model.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") parser.add_argument("--use_quantize", action='store_true', help="Whether to use quantization for acceleration, only takes effect when deploying on cpu.") diff --git a/applications/text_classification/multi_class/deploy/predictor/predictor.py b/applications/text_classification/multi_class/deploy/predictor/predictor.py index d70cf1d2651a..4aca23b9c00a 100644 --- a/applications/text_classification/multi_class/deploy/predictor/predictor.py +++ b/applications/text_classification/multi_class/deploy/predictor/predictor.py @@ -101,9 +101,6 @@ def __init__(self, onnx_model, sess_options=sess_options, providers=['CPUExecutionProvider']) - input_name1 = self.predictor.get_inputs()[1].name - input_name2 = self.predictor.get_inputs()[0].name - self.input_handles = [input_name1, input_name2] logger.info(">>> [InferBackend] Engine Created ...") @@ -135,12 +132,14 @@ def preprocess(self, input_data: list): data = self.tokenizer(input_data, max_length=self.max_seq_length, padding=True, - truncation=True) - - return { - "input_ids": np.array(data["input_ids"], dtype="int64"), - "token_type_ids": np.array(data["token_type_ids"], dtype="int64") - } + truncation=True, + return_position_ids=False, + return_attention_mask=False) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], + dtype="int64") + return tokenized_data def postprocess(self, infer_data): @@ -160,17 +159,13 @@ def infer_batch(self, preprocess_result): infer_result = None for i in range(0, sample_num, self.batch_size): batch_size = min(self.batch_size, sample_num - i) - input_ids = [ - preprocess_result["input_ids"][i + j] for j in range(batch_size) - ] - token_type_ids = [ - preprocess_result["token_type_ids"][i + j] - for j in range(batch_size) - ] - preprocess_result_batch = { - "input_ids": input_ids, - "token_type_ids": token_type_ids - } + preprocess_result_batch = {} + for tokenizer_key in preprocess_result: + preprocess_result_batch[tokenizer_key] = [ + preprocess_result[tokenizer_key][i + j] + for j in range(batch_size) + ] + result = self.infer(preprocess_result_batch) if infer_result is None: infer_result = result diff --git a/applications/text_classification/multi_class/export_model.py b/applications/text_classification/multi_class/export_model.py index d05a8aa937c7..ea7a94febba5 100644 --- a/applications/text_classification/multi_class/export_model.py +++ b/applications/text_classification/multi_class/export_model.py @@ -20,6 +20,7 @@ # yapf: disable parser = argparse.ArgumentParser() +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') parser.add_argument("--params_path", type=str, 
default='./checkpoint/', help="The path to model parameters to be loaded.") parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") args = parser.parse_args() @@ -29,16 +30,23 @@ model = AutoModelForSequenceClassification.from_pretrained(args.params_path) model.eval() - - # Convert to static graph with specific input description - model = paddle.jit.to_static( - model, - input_spec=[ + if args.multilingual: + input_spec = [ paddle.static.InputSpec(shape=[None, None], - dtype="int64"), # input_ids + dtype="int64", + name='input_ids') + ] + else: + input_spec = [ paddle.static.InputSpec(shape=[None, None], - dtype="int64") # segment_ids - ]) + dtype="int64", + name='input_ids'), + paddle.static.InputSpec(shape=[None, None], + dtype="int64", + name='token_type_ids') + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) # Save in static graph model. save_path = os.path.join(args.output_path, "float32") diff --git a/applications/text_classification/multi_class/train.py b/applications/text_classification/multi_class/train.py index b709a1473438..0d27625a63ed 100644 --- a/applications/text_classification/multi_class/train.py +++ b/applications/text_classification/multi_class/train.py @@ -40,10 +40,10 @@ parser.add_argument("--save_dir", default="./checkpoint", type=str, help="The output directory where the model checkpoints will be written.") parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", - choices=["ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) + choices=["ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") -parser.add_argument("--epochs", default=100, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") parser.add_argument('--early_stop', action='store_true', help='Epoch before early stop.') parser.add_argument('--early_stop_nums', type=int, default=3, help='Number of epoch before early stop.') parser.add_argument("--logging_steps", default=5, type=int, help="The interval steps to logging.") diff --git a/applications/text_classification/multi_label/README.md b/applications/text_classification/multi_label/README.md index a3b948990321..e6f8dee3ea26 100644 --- a/applications/text_classification/multi_label/README.md +++ b/applications/text_classification/multi_label/README.md @@ -67,7 +67,7 @@ rm divorce.tar.gz - python >= 3.6 - paddlepaddle >= 2.3 -- paddlenlp >= 2.3.4 +- paddlenlp >= 2.4 - scikit-learn >= 1.0.2 **安装PaddlePaddle:** @@ -79,7 +79,7 @@ rm divorce.tar.gz 安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 
-i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 ```shell -python3 -m pip install paddlenlp==2.3.4 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple ``` @@ -200,7 +200,8 @@ python train.py \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 100 ``` @@ -208,11 +209,12 @@ python train.py \ ```shell python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" train.py \ --dataset_dir "data" \ - --device "gpu" \ + --device "cpu" \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 100 ``` 如果在GPU环境中使用,可以指定`gpus`参数进行单卡/多卡训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况。 @@ -225,7 +227,8 @@ python -m paddle.distributed.launch --gpus "0" train.py \ --max_seq_length 128 \ --model_name "ernie-3.0-medium-zh" \ --batch_size 32 \ - --early_stop + --early_stop \ + --epochs 100 ``` 可支持配置的参数: @@ -233,7 +236,7 @@ python -m paddle.distributed.launch --gpus "0" train.py \ * `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt,dev.txt和label.txt文件;默认为None。 * `save_dir`:保存训练模型的目录;默认保存在当前目录checkpoint文件夹下。 * `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 -* `model_name`:选择预训练模型,可选"ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-1.0-large-zh-cw","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh"。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh"。 * `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 * `learning_rate`:训练最大学习率;默认为3e-5。 * `epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 @@ -261,8 +264,9 @@ checkpoint/ **NOTE:** * 如需恢复模型训练,则可以设置 `init_from_ckpt` , 如 `init_from_ckpt=checkpoint/model_state.pdparams` 。 -* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en",更多可选模型可参考[Transformer预训练模型](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer)。 -* 英文和中文以外文本分类任务建议使用多语言预训练模型"ernie-m-base","ernie-m-large", 多语言模型暂不支持文本分类模型部署,相关功能正在加速开发中。 +* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 + #### 2.4.2 训练评估与模型优化 训练后的模型我们可以使用 [模型分析模块](./analysis) 对每个类别分别进行评估,并输出预测错误样本(bad case),默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: @@ -331,9 +335,13 @@ python predict.py --device "gpu" --max_seq_length 128 --batch_size 32 --dataset_ ```shell python export_model.py --params_path ./checkpoint/ --output_path ./export ``` +如果使用ERNIE M作为预训练模型,运行方式: +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export --multilingual +``` 可支持配置的参数: - +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 * `params_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 * `output_path`:静态图图保存的参数路径;默认为"./export"。 @@ -393,9 +401,9 @@ python prune.py \ ```text prune/ ├── width_mult_0.75 -│   ├── float32.pdiparams -│   ├── 
float32.pdiparams.info -│   ├── float32.pdmodel +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel │   ├── model_state.pdparams │   └── model_config.json └── ... @@ -409,6 +417,7 @@ prune/ 3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 +4. **压缩API暂不支持多语言预训练模型ERNIE-M**,相关功能正在加紧开发中。 #### 2.5.3 部署方案 @@ -450,6 +459,7 @@ prune/ | model_name | 模型结构 |Micro F1(%) | Macro F1(%) | latency(ms) | | -------------------------- | ------------ | ------------ | ------------ |------------ | +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|91.14|81.68 |5.66 | |ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|90.38|80.14| 2.70 | |ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|90.57|79.36| 1.46| |ERNIE 3.0 Mini |6-layer, 384-hidden, 12-heads|89.27|76.78| 0.56| diff --git a/applications/text_classification/multi_label/analysis/evaluate.py b/applications/text_classification/multi_label/analysis/evaluate.py index 24fb020f5b0f..b79127c70426 100644 --- a/applications/text_classification/multi_label/analysis/evaluate.py +++ b/applications/text_classification/multi_label/analysis/evaluate.py @@ -65,8 +65,17 @@ def read_local_dataset(path, label_list): """ with open(path, 'r', encoding='utf-8') as f: for line in f: - sentence, label = line.strip().split('\t') - labels = [label_list[l] for l in label.split(',')] + items = line.strip().split('\t') + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + label = '' + else: + sentence = ''.join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(',')] yield {"text": sentence, 'label': labels, 'label_n': label} diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/README.md b/applications/text_classification/multi_label/deploy/paddle_serving/README.md index 9550c2957eb0..a999c4716e08 100644 --- a/applications/text_classification/multi_label/deploy/paddle_serving/README.md +++ b/applications/text_classification/multi_label/deploy/paddle_serving/README.md @@ -1,6 +1,6 @@ # 基于Paddle Serving的服务化部署 -本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署基于ERNIE 3.0的分类部署pipeline在线服务。 +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具搭建多标签在线服务部署。 ## 目录 - [环境准备](#环境准备) @@ -8,8 +8,24 @@ - [部署模型](#部署模型) ## 环境准备 -需要[准备PaddleNLP的运行环境]()和Paddle Serving的运行环境。 +需要准备PaddleNLP的运行环境和Paddle Serving的运行环境。 +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 + +### 安装PaddlePaddle + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +### 安装PaddleNLP + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` ### 安装Paddle Serving 安装client和serving app,用于向服务发送请求: ``` @@ -46,11 +62,12 @@ pip install faster_tokenizer 使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 -用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址--dirname根据实际填写即可。 
+用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址`dirname`,模型文件和参数名`model_filename`,`params_filename`根据实际填写即可。 ```shell python -m paddle_serving_client.convert --dirname ../../export --model_filename float32.pdmodel --params_filename float32.pdiparams ``` + 可以通过命令查参数含义: ```shell python -m paddle_serving_client.convert --help @@ -91,24 +108,30 @@ serving/ # 修改模型目录为下载的模型目录或自己的模型目录: model_config: serving_server => model_config: erine-3.0-tiny/serving_server -# 修改rpc端口号为9998: -rpc_port: 9998 => rpc_port: 9998 +# 修改rpc端口号 +rpc_port: 10231 => rpc_port: 9998 # 修改使用GPU推理为使用CPU推理: device_type: 1 => device_type: 0 +#开启MKLDNN加速 +#use_mkldnn: False => use_mkldnn: True + #Fetch结果列表,以serving_client/serving_client_conf.prototxt中fetch_var的alias_name为准 fetch_list: ["linear_147.tmp_1"] => fetch_list: ["linear_75.tmp_1"] - -#开启MKLDNN加速 -#use_mkldnn: True => use_mkldnn: True ``` + ### 分类任务 #### 启动服务 修改好配置文件后,执行下面命令启动服务: ```shell -python service.py +python service.py --max_seq_length 128 --model_name "ernie-3.0-medium-zh" ``` + +可支持配置的参数: +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 + 输出打印如下: ``` [DAG] Succ init diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/config.yml b/applications/text_classification/multi_label/deploy/paddle_serving/config.yml index a44f9a68c33b..564dcf27ab11 100644 --- a/applications/text_classification/multi_label/deploy/paddle_serving/config.yml +++ b/applications/text_classification/multi_label/deploy/paddle_serving/config.yml @@ -2,7 +2,7 @@ rpc_port: 18090 #http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port -http_port: 9999 +http_port: 5594 #worker_num, 最大并发数。 #当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/service.py b/applications/text_classification/multi_label/deploy/paddle_serving/service.py index 71bb42a58596..4a37c14ce97e 100644 --- a/applications/text_classification/multi_label/deploy/paddle_serving/service.py +++ b/applications/text_classification/multi_label/deploy/paddle_serving/service.py @@ -12,26 +12,48 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-from paddle_serving_server.web_service import WebService, Op - -from numpy import array - +import argparse import logging import numpy as np +from numpy import array +from paddle_serving_server.web_service import WebService, Op + +from paddlenlp.transformers import AutoTokenizer _LOGGER = logging.getLogger() +FETCH_NAME_MAP = { + "ernie-1.0-large-zh-cw": "linear_291.tmp_1", + "ernie-3.0-xbase-zh": "linear_243.tmp_1", + "ernie-3.0-base-zh": "linear_147.tmp_1", + "ernie-3.0-medium-zh": "linear_75.tmp_1", + "ernie-3.0-mini-zh": "linear_75.tmp_1", + "ernie-3.0-micro-zh": "linear_51.tmp_1", + "ernie-3.0-nano-zh": "linear_51.tmp_1", + "ernie-2.0-base-en": "linear_147.tmp_1", + "ernie-2.0-large-en": "linear_291.tmp_1", + "ernie-m-base": "linear_147.tmp_1", + "ernie-m-large": "linear_291.tmp_1", +} + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) +args = parser.parse_args() +# yapf: enable + class Op(Op): def init_op(self): - from paddlenlp.transformers import AutoTokenizer - self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_faster=True) # Output nodes may differ from model to model # You can see the output node name in the conf.prototxt file of serving_server self.fetch_names = [ - "linear_75.tmp_1", + FETCH_NAME_MAP[args.model_name], ] def preprocess(self, input_dicts, data_id, log_id): @@ -46,15 +68,17 @@ def preprocess(self, input_dicts, data_id, log_id): # tokenizer + pad data = self.tokenizer(data, - max_length=512, + max_length=args.max_seq_length, padding=True, - truncation=True) - input_ids = data["input_ids"] - token_type_ids = data["token_type_ids"] - return { - "input_ids": np.array(input_ids, dtype="int64"), - "token_type_ids": np.array(token_type_ids, dtype="int64") - }, False, None, "" + truncation=True, + return_position_ids=False, + return_attention_mask=False) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], + dtype="int64") + + return tokenized_data, False, None, "" def postprocess(self, input_dicts, fetch_dict, data_id, log_id): diff --git a/applications/text_classification/multi_label/deploy/predictor/README.md b/applications/text_classification/multi_label/deploy/predictor/README.md index a8842c1bc3a0..4c7ff45c8aab 100644 --- a/applications/text_classification/multi_label/deploy/predictor/README.md +++ b/applications/text_classification/multi_label/deploy/predictor/README.md @@ -20,6 +20,11 @@ python -m pip install onnxruntime-gpu onnx onnxconverter-common python -m pip install onnxruntime ``` +安装FasterTokenizer文本处理加速库(可选) +推荐安装faster_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install faster_tokenizer +``` ## 基于GPU部署推理样例 请使用如下命令进行部署 @@ -36,7 +41,7 @@ python infer.py \ 可支持配置的参数: * `model_path_prefix`:必须,待推理模型路径前缀。 -* `model_name_or_path`:选择预训练模型;默认为"ernie-3.0-medium-zh"。 +* 
`model_name_or_path`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 * `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 * `use_fp16`:选择是否开启FP16进行加速;默认为False。 * `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 diff --git a/applications/text_classification/multi_label/deploy/predictor/infer.py b/applications/text_classification/multi_label/deploy/predictor/infer.py index 3697e7d79a02..303b946a2d8b 100644 --- a/applications/text_classification/multi_label/deploy/predictor/infer.py +++ b/applications/text_classification/multi_label/deploy/predictor/infer.py @@ -25,7 +25,8 @@ # yapf: disable parser = argparse.ArgumentParser() parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") -parser.add_argument("--model_name_or_path", default="ernie-3.0-medium-zh", type=str, help="The directory or name of model.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") parser.add_argument("--use_quantize", action='store_true', help="Whether to use quantization for acceleration, only takes effect when deploying on cpu.") @@ -44,8 +45,16 @@ def read_local_dataset(path, label_list): label_list_dict = {label_list[i]: i for i in range(len(label_list))} with open(path, 'r', encoding='utf-8') as f: for line in f: - sentence, label = line.strip().split('\t') - labels = [label_list_dict[l] for l in label.split(',')] + items = line.strip().split('\t') + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = ''.join(items[:-1]) + label = items[-1] + labels = [label_list_dict[l] for l in label.split(',')] yield {'sentence': sentence, 'label': labels} diff --git a/applications/text_classification/multi_label/deploy/predictor/predictor.py b/applications/text_classification/multi_label/deploy/predictor/predictor.py index 36bbc6564285..b423ae42bc80 100644 --- a/applications/text_classification/multi_label/deploy/predictor/predictor.py +++ b/applications/text_classification/multi_label/deploy/predictor/predictor.py @@ -101,9 +101,6 @@ def __init__(self, onnx_model, sess_options=sess_options, providers=['CPUExecutionProvider']) - input_name1 = self.predictor.get_inputs()[1].name - input_name2 = self.predictor.get_inputs()[0].name - self.input_handles = [input_name1, input_name2] logger.info(">>> [InferBackend] Engine Created ...") @@ -143,12 +140,14 @@ def preprocess(self, input_data: list): data = self.tokenizer(input_data, max_length=self.max_seq_length, padding=True, - truncation=True) - - return { - "input_ids": np.array(data["input_ids"], dtype="int64"), - "token_type_ids": 
np.array(data["token_type_ids"], dtype="int64") - } + truncation=True, + return_position_ids=False, + return_attention_mask=False) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], + dtype="int64") + return tokenized_data def postprocess(self, infer_data): threshold = 0.5 @@ -178,17 +177,13 @@ def infer_batch(self, preprocess_result): infer_result = None for i in range(0, sample_num, self.batch_size): batch_size = min(self.batch_size, sample_num - i) - input_ids = [ - preprocess_result["input_ids"][i + j] for j in range(batch_size) - ] - token_type_ids = [ - preprocess_result["token_type_ids"][i + j] - for j in range(batch_size) - ] - preprocess_result_batch = { - "input_ids": input_ids, - "token_type_ids": token_type_ids - } + preprocess_result_batch = {} + for tokenizer_key in preprocess_result: + preprocess_result_batch[tokenizer_key] = [ + preprocess_result[tokenizer_key][i + j] + for j in range(batch_size) + ] + result = self.infer(preprocess_result_batch) if infer_result is None: infer_result = result diff --git a/applications/text_classification/multi_label/export_model.py b/applications/text_classification/multi_label/export_model.py index c551da35a67a..ea7a94febba5 100644 --- a/applications/text_classification/multi_label/export_model.py +++ b/applications/text_classification/multi_label/export_model.py @@ -20,27 +20,33 @@ # yapf: disable parser = argparse.ArgumentParser() +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') parser.add_argument("--params_path", type=str, default='./checkpoint/', help="The path to model parameters to be loaded.") parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") args = parser.parse_args() # yapf: enable -args = parser.parse_args() - if __name__ == "__main__": model = AutoModelForSequenceClassification.from_pretrained(args.params_path) model.eval() - - # Convert to static graph with specific input description - model = paddle.jit.to_static( - model, - input_spec=[ + if args.multilingual: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], + dtype="int64", + name='input_ids') + ] + else: + input_spec = [ paddle.static.InputSpec(shape=[None, None], - dtype="int64"), # input_ids + dtype="int64", + name='input_ids'), paddle.static.InputSpec(shape=[None, None], - dtype="int64") # segment_ids - ]) + dtype="int64", + name='token_type_ids') + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) # Save in static graph model. save_path = os.path.join(args.output_path, "float32") diff --git a/applications/text_classification/multi_label/train.py b/applications/text_classification/multi_label/train.py index 9bcf95f6bcde..7855ede77249 100644 --- a/applications/text_classification/multi_label/train.py +++ b/applications/text_classification/multi_label/train.py @@ -40,10 +40,10 @@ parser.add_argument("--save_dir", default="./checkpoint", type=str, help="The output directory where the model checkpoints will be written.") parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", - choices=["ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) + choices=["ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"]) parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") -parser.add_argument("--epochs", default=100, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") parser.add_argument('--early_stop', action='store_true', help='Epoch before early stop.') parser.add_argument('--early_stop_nums', type=int, default=3, help='Number of epoch before early stop.') parser.add_argument("--logging_steps", default=5, type=int, help="The interval steps to logging.") diff --git a/applications/text_classification/multi_label/utils.py b/applications/text_classification/multi_label/utils.py index b61406c55cf2..2e2c54657e49 100644 --- a/applications/text_classification/multi_label/utils.py +++ b/applications/text_classification/multi_label/utils.py @@ -91,7 +91,13 @@ def read_local_dataset(path, label_list=None, is_test=False): yield {'sentence': sentence} else: items = line.strip().split('\t') - sentence = ''.join(items[:-1]) - label = items[-1] - labels = [label_list[l] for l in label.split(',')] + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = ''.join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(',')] yield {'sentence': sentence, 'label': labels}
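
为便于理解上面 `read_local_dataset` 与推理预处理两处改动的通用写法,下面给出一个最小化的示意脚本。脚本名 `demo_preprocess.py`、示例标签集合与示例文本均为假设,并非仓库中的实际文件;它只演示两个思路:一是标签列缺失(如待预测数据)时的容错解析,二是按 tokenizer 实际返回的字段组织输入,从而同时兼容带 `token_type_ids` 的 ERNIE 3.0 与只返回 `input_ids` 的 ERNIE-M。

```python
# demo_preprocess.py —— 示意脚本(假设),非仓库实际文件
import numpy as np

from paddlenlp.transformers import AutoTokenizer


def read_local_dataset(path, label_list):
    """逐行读取 "文本\t标签1,标签2" 格式的数据,允许某些行只有文本、没有标签列。

    label_list 为 标签名 -> 标签id 的字典,例如 {"婚后有子女": 0, "限制行为能力子女抚养": 1}。
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            items = line.strip().split("\t")
            if len(items) == 1:
                # 只有文本、缺少标签列(例如待预测数据)
                sentence, labels = items[0], []
            else:
                sentence = "".join(items[:-1])
                labels = [label_list[label] for label in items[-1].split(",")]
            yield {"sentence": sentence, "label": labels}


def preprocess(tokenizer, texts, max_seq_length=128):
    """按 tokenizer 实际返回的字段组织模型输入,不再硬编码 input_ids/token_type_ids。"""
    data = tokenizer(texts,
                     max_length=max_seq_length,
                     padding=True,
                     truncation=True,
                     return_position_ids=False,
                     return_attention_mask=False)
    tokenized = {}
    for key in data:
        tokenized[key] = np.array(data[key], dtype="int64")
    return tokenized


if __name__ == "__main__":
    # 示例文本仅作演示;换成 "ernie-m-base" 时返回结果中不会包含 token_type_ids
    tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")
    print(preprocess(tokenizer, ["这家酒店交通方便,服务也不错"]))
```

按返回字段动态组织输入后,推理与服务化脚本无需针对不同预训练模型分别硬编码输入名,这也是上面 service.py 与 predictor.py 改动所采用的处理方式。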