English | 简体中文
VQA refers to Visual Question Answering, which mainly answers questions about image content. DOC-VQA is a subset of VQA tasks in which the questions concern the text content of document images.
The DOC-VQA algorithm in PP-Structure is developed based on the PaddleNLP natural language processing algorithm library.
The main features are as follows:
- Integrates the LayoutXLM model and the PP-OCR prediction engine.
- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. The SER task performs text recognition and classification in the image; the RE task extracts relations between the text contents in the image, such as judging question-answer pairs.
- Supports custom training for SER tasks and RE tasks.
- Supports end-to-end system prediction and evaluation of OCR+SER.
- Supports end-to-end system prediction of OCR+SER+RE.
This project is an open-source implementation of LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding on Paddle 2.2, including fine-tuning code on the XFUND dataset.
We evaluate the algorithms on the Chinese subset of the XFUND dataset; the performance is as follows:
| Model | Task | hmean | Model download address |
| --- | --- | --- | --- |
| LayoutXLM | SER | 0.9038 | link |
| LayoutXLM | RE | 0.7483 | link |
| LayoutLMv2 | SER | 0.8544 | link |
| LayoutLMv2 | RE | 0.6777 | link |
| LayoutLM | SER | 0.7731 | link |
Note: The test images are from the XFUND dataset.
Boxes with different colors in the figure represent different categories. For the XFUND dataset, there are 3 categories: `QUESTION`, `ANSWER`, and `HEADER`:
- Dark purple: HEADER
- Light purple: QUESTION
- Army green: ANSWER
The corresponding category and OCR recognition result are also marked at the upper left of each OCR detection box.
The red boxes in the figure represent questions, the blue boxes represent answers, and each question is connected to its answer by a green line. The corresponding category and OCR recognition result are also marked at the upper left of each OCR detection box.
- (1) Install PaddlePaddle
python3 -m pip install --upgrade pip
# GPU installation
python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple
# CPU installation
python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple
For more installation requirements, please refer to the instructions in the Installation Documentation.
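As an optional sanity check, you can confirm that PaddlePaddle is installed and that the runtime works, using PaddlePaddle's built-in `run_check` utility (a minimal check, not required by the steps below):
# optional: verify the PaddlePaddle installation
python3 -c "import paddle; print(paddle.__version__); paddle.utils.run_check()"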
- (1) Quickly install the PaddleOCR whl package with pip (prediction only)
python3 -m pip install paddleocr
- (2) Download VQA source code (prediction + training)
[Recommended] git clone https://github.com/PaddlePaddle/PaddleOCR
# If the pull fails due to network problems, you can also use the repository hosted on Gitee:
git clone https://gitee.com/paddlepaddle/PaddleOCR
# Note: The code hosted on Gitee may not be synchronized with this GitHub project in real time; there can be a delay of 3 to 5 days. Please use the recommended method above first.
- (3) Install the VQA `requirements`
python3 -m pip install -r ppstructure/vqa/requirements.txt
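Since the DOC-VQA algorithms are built on PaddleNLP, you can optionally verify that it was pulled in by the requirements (a minimal check, assuming paddlenlp is listed in requirements.txt):
# optional: check that PaddleNLP is importable
python3 -c "import paddlenlp; print(paddlenlp.__version__)"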
If you want to experience prediction directly, you can download the pretrained models we provide, skip the training process, and predict directly.
- Download the processed dataset
The download address of the processed XFUND Chinese dataset: https://paddleocr.bj.bcebos.com/dataset/XFUND.tar.
Download and unzip the dataset, and place the dataset in the current directory after unzipping.
wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
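Then extract the archive in the current directory; an XFUND/ directory should become available (the subdirectory shown in the comment is an assumption based on the evaluation paths used later in this document):
tar -xf XFUND.tar
# e.g. the validation annotations referenced in the evaluation step:
# XFUND/zh_val/xfun_normalize_val.json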
- Convert the dataset
If you need to train on other subsets of the XFUND dataset, you can use the following command to convert the dataset:
python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=path/to/json_path --output_path=path/to/save_path
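To sanity-check the conversion, you can inspect the first line of the provided processed annotation file (from the XFUND download above) and compare your converted output against it:
# the provided processed label file serves as a format reference
head -n 1 XFUND/zh_val/xfun_normalize_val.json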
- Download the pretrained models
mkdir pretrain && cd pretrain
#download the SER model
wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar && tar -xvf ser_LayoutXLM_xfun_zh.tar
#download the RE model
wget https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar && tar -xvf re_LayoutXLM_xfun_zh.tar
cd ../
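After extraction, the `pretrain/` directory should contain the two model folders referenced by the prediction commands later in this document:
ls pretrain/
# expected to include: re_LayoutXLM_xfun_zh  ser_LayoutXLM_xfun_zh (plus the downloaded .tar archives)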
Before starting training, you need to modify the following four fields in the config file:
- `Train.dataset.data_dir`: the directory where the training set images are stored
- `Train.dataset.label_file_list`: the training set label file
- `Eval.dataset.data_dir`: the directory where the validation set images are stored
- `Eval.dataset.label_file_list`: the validation set label file
- Start training
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml
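Alternatively, the four dataset fields can be overridden on the command line with `-o` instead of editing the yml file; a sketch assuming the processed XFUND layout described above (adjust the paths to your actual data locations):
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml \
    -o Train.dataset.data_dir=XFUND/zh_train/image \
       Train.dataset.label_file_list=[XFUND/zh_train/xfun_normalize_train.json] \
       Eval.dataset.data_dir=XFUND/zh_val/image \
       Eval.dataset.label_file_list=[XFUND/zh_val/xfun_normalize_val.json]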
Finally, `precision`, `recall`, `hmean` and other metrics will be printed. The training log, the best model, and the model of the latest epoch will be saved in the `./output/ser_layoutxlm/` folder.
- Resume training
To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
- Evaluate
Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
Finally, `precision`, `recall`, `hmean` and other metrics will be printed.
- Use the OCR engine + SER tandem prediction
Use the following command to run the OCR engine + SER tandem prediction, taking the pretrained SER model as an example:
CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_42.jpg
Finally, the prediction result visualization image and the prediction result text file will be saved in the directory configured by the `Global.save_res_path` field of the config. The prediction result text file is named `infer_results.txt`.
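To take a quick look at the output (the `./output/ser/` path below is only an assumption; use whatever directory `Global.save_res_path` points to in your config):
head -n 1 ./output/ser/infer_results.txt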
- End-to-end evaluation of the OCR engine + SER prediction system
First use the `tools/infer_vqa_token_ser.py` script to run prediction on the dataset, then use the following command to evaluate.
export CUDA_VISIBLE_DEVICES=0
python3 tools/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
- Start training
Before starting training, you need to modify the following four fields in the config file:
- `Train.dataset.data_dir`: the directory where the training set images are stored
- `Train.dataset.label_file_list`: the training set label file
- `Eval.dataset.data_dir`: the directory where the validation set images are stored
- `Eval.dataset.label_file_list`: the validation set label file
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml
Finally, `precision`, `recall`, `hmean` and other metrics will be printed. The training log, the best model, and the model of the latest epoch will be saved in the `./output/re_layoutxlm/` folder.
- Resume training
To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
- Evaluate
Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
Finally, `precision`, `recall`, `hmean` and other metrics will be printed.
- Use the OCR engine + SER + RE tandem prediction
Use the following command to run the OCR engine + SER + RE tandem prediction, taking the pretrained SER and RE models as an example:
export CUDA_VISIBLE_DEVICES=0
python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/
Finally, the prediction result visualization image and the prediction result text file will be saved in the directory configured by the `Global.save_res_path` field of the config. The prediction result text file is named `infer_results.txt`.
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
- XFUND dataset, https://github.com/doc-analysis/XFUND
The content of this project itself is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.