PPstructure table recognition error #10649
Comments
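I have the same problem. Has it been solved?
@xxcoco763 After checking, this looks like a model problem: on complex tables the model is not accurate enough, so it gets the number of rows and columns wrong. You can take a look at https://blog.csdn.net/weixin_44451785/article/details/105888966 and handle the table manually for now.
The link above is dead. Are there any other articles or methods for dealing with colspan and rowspan?
@nissansz The link was just badly formatted; remove everything after the comma and it works. Or search directly for "python对图片中的表格拆分" (splitting tables in images with Python).
@nissansz If it is a scanned document and the table lines are broken, consider image enhancement with opencv, for example morphological dilation to thicken the table lines. If the image is a photo that has been distorted, you may have to repair it manually with a tool such as Photoshop.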
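To illustrate what dilation does to a broken ruling line, here is a dependency-free sketch on a tiny binary grid. The `dilate` helper and the sample grid are made up for illustration; on a real scan one would more likely binarize the image and call `cv2.dilate` with a kernel from `cv2.getStructuringElement`, and the kernel size would need tuning per document.

```python
def dilate(grid, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square kernel.

    grid is a list of rows of 0/1 values; 1 means "ink" (a table-line pixel).
    Hypothetical helper for illustration only.
    """
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                # Paint the whole neighbourhood of every ink pixel.
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


# A horizontal table line with a one-pixel gap at x == 3.
broken = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
thick = dilate(broken)
print(thick[1])  # the gap in the middle row is closed
```

The gap closes because both neighbours of the missing pixel are ink, but the line also becomes three pixels tall, which is why the kernel shape and size matter on real documents.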
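When thickening the table lines with dilation and the like, how do you tell that all the table lines have been found?
There is probably no good way to do that. It comes down to opencv plus parameter tuning, and different documents may need different settings. For example, detecting table lines with the Hough transform involves quite a few parameters that have to be tried by hand.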
The error is probably raised when merging cells in the last row. Adding a check at line 269 should fix it: if cell_row + rowspan - 1 == len(rows): cell_to_merge = table.cell(cell_row + rowspan - 1,
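The snippet in the comment above is cut off, so the following is only a sketch of the general idea: clamp the bottom-right cell of a merge so it never falls outside the table. `clamp_merge_target` and its parameters are hypothetical names, not part of PaddleOCR or python-docx.

```python
def clamp_merge_target(cell_row, cell_col, rowspan, colspan, n_rows, n_cols):
    """Return (row, col) of the bottom-right cell to merge with.

    The indices are clamped to an n_rows x n_cols grid, so a span that is
    too large (e.g. colspan=5 in a 4-column table) no longer indexes
    past the last row or column.
    """
    last_row = min(cell_row + rowspan - 1, n_rows - 1)
    last_col = min(cell_col + colspan - 1, n_cols - 1)
    return last_row, last_col


# colspan=5 starting at column 0 of a 3 x 4 table is clamped to column 3.
print(clamp_merge_target(0, 0, 1, 5, 3, 4))

# Assumed usage with python-docx (not executed here):
#   r, c = clamp_merge_target(cell_row, cell_col, rowspan, colspan, n_rows, n_cols)
#   docx_cell.merge(table.cell(r, c))
```

Clamping only avoids the exception. The resulting document still reflects whatever structure the model predicted, so a wrong span stays wrong, just without the crash.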
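Which .py file is this?
The error I found occurs when the table is extracted to HTML.
When I train a table model, spans sometimes come out correctly and sometimes disappear. The results are also never as good as the official model.
Ordinary table recognition works fine. One issue is the PDF-to-image conversion: mat = fitz.Matrix(2, 2) uses a zoom factor of 2, which can leave the image blurry. You can save the image and look at it; if it is very blurry, change the zoom factor. The downscaling under if pm.width > 2000 or pm.height > 2000: should not be applied either.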
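As a rough guide for choosing the zoom factor: PDF page sizes are measured in points (72 per inch), so a zoom of 2 corresponds to about 144 DPI. The sketch below only does the arithmetic; `zoom_for_dpi` and `rendered_size` are hypothetical helpers, and the A4 page size of 595 x 842 points is an example value.

```python
PDF_POINTS_PER_INCH = 72  # PDF user-space units per inch


def zoom_for_dpi(target_dpi):
    """Zoom factor that renders a PDF page at roughly target_dpi."""
    return target_dpi / PDF_POINTS_PER_INCH


def rendered_size(page_width_pt, page_height_pt, zoom):
    """Approximate pixel size of the rendered page for a given zoom."""
    return round(page_width_pt * zoom), round(page_height_pt * zoom)


a4 = (595, 842)  # A4 in points, example value
print(rendered_size(*a4, zoom=2.0))                # size at the zoom of 2 from the thread
print(rendered_size(*a4, zoom=zoom_for_dpi(300)))  # size at about 300 DPI

# Assumed usage with PyMuPDF (not executed here):
#   zoom = zoom_for_dpi(300)
#   pm = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
```

At about 300 DPI an A4 page is wider and taller than 2000 pixels, so a branch that shrinks anything above 2000 pixels would undo the higher zoom, which matches the advice above to skip that downscaling.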
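Does anyone have a trained model to share?
No, I have not trained one either.
With CRNN training, the English recognition rate is poor and the spaces between English words are missing. Is there a way to fix this?
Are there any good parameter settings?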
Please provide the following information to quickly locate the problem
System Environment:
Windows 10 Home Chinese Edition 22H2
1060MaxQ + CUDA 11.6
Version:
Python: 3.9, anaconda
Paddle: paddlepaddle-gpu==2.5.1.post116
PaddleOCR: 2.7
Related components: PPStructure layout recovery
Command Code:
python predict_system.py --image_dir=3.pdf --det_model_dir=inference/ch_PP-OCRv4_det_infer --rec_model_dir=inference/ch_PP-OCRv4_rec_infer --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt --layout_model_dir=inference/picodet_lcnet_x1_0_fgd_layout_cdla_infer --layout_dict_path=../ppocr/utils/dict/layout_dict/layout_cdla_dict.txt --vis_font_path=../doc/fonts/simfang.ttf --recovery=True --output=./output/ --use_gpu=False
Complete Error Message:
![image](https://private-user-images.githubusercontent.com/43161566/260957438-c09c9e6f-40ed-48b0-af03-09f87f66c326.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDcxOTUsIm5iZiI6MTczOTMwNjg5NSwicGF0aCI6Ii80MzE2MTU2Ni8yNjA5NTc0MzgtYzA5YzllNmYtNDBlZC00OGIwLWFmMDMtMDlmODdmNjZjMzI2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDIwNDgxNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNhNmQwMWI4Zjg2ZDJjMTFmOTlkZWRhOGZjMjYzOTU1YTZlYzVkM2Q5ZGRjOTkzMDMzMWNjOTAyZDlmZDEzNGQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.pcx1LBaTNX9CJlx7MneoXFqe6iUAXrkm1EWHZm8xZrM)
The table in the PDF document looks like this:
With layout recovery enabled, the run ends with
![image](https://private-user-images.githubusercontent.com/43161566/260956347-85848adc-abfe-488d-b3a1-7a1f237ec024.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDcxOTUsIm5iZiI6MTczOTMwNjg5NSwicGF0aCI6Ii80MzE2MTU2Ni8yNjA5NTYzNDctODU4NDhhZGMtYWJmZS00ODhkLWIzYTEtN2ExZjIzN2VjMDI0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDIwNDgxNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWE5NWE4MjU4OTkyMDRmODllZTkyMThhMjRkNDFhODM2ZWJjNWIwNjYwNDNmYjY4NmZjYTk1YTU5ZDQ3NDg3ODImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.3I00IepRzJDw-pJr-x_J_aCqIYtqSk5zRdVnE1sONxc)
ppocr ERROR: error in layout recovery image:1.pdf, err msg: list index out of range
An output file is still produced
![image](https://private-user-images.githubusercontent.com/43161566/260956762-b80fe2ab-f87e-4497-9ef1-01b6717f5a86.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDcxOTUsIm5iZiI6MTczOTMwNjg5NSwicGF0aCI6Ii80MzE2MTU2Ni8yNjA5NTY3NjItYjgwZmUyYWItZjg3ZS00NDk3LTllZjEtMDFiNjcxN2Y1YTg2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDIwNDgxNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJiYmI0MmFiMmZlYTliYTk5NDA4NTc4YjQyNGQxMWI2MzdkNjZjNzBjNmUzNGY0NTg1OWFhMjgwNGNiZDViMmQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.HPNjZ6vnjJhVJIhWu3KxkqWR9pKZboFYrrr4zmAixOM)
However, the problem occurs at line 292 of predict_system.py, in the call to convert_info_docx(img, all_res, save_folder, img_name)
Locating the problem:
![image](https://private-user-images.githubusercontent.com/43161566/260958952-838a52b7-c36e-4fc2-b695-548e8ada3cbd.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDcxOTUsIm5iZiI6MTczOTMwNjg5NSwicGF0aCI6Ii80MzE2MTU2Ni8yNjA5NTg5NTItODM4YTUyYjctYzM2ZS00ZmMyLWI2OTUtNTQ4ZThhZGEzY2JkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDIwNDgxNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTEwYzNjZmQ0YWQ4OTFlMjY3YWRmNmFhYTc3ODNhYjBlOGNjZWE4ZjU3MGI5NDY5MjYxOTFkNTNkODUxZmNlODgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.q9mlZxRIkIgZsV4GuboLZ_FHqQ4FxgsIJ7YlvW9M3Q0)
![image](https://private-user-images.githubusercontent.com/43161566/260959465-b7d1f2e6-d862-479f-be2a-b93ea048f3dc.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDcxOTUsIm5iZiI6MTczOTMwNjg5NSwicGF0aCI6Ii80MzE2MTU2Ni8yNjA5NTk0NjUtYjdkMWYyZTYtZDg2Mi00NzlmLWJlMmEtYjkzZWEwNDhmM2RjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDIwNDgxNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWYxMjlmMWMyZmY4ZjQ0ODQ4N2E0ZmU4MTAxZWM5M2Q1MDhlZTIyZWRlNGFiNDRiZjhmODEyNTU4MjAzMTg4ODImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.oV2d0Qkd16k6h9BRwdJqoY3PR_WktcLnqA7y_Mx7CM0)
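The cause is that the row and column counts of the recognized table are malformed
predict_system.py calls convert_info_docx
which goes to recovery_to_doc.py line 63, parser.handle_table(region['res']['html'], doc)
which calls recovery/table_process.py line 238, def handle_table(self, html, doc)
The res of the recognized table:
The generated html is as follows:
Inside the handle_table function, the extracted cols_len = 4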
![image](https://private-user-images.githubusercontent.com/43161566/260960863-a29b29d3-0466-4e23-80bd-8c0b6d1e2587.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMDcxOTUsIm5iZiI6MTczOTMwNjg5NSwicGF0aCI6Ii80MzE2MTU2Ni8yNjA5NjA4NjMtYTI5YjI5ZDMtMDQ2Ni00ZTIzLTgwYmQtOGMwYjZkMWUyNTg3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDIwNDgxNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTc5YmI4OTgwYzE0NTkzZjc2YTQyYTFlNWU2ZWU2YzFkN2VlMmViNDJiMDk2NjQxZjIzMjQwMjE3M2VhYjJjNjcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.HtYK1XpVLibf-usSWrWjhW0oMcWgd5jOAjktmxCY6Kk)
But the table html incorrectly contains colspan = 5
which leads to list index out of range inside the function
This colspan=5 is produced during table analysis. I cannot fix it myself and need help.
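One possible stopgap, until the model output itself is fixed, is to sanitize the html before it reaches handle_table so that no cell spans past cols_len. The sketch below is an assumption-laden illustration, not PaddleOCR code: `clamp_colspans` is a hypothetical helper, it uses regular expressions rather than a real HTML parser, and it ignores rowspan entirely.

```python
import re

ROW_RE = re.compile(r"<tr\b[^>]*>.*?</tr>", re.S | re.I)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>", re.I)
COLSPAN_RE = re.compile(r'colspan\s*=\s*["\']?(\d+)["\']?', re.I)


def clamp_colspans(html, cols_len):
    """Shrink any colspan that would run past column cols_len.

    Columns are counted per row from left to right; rowspan is ignored,
    so this is only an approximation for tables with vertical merges.
    A cell that starts beyond cols_len keeps a span of 1.
    """
    def fix_row(row_match):
        col = 0  # index of the next free column in this row

        def fix_cell(cell_match):
            nonlocal col
            tag = cell_match.group(0)
            m = COLSPAN_RE.search(tag)
            span = int(m.group(1)) if m else 1
            allowed = max(1, min(span, cols_len - col))
            col += allowed
            if m and allowed != span:
                tag = tag[:m.start()] + f'colspan="{allowed}"' + tag[m.end():]
            return tag

        return CELL_RE.sub(fix_cell, row_match.group(0))

    return ROW_RE.sub(fix_row, html)


html = (
    "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>"
    '<tr><td colspan="5">e</td></tr></table>'
)
fixed = clamp_colspans(html, 4)
print(fixed)
```

This only hides the symptom: the exported table no longer crashes the conversion, but the predicted structure may still differ from the original document.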