[OCR Issue No.9] Remove dependencies that clearly do not belong in ppocr's dependency list #11946

Merged
merged 11 commits into from
Apr 26, 2024


Liyulingyue
Collaborator

@Liyulingyue Liyulingyue commented Apr 16, 2024

#11906 task9
#11924

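To reduce paddleocr's dependencies, some packages are removed from requirements.txt and imported through paddle.utils.try_import instead. When a user reaches code that needs one of them, they are prompted to install it.
The removed dependencies are:

  • pdf2docx: converts PDF to Word
  • lxml: used in the TEDS::evaluate function in table_metric.py, which should only be needed during the validation/evaluation stage
  • premailer: converts HTML to Excel
  • openpyxl: helps write the Excel output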
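
The lazy-import pattern described above can be sketched roughly as follows. This is a minimal stand-in written with the standard library only, not Paddle's actual `paddle.utils.try_import` implementation; the names `lazy_import` and `load_excel_backend` are hypothetical and exist only for illustration.

```python
import importlib


def lazy_import(module_name: str):
    """Import a module on first use; raise a helpful ImportError if missing.

    Hypothetical stand-in for paddle.utils.try_import, for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Tell the user which package to install instead of failing obscurely.
        raise ImportError(
            f"Failed importing {module_name}. "
            f"It is an optional dependency; try `pip install {module_name}`."
        ) from exc


def load_excel_backend():
    # The optional package is only needed when this code path runs,
    # so users who never export to Excel do not have to install it.
    return lazy_import("openpyxl")
```

Because the import happens inside the function rather than at module top level, importing the package itself never fails when an optional dependency is absent.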

paddle-bot bot commented Apr 16, 2024

Thanks for your contribution!

@Liyulingyue
Collaborator Author

@GreatV @sunzhongkai588

Collaborator

@GreatV GreatV left a comment

Shouldn't there be some mechanism that installs the dependencies automatically when running paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --use_pdf2docx_api=true?

@Liyulingyue
Collaborator Author

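> Shouldn't there be some mechanism that installs the dependencies automatically when running paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --use_pdf2docx_api=true?

Installing at runtime would work, but I am not sure that installing packages at runtime is a good idea. I would rather recommend specifying these packages through an extras-style ("extern") mechanism at install time.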

@GreatV
Collaborator

GreatV commented Apr 17, 2024

Something like pip install "paddleocr[structure]"
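
One way such an extra could be declared with setuptools is sketched below. The group name `structure` comes from the comment above; the package list simply mirrors the four dependencies removed in this PR, and whether the project would group them this way is an assumption. The helper `install_hint` is hypothetical.

```python
# Hypothetical extras table for setup.py; not taken from the PaddleOCR repo.
EXTRAS_REQUIRE = {
    "structure": ["pdf2docx", "lxml", "premailer", "openpyxl"],
}


def install_hint(extra: str) -> str:
    """Return the pip command a user would run to pull in one extras group."""
    if extra not in EXTRAS_REQUIRE:
        raise KeyError(f"unknown extra: {extra}")
    return f'pip install "paddleocr[{extra}]"'


# In setup.py this table would be passed to setuptools as:
#     setup(..., extras_require=EXTRAS_REQUIRE)
```

With this layout a plain `pip install paddleocr` stays lightweight, and only users who ask for the extra receive the additional packages.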

@Liyulingyue Liyulingyue requested a review from GreatV April 25, 2024 13:07
@Liyulingyue
Collaborator Author

@jzhang533

@GreatV GreatV requested a review from jzhang533 April 25, 2024 13:24
@GreatV
Collaborator

GreatV commented Apr 25, 2024

I tested it and it works as expected.

pip uninstall lxml
Traceback (most recent call last):
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddle/utils/lazy_import.py", line 32, in try_import
    mod = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/premailer/__init__.py", line 1, in <module>
    from .premailer import Premailer, transform  # noqa
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/premailer/premailer.py", line 12, in <module>
    from lxml import etree
ModuleNotFoundError: No module named 'lxml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/greatx/repos/PaddleOCR/tools/test.py", line 11, in <module>
    save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/predict_system.py", line 279, in save_structure_res
    to_excel(region["res"]["html"], excel_path)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 153, in to_excel
    tablepyxl.document_to_xl(html_table, excel_path)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/tablepyxl/tablepyxl.py", line 118, in document_to_xl
    wb = document_to_workbook(doc, base_url=base_url)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/tablepyxl/tablepyxl.py", line 93, in document_to_workbook
    try_import("premailer")
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddle/utils/lazy_import.py", line 41, in try_import
    raise ImportError(err_msg)
ImportError: Failed importing premailer. This likely means that some paddle modules require additional dependencies that have to be manually installed (usually with `pip install premailer`). 
pip uninstall premailer
Traceback (most recent call last):
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddle/utils/lazy_import.py", line 32, in try_import
    mod = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'premailer'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/greatx/repos/PaddleOCR/tools/test.py", line 11, in <module>
    save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/predict_system.py", line 279, in save_structure_res
    to_excel(region["res"]["html"], excel_path)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 153, in to_excel
    tablepyxl.document_to_xl(html_table, excel_path)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/tablepyxl/tablepyxl.py", line 118, in document_to_xl
    wb = document_to_workbook(doc, base_url=base_url)
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/tablepyxl/tablepyxl.py", line 93, in document_to_workbook
    try_import("premailer")
  File "/home/greatx/repos/PaddleOCR/venv/lib/python3.10/site-packages/paddle/utils/lazy_import.py", line 41, in try_import
    raise ImportError(err_msg)
ImportError: Failed importing premailer. This likely means that some paddle modules require additional dependencies that have to be manually installed (usually with `pip install premailer`).

@GreatV
Collaborator

GreatV commented Apr 25, 2024

Should we create a requirements.txt under the ppstructure directory so that users can manually install the dependencies in one step?

@Liyulingyue
Collaborator Author

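> Should we create a requirements.txt under the ppstructure directory so that users can manually install the dependencies in one step?

Since everything is already packaged as ppocr, users can no longer install the dependencies in one step from a requirements file, unless we add a function dedicated to installing them.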
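
A dedicated helper of the kind mentioned here might look like the sketch below. It only reports what is missing and builds a pip command rather than installing anything, in line with the reservation about runtime installs voiced earlier in the thread. The names `OPTIONAL_PACKAGES`, `missing_packages`, and `install_command` are hypothetical, and the sketch assumes each package's import name matches its pip name.

```python
import importlib.util

# The four packages removed from the requirements in this PR.
# Assumption: each import name is the same as the name used with pip.
OPTIONAL_PACKAGES = ["pdf2docx", "lxml", "premailer", "openpyxl"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def install_command(names):
    """Build a pip command for the missing packages, or None if all are present."""
    missing = missing_packages(names)
    if not missing:
        return None
    return "pip install " + " ".join(missing)
```

A caller could print `install_command(OPTIONAL_PACKAGES)` before running a structure-recovery job, leaving the actual installation to the user.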

@GreatV
Collaborator

GreatV commented Apr 25, 2024

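> Since everything is already packaged as ppocr, users can no longer install the dependencies in one step from a requirements file

I mean the case where the user has cloned the repository.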

@Liyulingyue
Collaborator Author

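> Since everything is already packaged as ppocr, users can no longer install the dependencies in one step from a requirements file
>
> I mean the case where the user has cloned the repository.

That means ppstructure/recovery/requirements.txt needs to be kept, with all of the try_import packages added to it.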

@GreatV
Collaborator

GreatV commented Apr 25, 2024

> That means ppstructure/recovery/requirements.txt needs to be kept, with all of the try_import packages added to it.

For now, remove ppstructure/recovery/requirements.txt from setup.py; we can revisit this when we migrate to pyproject.toml.

@Liyulingyue
Collaborator Author

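> That means ppstructure/recovery/requirements.txt needs to be kept, with all of the try_import packages added to it.
>
> For now, remove ppstructure/recovery/requirements.txt from setup.py; we can revisit this when we migrate to pyproject.toml.

Regarding this removal, I suggest we first decide whether ppstructure should stay in this project. If it should not, we can remove it once the migration is complete.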

@jzhang533 jzhang533 merged commit b5eedf7 into PaddlePaddle:main Apr 26, 2024
3 checks passed
@Liyulingyue Liyulingyue deleted the requirements_short branch May 25, 2024 05:47
4 participants