[Bug] Pdf2parquet inbuilt ocr error #1042

Closed
1 of 2 tasks
ShiroYasha18 opened this issue Feb 11, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@ShiroYasha18

Search before asking

  • I searched the issues and found no similar issues.

Component

Documentation

What happened + What you expected to happen

I was trying to run pdf2parquet on Google Colab using the built-in OCR (EasyOCR, to be specific). It throws an error when the parameters are set according to the documentation.

Reproduction script

Image

Image

I have attached images of the errors and of the documentation describing the pdf2parquet parameters.

Anything else

If I remove the do_ocr and ocr_engine parameters, it works just fine, so everything else is working and all dependencies are installed properly.

OS

Other

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@ShiroYasha18 ShiroYasha18 added the bug Something isn't working label Feb 11, 2025
@agoyal26
Collaborator

@dolfim-ibm Please see above. Is this related to Docling functionality?

@dolfim-ibm
Member

dolfim-ibm commented Feb 12, 2025

I think you have to prefix the arguments with pdf2parquet. Whether this is needed depends on whether the transform is used via the DPK launcher or standalone (as with all other transforms).

Example at https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/dpk_pdf2parquet/local_python.py#L44

@ShiroYasha18
Author

ShiroYasha18 commented Feb 12, 2025

UPDATE:
OK, prepending pdf2parquet worked for the do_ocr param, but I still cannot change the ocr_engine from easyocr to tesseract.

However, it works fine if the params are:
"pdf2parquet_do_ocr": True,
"pdf2parquet_ocr_engine": "tesseract_cli",

So the OCR engine works fine for easyocr and tesseract_cli.
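
For readers following along, here is a minimal sketch of the prefixing discussed above. The helper name with_transform_prefix is hypothetical and not part of data-prep-kit; the only names taken from this thread are the pdf2parquet_ prefix and the do_ocr and ocr_engine parameters. It only builds the parameter dictionary and does not launch the transform.

```python
def with_transform_prefix(params: dict, prefix: str = "pdf2parquet_") -> dict:
    """Return a copy of params in which every key carries the transform prefix.

    Hypothetical helper for illustration; keys that already start with the
    prefix are left unchanged so the function is safe to apply twice.
    """
    return {
        (key if key.startswith(prefix) else prefix + key): value
        for key, value in params.items()
    }


# Unprefixed names, as in the original (failing) attempt.
raw_params = {"do_ocr": True, "ocr_engine": "tesseract_cli"}

# Prefixed names, matching the form reported as working in this thread.
launcher_params = with_transform_prefix(raw_params)
print(launcher_params)
```

Whether the prefix is required depends on how the transform is invoked, as noted earlier in the thread, so treat this as a convenience for the launcher case only.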

@dolfim-ibm
Member

Do you get an error message or something else?

@ShiroYasha18
Author

ShiroYasha18 commented Feb 12, 2025

Exception creating transform tesserocr is not correctly installed. Please install it via pip install tesserocr to use this OCR engine. Note that tesserocr might have to be manually compiled for working withyour Tesseract installation. The Docling documentation provides examples for it. Alternatively, Docling has support for other OCR engines. See the documentation.
Traceback (most recent call last):

Mostly this. I uninstalled and reinstalled both tesseract and tesserocr, but it still does not work.

@dolfim-ibm
Member

Ok, then it is the non-trivial tesserocr installation. We described it a bit at https://ds4sd.github.io/docling/installation/.

Very likely, all you need is to run this:

pip uninstall tesserocr
pip install --no-binary :all: tesserocr

Unfortunately, this is caused by the tesserocr PyPI wheel linking to a specific version of tesseract, which is not what users have locally. The command above requests compiling from source and linking against the tesseract version on your system.
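
As a rough way to check whether the rebuilt package is usable before selecting the engine, one could try importing it and fall back otherwise. This is only a sketch: the functions module_usable and choose_engine are hypothetical, the engine names "tesseract" and "tesseract_cli" are the ones mentioned in this thread, and a successful import does not guarantee that tesserocr works with the local Tesseract installation.

```python
import importlib


def module_usable(name: str) -> bool:
    """Return True if the named module imports without raising."""
    try:
        importlib.import_module(name)
    except Exception:
        # A missing package or a failed link to the native library both land here.
        return False
    return True


def choose_engine(preferred: str = "tesseract",
                  fallback: str = "tesseract_cli",
                  module: str = "tesserocr") -> str:
    """Pick the preferred engine only if its Python binding can be imported."""
    return preferred if module_usable(module) else fallback


# Assumption: the resulting name is passed as the pdf2parquet_ocr_engine value.
print(choose_engine())
```

The fallback to tesseract_cli mirrors what was reported as working in this thread when tesserocr could not be loaded.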

@ShiroYasha18
Author

Yeah, running that did not fix it, but the other two engines are working now, so I am good to go. Just a small recommendation for the docs: maybe mention the prepend requirement with an asterisk, and it will be awesome!

Thanks a ton for the help !
