[Bug] Pdf2parquet inbuilt ocr error #1042

ShiroYasha18 · 2025-02-11T15:26:58Z

Search before asking

I searched the issues and found no similar issues.

Component

Documentation

What happened + What you expected to happen

I was trying to run pdf2parquet on Google Colab using the built-in OCR, EasyOCR to be specific. It does not work: it throws an error when the parameters are set according to the documentation.

Reproduction script

I have attached images of the errors and of the documentation that describes the pdf2parquet parameters.

Anything else

If I remove the do_ocr and ocr_engine parameters, it works just fine, so everything else is working and all dependencies are installed properly.

OS

Other

Python

3.10.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!

agoyal26 · 2025-02-12T05:18:37Z

@dolfim-ibm Please see above. Is this related to Docling functionality?

dolfim-ibm · 2025-02-12T05:36:20Z

I think you have to prefix the arguments with pdf2parquet_ (for example, pdf2parquet_do_ocr). Whether the prefix is needed depends on whether the transform is used via the DPK launcher or standalone (as with all other transforms).

Example at https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/dpk_pdf2parquet/local_python.py#L44
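The naming rule can be sketched in a few lines of Python. This is only an illustration: with_transform_prefix is a hypothetical helper, not part of data-prep-kit, and the parameter names and values are the ones mentioned in this thread.

```python
# Hypothetical helper (not a data-prep-kit API): it only illustrates the
# "prefix every transform argument with pdf2parquet_" rule described above.
PREFIX = "pdf2parquet_"


def with_transform_prefix(params: dict, prefix: str = PREFIX) -> dict:
    """Return a copy of params in which every key carries the transform prefix.

    Keys that already start with the prefix are left unchanged, so the
    function can be applied twice without doubling the prefix.
    """
    return {
        (key if key.startswith(prefix) else prefix + key): value
        for key, value in params.items()
    }


# Unprefixed names as they appear in the parameter documentation.
documented_params = {"do_ocr": True, "ocr_engine": "easyocr"}

# Prefixed names, the form that the linked local_python.py example uses.
launcher_params = with_transform_prefix(documented_params)
print(launcher_params)
```

The resulting dictionary would then be merged with the remaining launcher parameters, as in the linked example.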

ShiroYasha18 · 2025-02-12T09:09:58Z

UPDATE:
OK, prepending pdf2parquet_ worked for the do_ocr param, but I still cannot change the ocr_engine from easyocr to tesseract.

However, it works fine if the params are:
"pdf2parquet_do_ocr": True,
"pdf2parquet_ocr_engine": "tesseract_cli",

So the OCR engine works fine for easyocr and tesseract_cli, just not for tesseract.

dolfim-ibm · 2025-02-12T10:02:57Z

Do you get an error message or something else?

ShiroYasha18 · 2025-02-12T10:08:13Z

Exception creating transform tesserocr is not correctly installed. Please install it via pip install tesserocr to use this OCR engine. Note that tesserocr might have to be manually compiled for working with your Tesseract installation. The Docling documentation provides examples for it. Alternatively, Docling has support for other OCR engines. See the documentation.
Traceback (most recent call last):

This is most of the message. I uninstalled and reinstalled both tesseract and tesserocr, and it still does not work.

dolfim-ibm · 2025-02-12T10:42:04Z

Ok, then it is the non-trivial tesserocr installation. We described it a bit at https://ds4sd.github.io/docling/installation/.

Very likely, all you need is to run this:

pip uninstall tesserocr
pip install --no-binary :all: tesserocr

Unfortunately, this is caused by the tesserocr PyPI wheel linking to a specific version of tesseract, which is not the one users have locally. The command above requests compiling from source and linking against the tesseract version on your system.
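To see which OCR backends are usable before choosing ocr_engine, a quick probe like the following may help. It is a rough sketch under assumptions: ocr_backend_status and can_import are hypothetical names, the probe assumes that the tesseract engine needs the tesserocr Python module (as the error message suggests) and that tesseract_cli needs a tesseract executable on PATH. Importing easyocr can take a while because it loads its own dependencies.

```python
import importlib
import shutil


def can_import(module_name: str) -> bool:
    """Return True if the module imports cleanly in this environment.

    A real import is attempted (not just a lookup) because a wheel that is
    linked against the wrong Tesseract version may only fail at import time.
    """
    try:
        importlib.import_module(module_name)
        return True
    except Exception:
        # Covers a missing package as well as load errors from native code.
        return False


def ocr_backend_status() -> dict:
    """Map each ocr_engine value from this thread to a usable/unusable flag."""
    return {
        "easyocr": can_import("easyocr"),
        # Assumption: the "tesseract" engine relies on the tesserocr bindings.
        "tesseract": can_import("tesserocr"),
        # Assumption: "tesseract_cli" shells out to the tesseract binary.
        "tesseract_cli": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for engine, usable in ocr_backend_status().items():
        print(f"{engine}: {'ok' if usable else 'not available'}")
```

If tesseract shows as not available while tesseract_cli is fine, that matches the situation described here, and using tesseract_cli is a reasonable fallback.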

ShiroYasha18 · 2025-02-13T05:21:11Z

Yeah, running that did not fix it, but the other two engines are working now, so I am good to go. Just a small recommendation for the docs: maybe mention the prefix requirement with an asterisk, and it will be awesome!

Thanks a ton for the help!

ShiroYasha18 added the bug label Feb 11, 2025

ShiroYasha18 closed this as completed Feb 13, 2025
