Force Docspell's OCR Engine to apply #1628

Closed
Snify89 opened this issue Jul 6, 2022 · 0 comments · Fixed by #1634

Snify89 commented Jul 6, 2022

According to the documentation, this setting controls when OCR is run on PDF files:

# For PDF files it is first tried to read the text parts of the
# PDF. But PDFs can be complex documents and they may contain text
# and images. If the returned text is shorter than the value
# below, OCR is run afterwards. Then both extracted texts are
# compared and the longer will be used.
DOCSPELL_JOEX_EXTRACTION_PDF_MIN__TEXT__LEN=500

It would be nice to be able to set a value like "-1" to force Docspell to always use its own OCR result.
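To make the request concrete, here is a minimal sketch of the selection rule as the documentation comment describes it, plus the proposed "-1" behaviour. This is not Docspell's actual code: the function name `choose_text`, its parameters, the tie-breaking choice, and the handling of negative values are all assumptions made for illustration.

```python
from typing import Callable


def choose_text(pdf_text: str, run_ocr: Callable[[], str], min_text_len: int) -> str:
    """Hypothetical sketch of the documented rule; not Docspell's implementation."""
    # Proposed behaviour (assumption): a negative threshold always forces OCR.
    if min_text_len < 0:
        return run_ocr()
    # Documented behaviour: embedded text that is long enough is used as-is.
    if len(pdf_text) >= min_text_len:
        return pdf_text
    # Otherwise OCR runs and the longer of the two texts is used.
    # Keeping the embedded text on a tie is an assumption of this sketch.
    ocr_text = run_ocr()
    return ocr_text if len(ocr_text) > len(pdf_text) else pdf_text
```

With a threshold of -1, the embedded text would be ignored no matter how long it is.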

My motivation:
I have plenty of PDF files that have already been OCRed.
However, some of them were processed incorrectly or not accurately enough:
the wrong language was used, there are encoding errors, etc.

For example:
I have a few (already OCRed) PDF files whose text looks like this:
"T H I S I S A T E S T" instead of "THIS IS A TEST"

Because of the extra spaces, the text already embedded in such a file is most likely long enough to pass joex's length check. It is also always longer than the text produced by joex's own OCR, which is shorter but more accurate, so the inaccurate embedded text always wins.
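A quick length check on the example strings shows the effect. The variable names are only illustrative, and the "longer text wins" comparison follows the documentation comment quoted earlier rather than Docspell's source.

```python
# Text layer left behind by the faulty earlier OCR run.
embedded = "T H I S I S A T E S T"
# What a correct OCR pass would be expected to return.
fresh_ocr = "THIS IS A TEST"

print(len(embedded), len(fresh_ocr))  # prints: 21 14

# The "longer text wins" rule therefore picks the inaccurate embedded text.
longer = max(embedded, fresh_ocr, key=len)
print(longer == embedded)  # prints: True
```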

It would be nice to be able to always force joex's OCR result to be used.

@eikek eikek added this to the Docspell 0.38.0 milestone Jul 7, 2022
eikek added a commit that referenced this issue Jul 7, 2022
@mergify mergify bot closed this as completed in #1634 Jul 7, 2022