Force Docspell's OCR Engine to apply #1628

Snify89 · 2022-07-06T09:10:11Z

According to the documentation, you can configure the following setting:

# For PDF files it is first tried to read the text parts of the
# PDF. But PDFs can be complex documents and they may contain text
# and images. If the returned text is shorter than the value
# below, OCR is run afterwards. Then both extracted texts are
# compared and the longer will be used.
DOCSPELL_JOEX_EXTRACTION_PDF_MIN__TEXT__LEN=500

It would be nice to be able to set a value like "-1" to force the text from Docspell's own OCR to be used.

My motivation:
I have plenty of PDF files that have already been OCRed.
However, some of them were processed incorrectly or are not accurate enough.
Some were processed with the wrong language, have encoding errors, etc.

For example:
I have a few (already OCRed) PDF files that contain text like this:
"T H I S I S A T E S T" instead of "THIS IS A TEST"

Because of this, the embedded text of such an already OCRed file is likely to exceed the joex length check, and it is always longer than the text produced by joex's own OCR (which is shorter, but more accurate).
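
To make the problem concrete, here is a minimal sketch of the selection rule as the config comment describes it. This is not Docspell's actual code: the function name `select_text`, the `run_ocr` callback, and the tie-breaking rule are assumptions made only for illustration.

```python
from typing import Callable


def select_text(embedded: str, run_ocr: Callable[[], str], min_text_len: int = 500) -> str:
    """Hypothetical model of the documented behavior, not Docspell's implementation."""
    # Embedded text that reaches the threshold is used directly; OCR is skipped.
    if len(embedded) >= min_text_len:
        return embedded
    # Otherwise OCR runs and the longer of the two texts is used.
    # Assumption: on a tie the embedded text is kept.
    ocr = run_ocr()
    return ocr if len(ocr) > len(embedded) else embedded


broken = "T H I S I S A T E S T"   # text layer of the badly OCRed PDF
accurate = "THIS IS A TEST"        # what a fresh OCR run would return

chosen = select_text(broken, lambda: accurate)
print(chosen)
```

In this sketch the spaced-out text wins because it is longer, even though the fresh OCR result is the accurate one.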

It would be nice to be able to always force joex's OCR result to be used.
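
As a rough sketch of what I am asking for: a negative threshold could mean "ignore the embedded text and always use the OCR result". The function name `select_text_forced` and the meaning given to negative values are hypothetical, chosen only to illustrate the request, and do not describe how Docspell implements it.

```python
from typing import Callable


def select_text_forced(embedded: str, run_ocr: Callable[[], str], min_text_len: int) -> str:
    """Hypothetical variant: a negative min_text_len forces the OCR result."""
    # Proposed behavior (assumption): skip the length comparison entirely.
    if min_text_len < 0:
        return run_ocr()
    # Otherwise keep the rule from the config comment.
    if len(embedded) >= min_text_len:
        return embedded
    ocr = run_ocr()
    return ocr if len(ocr) > len(embedded) else embedded


broken = "T H I S I S A T E S T"
accurate = "THIS IS A TEST"

forced = select_text_forced(broken, lambda: accurate, min_text_len=-1)
print(forced)
```

With the threshold set to -1, the shorter but accurate OCR text is returned regardless of the lengths involved.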

eikek added this to the Docspell 0.38.0 milestone Jul 7, 2022

eikek added a commit that referenced this issue Jul 7, 2022

Allow to always use OCR extracted text

d413b16

Fixes: #1628

eikek mentioned this issue Jul 7, 2022

Allow to always use OCR extracted text #1634

Merged

mergify bot closed this as completed in #1634 Jul 7, 2022
