According to the documentation, the following setting can be configured:
# For PDF files it is first tried to read the text parts of the
# PDF. But PDFs can be complex documents and they may contain text
# and images. If the returned text is shorter than the value
# below, OCR is run afterwards. Then both extracted texts are
# compared and the longer will be used.
DOCSPELL_JOEX_EXTRACTION_PDF_MIN__TEXT__LEN=500
It would be nice if a value like "-1" could be set to force Docspell's OCR result to be used.
My motivation:
I have plenty of PDF files which are already OCRed.
However, some of them were processed incorrectly or are not accurate enough.
Some were processed with the wrong language, have encoding errors, etc.
For example:
I have a few (already OCRed) PDF files that contain text like this:
"T H I S I S A T E S T" instead of "THIS IS A TEST"
Because of this, the text embedded in such a file will most likely exceed the joex length check, and it is always longer than the text that joex's OCR produces correctly (which is shorter, but more accurate).
It would be nice to be able to always force joex's OCR data to be used.
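To make the request concrete, here is a minimal sketch of the selection rule as the documentation comment describes it, plus the proposed "-1" behavior. This is not Docspell's actual code: the function `choose_text`, its parameters, and the handling of negative values are hypothetical and only illustrate the idea.

```python
from typing import Callable


def choose_text(embedded: str, run_ocr: Callable[[], str], min_text_len: int) -> str:
    """Pick the text for a PDF (hypothetical sketch, not Docspell's code)."""
    # Proposed behavior (does not exist today): a negative threshold
    # always forces the OCR result, ignoring the embedded text.
    if min_text_len < 0:
        return run_ocr()

    # Documented behavior: embedded text that reaches the threshold
    # is used directly and OCR is not run.
    if len(embedded) >= min_text_len:
        return embedded

    # Documented behavior: otherwise OCR is run and the longer text wins.
    ocr_text = run_ocr()
    return ocr_text if len(ocr_text) > len(embedded) else embedded


if __name__ == "__main__":
    embedded = "T H I S I S A T E S T"   # 21 characters, badly OCRed
    good_ocr = "THIS IS A TEST"          # 14 characters, accurate

    # With the threshold at 500, OCR runs, but the longer embedded text wins.
    print(choose_text(embedded, lambda: good_ocr, 500))
    # With the proposed -1, the OCR result is always used.
    print(choose_text(embedded, lambda: good_ocr, -1))
```

With the example from above, the spaced-out text has 21 characters and the accurate text has 14, so the "longer wins" comparison keeps the worse text no matter which threshold is chosen.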