tesseract english data is now a separate package #2479

programmerq · 2024-01-30T20:11:56Z

It looks like https://git.alpinelinux.org/aports/commit/community/tesseract-ocr?id=e1dc19b16f34ba3faeba489ea3412d3b3c67c12f split the English language data out into a separate package.

I noticed this error when trying to run OCR on a file where I had selected English:

Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'

In the docspell 0.40.0 joex image, the tesseract-ocr-5.2.0-r1 package includes eng.traineddata (English), equ.traineddata (math equation detection), and osd.traineddata (orientation and script detection). The tesseract-ocr-5.3.4-r0 package in the docspell 0.41.0 joex image doesn't include any of them.

I don't believe the osd/equ variants are used, so I didn't include them in the PR.
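For anyone who wants to confirm which data files an image actually ships, here is a minimal sketch. It assumes the data directory is /usr/share/tessdata, as in the error message above; the function name `missing_traineddata` is made up for illustration and is not part of docspell or tesseract.

```python
from pathlib import Path


def missing_traineddata(languages, tessdata_dir="/usr/share/tessdata"):
    """Return the language codes that have no <code>.traineddata file.

    tessdata_dir defaults to the path from the error message above;
    adjust it if TESSDATA_PREFIX points somewhere else (assumption).
    """
    base = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (base / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Check only the language that failed in the report above.
    missing = missing_traineddata(["eng"])
    if missing:
        print("missing traineddata for:", ", ".join(missing))
    else:
        print("all requested traineddata files are present")
```

Run inside the joex container, this should report `eng` as missing on an image without the separate data package and report nothing missing once that package is installed.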

eikek · 2024-01-30T20:53:02Z

Oh, thank you!! 🙏🏼

eikek · 2024-01-30T20:56:32Z

for reference #2374

tesseract english data is now a separate package

28141b6

eikek added the docker label (All things regarding docker setup) Jan 30, 2024

eikek merged commit c8cb8b0 into eikek:master Jan 30, 2024
4 of 5 checks passed

eikek added this to the Docspell 0.42.0 milestone Feb 7, 2024

eikek mentioned this pull request Feb 29, 2024

Ocrmypdf fails due to tesseract #2504

Closed

eikek added the fix label May 30, 2024
