tesseract english data is now a separate package #2479

Merged: 1 commit, Jan 30, 2024

programmerq
Contributor

It looks like https://git.alpinelinux.org/aports/commit/community/tesseract-ocr?id=e1dc19b16f34ba3faeba489ea3412d3b3c67c12f split the English language data out into a separate package.

I noticed this error when trying to run OCR on a file for which I had selected English:

Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'

In the docspell 0.40.0 joex image, the tesseract-ocr-5.2.0-r1 package includes eng.traineddata (English), equ.traineddata (math equation detection), and osd.traineddata (orientation and script detection). The tesseract-ocr-5.3.4-r0 package in the docspell 0.41.0 joex image includes none of them.

I don't believe the osd/equ variants are used, so I didn't include them in the PR.
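
For anyone who wants to confirm which language files an image actually ships, here is a minimal sketch. It assumes the tessdata directory named in the error message above (/usr/share/tessdata); the function name `missing_traineddata` and the language list are illustrative only and are not part of docspell or tesseract.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the languages that have no <lang>.traineddata file in tessdata_dir."""
    base = Path(tessdata_dir)
    return [
        lang for lang in languages
        if not (base / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Directory taken from the error message; the language list is an example.
    missing = missing_traineddata("/usr/share/tessdata", ["eng", "osd", "equ"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Run inside the container, this lists every requested language whose data file is absent, which would make the difference between the two image versions visible without triggering an OCR job.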

@eikek eikek added the docker All things regarding docker setup label Jan 30, 2024
@eikek
Owner

eikek commented Jan 30, 2024

Oh, thank you!! 🙏🏼

@eikek
Owner

eikek commented Jan 30, 2024

for reference #2374

@eikek eikek merged commit c8cb8b0 into eikek:master Jan 30, 2024
4 of 5 checks passed
@eikek eikek added this to the Docspell 0.42.0 milestone Feb 7, 2024
@eikek eikek added the fix label May 30, 2024