Upgrade to tesseract 5.0-alpha #206
Comments
According to the docs, 5.0-alpha is stable for production; only the API is still changing. Tested with ~7k images and 48 threads, OCR time decreased from 625s (4.1) to 430s (5.0), about 31% less, using the 'por' language. The number of hits for some common words changed by between -0.2% and +7%.
Since I am already dealing with Tesseract 5.0 while working on #515, I will try to evaluate whether upgrading is beneficial.
I ran a large test with 20K images/PDFs (collected from several cases, including only items with ocrCharCount >= 100), using the build I mentioned in #515 (latest release of version 5.0.0 with all image libraries included).
It's not ideal, but what I did was run some searches for common words with and without diacritics (para com de em até você nós não you they are to of) and compare the number of docs returned for each Tesseract version.
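The word-hit comparison described above could be sketched roughly as follows. This is only an illustration, not the procedure actually used in the test: it assumes the OCR text of each document is available as a plain string, and the helper names (`strip_diacritics`, `doc_hits`, `compare_versions`) are hypothetical.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'não' -> 'nao'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def doc_hits(docs, words, fold_diacritics=False):
    """Count, for each word, how many documents contain it as a whole word."""
    if fold_diacritics:
        norm = lambda s: strip_diacritics(s.lower())
    else:
        # Normalize to NFC so composed and decomposed accents compare equal.
        norm = lambda s: unicodedata.normalize("NFC", s.lower())
    normalized_docs = [norm(text) for text in docs]
    hits = {}
    for word in words:
        pattern = re.compile(r"\b" + re.escape(norm(word)) + r"\b")
        hits[word] = sum(1 for text in normalized_docs if pattern.search(text))
    return hits


def compare_versions(docs_old, docs_new, words):
    """Return {word: (old_hits, new_hits, percent_change or None)}."""
    old = doc_hits(docs_old, words)
    new = doc_hits(docs_new, words)
    result = {}
    for word in words:
        if old[word] == 0:
            change = None  # percent change is undefined without a baseline
        else:
            change = 100.0 * (new[word] - old[word]) / old[word]
        result[word] = (old[word], new[word], change)
    return result
```

Passing the OCR output of the same documents from each Tesseract version as `docs_old` and `docs_new` gives a per-word hit count and relative change. Like the original comparison, it is only a rough proxy for recognition quality.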
Tesseract 5.0.0 compiled for Windows.
Thank you @tc-wleite! What models did you use in your tests, tessdata_fast?
Yes, I used the same models we are already using, which are tessdata_fast, right?
If I remember correctly, they are tessdata_fast.
I downloaded the other models to run the test, and checked that we are indeed using the latest version from tessdata_fast.
Time spent on OCR for each kind of Tesseract model:
EDIT: The word comparison data was wrong, as some of the counts were made with the duplicate filter active and others weren't. Correct data is below, including new words suggested by @lfcnassif. The "best" model (from tessdata_best) is far too slow; it doesn't seem to be an option worth considering.
Thanks @tc-wleite! Maybe 2-letter words are not good choices for this comparison, sorry for suggesting them... If you have time, could you remove them and add a few more words to this last comparison? I thought about: este, esse, seu, uma, que, são, tem, mas
Thanks @tc-wleite! Let's keep fast models then.
@lfcnassif, I found a critical issue with the version I compiled.
Thanks for the warning. Does it belong to the MS Visual C++ Redistributable Package? We include the 2015 version DLLs in tools/tsk/x64, maybe a similar approach could be used...
Yes, it is a DLL from the MS Visual C++ Redistributable Package, but the 2019 version. When I built this new version, I noticed that this issue could happen, but I ended up forgetting to check it.
Using a tool that recursively inspects Windows executables to find dependencies, I found out that at least these DLLs (part of the MS Visual C++ Redistributable Package 2019) are used by the tesseract.exe I am using:
Testing on the machine where I detected the problem, putting these DLLs in the same folder as tesseract.exe was enough.
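A quick way to verify that the required runtime DLLs really sit next to the executable could look like the sketch below. It is an illustration only: `missing_dlls` is a hypothetical helper, and the DLL names must be supplied by the caller, since the actual list is the one found by the dependency inspection mentioned above.

```python
from pathlib import Path


def missing_dlls(exe_path, required_dlls):
    """Return the required DLL names that are not in the executable's folder.

    Names are compared case-insensitively, as Windows file names are.
    """
    folder = Path(exe_path).parent
    if folder.is_dir():
        present = {p.name.lower() for p in folder.iterdir() if p.is_file()}
    else:
        present = set()
    return [name for name in required_dlls if name.lower() not in present]
```

An empty result means every listed DLL is in the same folder as tesseract.exe. It does not prove the executable will start, because other dependencies may still be resolved from elsewhere on the system.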
Thanks @tc-wleite. If you find any issue, please let me know.
Sure! By the way, I found an option in the makefile to statically link the MSVC runtime libraries.
Yes.
It contains fixes, and OpenMP support is now disabled by default (as we already do in our Windows build), which results in large speedups for apps that are already multithreaded.