
Upgrade to tesseract 5.0-alpha #206

Closed
lfcnassif opened this issue Jul 18, 2020 · 20 comments
@lfcnassif
Member

lfcnassif commented Jul 18, 2020

It contains fixes, and they disabled OpenMP support by default (something we had already done in our Windows build), which results in great speed-ups for applications that are already multithreaded.
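
For context, when a Tesseract build still has OpenMP enabled, a caller that already parallelizes OCR at the process level can get a similar effect by capping OpenMP through the standard OMP_THREAD_LIMIT environment variable. A minimal sketch, assuming a hypothetical tesseract.exe location and a Java caller (not IPED's actual code):

```java
import java.io.IOException;

public class TesseractRunner {
    // Hypothetical path; adjust to the actual tool layout.
    private static final String TESSERACT_EXE = "tools/tesseract/tesseract.exe";

    public static void runOcr(String imagePath, String outputBase)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(TESSERACT_EXE, imagePath, outputBase, "-l", "por");
        // Limit OpenMP to a single thread so Tesseract does not oversubscribe the CPU
        // when the application already runs many OCR workers in parallel.
        pb.environment().put("OMP_THREAD_LIMIT", "1");
        pb.inheritIO();
        pb.start().waitFor();
    }
}
```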

@lfcnassif lfcnassif changed the title Update to tesseract 4.1.1 Upgrade to tesseract 4.1.1 Jul 19, 2020
@lfcnassif lfcnassif changed the title Upgrade to tesseract 4.1.1 Upgrade to tesseract 5.0-alpha Feb 18, 2021
@lfcnassif
Member Author

lfcnassif commented Feb 18, 2021

According to the docs, 5.0-alpha is stable for production; only the API is still being changed.

Tested with ~7k images and 48 threads, OCR time decreased from 625s (4.1) to 430s (5.0) using the 'por' language.

The number of hits for some common words changed within a range of -0.2% to +7%.

@wladimirleite
Member

While working on #515 I am already dealing with Tesseract 5.0, so I will try to evaluate whether upgrading is beneficial.

@wladimirleite
Member

I ran a large test with 20K images/PDFs (collected from several cases, including only items with ocrCharCount >= 100), using the build I mentioned in #515 (latest release of version 5.0.0, with all image libraries included).
OCR time decreased from 1655s to 1256s.
I also used the 'por' language, 48 threads and IPED 3.18.6.
Regarding the recognition results, inspecting the recognized text from each version side by side for many images, I noticed that the results are very similar. There are differences, though. In some cases, just blank lines or line breaks changed. In the cases where the text itself changed, it was usually a previously ignored word that appears with the new version (or the opposite).
I don't have the ground truth for these documents, so I can't give a precise evaluation, but from the images I inspected, the new version (5) results were slightly better.
I will run more tests when #515 is working, but so far upgrading seems to be a great idea, as it will clearly reduce processing time.

@lfcnassif
Member Author

I don't have the ground truth for these documents, so I can't give a precise evaluation, but from the images I inspected, the new version (5) results were slightly better.

It's not ideal, but what I have done is run some searches for common words with and without diacritics (para com de em até você nós não you they are to of) and compare the number of docs returned for each Tesseract version.
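
To illustrate this kind of comparison, a minimal sketch (not the actual IPED search, which goes through its index; the one-text-file-per-document layout is an assumption, and the words are taken from the list above) that counts how many documents contain each word, so the totals from two Tesseract versions can be compared:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordHitCounter {

    public static void main(String[] args) throws IOException {
        // Hypothetical folder with one extracted-text file per document,
        // produced by a given Tesseract version.
        Path ocrOutputDir = Paths.get(args[0]);
        List<String> words = Arrays.asList("para", "com", "até", "você", "nós", "não",
                "you", "they", "are");

        // Whole-word, case-insensitive matching (Unicode-aware for accented words).
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        for (String w : words) {
            patterns.put(w, Pattern.compile("\\b" + Pattern.quote(w) + "\\b",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS));
        }

        Map<String, Integer> docHits = new LinkedHashMap<>();
        words.forEach(w -> docHits.put(w, 0));

        List<Path> docs;
        try (Stream<Path> stream = Files.walk(ocrOutputDir)) {
            docs = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path f : docs) {
            String text = new String(Files.readAllBytes(f), StandardCharsets.UTF_8);
            for (String w : words) {
                // Count documents (not occurrences) that contain the word at least once.
                if (patterns.get(w).matcher(text).find()) {
                    docHits.merge(w, 1, Integer::sum);
                }
            }
        }
        docHits.forEach((w, n) -> System.out.println(w + "\t" + n));
    }
}
```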

@wladimirleite
Member

Tesseract 5.0.0 compiled for Windows.
tesseract.zip

@lfcnassif
Member Author

Thank you @tc-wleite! What models did you use in your tests, tessdata_fast?

@wladimirleite
Member

Thank you @tc-wleite! What models did you use in your tests, tessdata_fast?

Yes, I used the same models we are already using, which are tessdata_fast, right?
As these models were released a few years ago, I did check whether there were newer models available, but the ones we are using are still the latest.
Regarding "fast", I saw that there are regular models ("tessdata") and larger ones ("tessdata_best").
I will repeat the last test I ran (with ~50K images/PDFs) with these models and see how it goes.
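
A minimal sketch of how such a comparison could be scripted, pointing Tesseract at each model set via its --tessdata-dir option; the directory names and image folder are assumptions, not IPED's actual harness:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ModelBenchmark {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical locations of the three model sets and of the test images.
        Map<String, String> modelDirs = new LinkedHashMap<>();
        modelDirs.put("fast", "models/tessdata_fast");
        modelDirs.put("regular", "models/tessdata");
        modelDirs.put("best", "models/tessdata_best");

        List<Path> images;
        try (Stream<Path> stream = Files.list(Paths.get("test-images"))) {
            images = stream.collect(Collectors.toList());
        }

        for (Map.Entry<String, String> e : modelDirs.entrySet()) {
            Path outDir = Files.createDirectories(Paths.get("out-" + e.getKey()));
            long start = System.currentTimeMillis();
            for (Path img : images) {
                String outBase = outDir.resolve(img.getFileName().toString()).toString();
                // --tessdata-dir selects which model set is loaded; -l por selects the language.
                ProcessBuilder pb = new ProcessBuilder("tesseract", img.toString(), outBase,
                        "--tessdata-dir", e.getValue(), "-l", "por");
                pb.inheritIO();
                pb.start().waitFor();
            }
            System.out.printf("%s: %.1f s%n", e.getKey(),
                    (System.currentTimeMillis() - start) / 1000.0);
        }
    }
}
```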

@lfcnassif
Member Author

If I remember correctly, they are tessdata_fast.

@wladimirleite
Member

I downloaded the other models to run the test, and checked that we are indeed using the latest version from tessdata_fast.

@wladimirleite
Member

wladimirleite commented Jun 11, 2021

Time spent on OCR for each kind of Tesseract model:

           FAST  REGULAR   BEST
--------  -----  -------  -----
Time (s)   2560     3022   6292

EDIT: The word comparison data was wrong, as some of the counts were made with the duplicate filter active and others weren't. Correct data is below, including new words suggested by @lfcnassif.

The "best" model (from tessdata_best) is way too slow, it doesn't seem an option to be considered.
The "regular" (from tessdata) increased the OCR time by ~20% (comparing to "fast"), but it is not clear if the results are better.
I would keep the current models ("fast") for now.

@lfcnassif
Member Author

Thanks @tc-wleite! Maybe 2-letter words are not good choices for this comparison, sorry for suggesting them... If you have time, could you remove them and add a few more words to this last comparison? I thought about:

este, esse, seu, uma, que, são, tem, mas
the, that, this, from, and, with, was, not

@wladimirleite
Member

        FAST REGULAR  BEST
------ ----- ------- -----
para    4398    4338  4437
com     5413    5330  5447
até     1345    1293  1340
você     660     652   665
nós     1308    1263  1319
não     2683    2617  2710
you      180     180   187
they      17      16    17
are      129      95   136
este     916     881   924
esse     231     192   241
seu     1110    1092  1108
uma      974     947   978
que     2666    2627  2643
são     2736    2669  2771
tem      607     577   612
mas      239     208   258
the      378     380   388
that      77      77    79
this     142     137   144
from     125     124   128
and      548     535   553
with     113     108   119
was       29      24    30
not      309     306   316
------ ----- ------- -----
TOTAL  27333   26668 27550

@lfcnassif
Member Author

Thanks @tc-wleite! Let's keep fast models then.

@wladimirleite
Member

@lfcnassif, I found a critical issue with the version I compiled.
It depends on a Visual Studio runtime DLL, which is not always present on machines and cannot be statically linked.
I will try to figure out a solution.

@lfcnassif
Member Author

Thanks for the warning. Does it belong to the MS Visual C++ Redistributable Package? We include the 2015 version DLLs in tools/tsk/x64; maybe a similar approach could be used...

@wladimirleite
Member

Yes, it is a DLL from the MS Visual C++ Redistributable Package, but the 2019 version.
I guess the same approach should work.
I will try to isolate which files are necessary.

When I built this new version, I noticed that this issue could happen, but I ended up forgetting to check it.
Today I tried to run it on another machine, for other reasons, and got the error.

@wladimirleite
Member

Using a tool that recursively inspects Windows executables to find dependencies, I found out that at least these DLLs (part of the MS Visual C++ Redistributable Package 2019) are used by the tesseract.exe I am using:

  • msvcp140.dll
  • vcruntime140.dll
  • vcruntime140_1.dll

vs2019-tesseract-dlls.zip

Testing on the machine where I detected the problem, putting these DLLs in the same folder as tesseract.exe was enough.
I will try to test on other machines to make sure nothing else is missing.
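
For reference, a minimal defensive check (not IPED's actual code; the tesseract.exe path is hypothetical) that verifies the three DLLs listed above are present next to the executable before OCR starts. Note it ignores the case where the runtime is already installed system-wide:

```java
import java.nio.file.*;

public class RuntimeDllCheck {

    // DLLs from the MS Visual C++ Redistributable 2019 required by this tesseract.exe build.
    private static final String[] REQUIRED_DLLS = {
            "msvcp140.dll", "vcruntime140.dll", "vcruntime140_1.dll"
    };

    public static void checkRuntimeDlls(Path tesseractExe) {
        Path dir = tesseractExe.getParent();
        for (String dll : REQUIRED_DLLS) {
            if (!Files.exists(dir.resolve(dll))) {
                throw new IllegalStateException("Missing runtime library next to tesseract.exe: " + dll);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical path; adjust to the actual tool layout.
        checkRuntimeDlls(Paths.get("tools", "tesseract", "tesseract.exe"));
        System.out.println("All required runtime DLLs found.");
    }
}
```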

@lfcnassif
Member Author

Thanks @tc-wleite. If you find any issue, please let me know.

@wladimirleite
Member

wladimirleite commented Jun 22, 2021

Thanks @tc-wleite. If you find any issue, please let me know.

Sure!
You included the DLLs, right?

By the way, I found an option in the makefile to statically link the MSVC runtime libraries.
But it has an "if condition" that enables it only for older versions. When I tried to override this option, it seemed like it would work, but it eventually failed when linking against the image libraries (which were built without this option, and mixing is not allowed).
I had to give up and stick with the previous build, putting the 3 DLLs in the same folder.

@lfcnassif
Member Author

You included the DLLs, right?

yes.
