
Upgrade to tesseract 5.0-alpha #206

Closed
lfcnassif opened this issue Jul 18, 2020 · 20 comments
@lfcnassif
Member

lfcnassif commented Jul 18, 2020

It contains fixes, and they disabled OpenMP support by default (something we had already done in our Windows build), which results in great speed-ups for applications that are already multithreaded.
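
For context, when a Tesseract build still has OpenMP enabled, a caller that already parallelizes OCR at the process level can get a similar effect by capping OpenMP through the standard OMP_THREAD_LIMIT environment variable. A minimal sketch, assuming a hypothetical tesseract.exe location and a Java caller (not IPED's actual code):

```java
import java.io.IOException;

public class TesseractRunner {
    // Hypothetical path; adjust to the actual tool layout.
    private static final String TESSERACT_EXE = "tools/tesseract/tesseract.exe";

    public static void runOcr(String imagePath, String outputBase)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(TESSERACT_EXE, imagePath, outputBase, "-l", "por");
        // Limit OpenMP to a single thread so Tesseract does not oversubscribe the CPU
        // when the application already runs many OCR workers in parallel.
        pb.environment().put("OMP_THREAD_LIMIT", "1");
        pb.inheritIO();
        pb.start().waitFor();
    }
}
```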

@lfcnassif lfcnassif changed the title Update to tesseract 4.1.1 Upgrade to tesseract 4.1.1 Jul 19, 2020
@lfcnassif lfcnassif changed the title Upgrade to tesseract 4.1.1 Upgrade to tesseract 5.0-alpha Feb 18, 2021
@lfcnassif
Member Author

lfcnassif commented Feb 18, 2021

According to the docs, 5.0-alpha is stable for production; only the API is still being changed.

Tested with ~7k images and 48 threads, OCR time decreased from 625s (4.1) to 430s (5.0) using the 'por' language.

The number of hits for some common words changed within a range of -0.2% to +7%.

@wladimirleite
Member

While working on #515 I am already dealing with Tesseract 5.0, so I will try to evaluate whether upgrading is beneficial.

@wladimirleite
Member

I ran a large test with 20K images/PDFs (collected from several cases, including only items with ocrCharCount >= 100), using the build I mentioned in #515 (latest release of version 5.0.0, with all image libraries included).
OCR time decreased from 1655s to 1256s.
I also used the 'por' language, 48 threads and IPED 3.18.6.
Regarding the recognition results, inspecting the recognized text from each version side by side for many images, I noticed that the results are very similar. There are differences, though. In some cases, just blank lines or line breaks changed. In the cases where the text itself changed, it was usually a previously ignored word that appears with the new version (or the opposite).
I don't have the ground truth for these documents, so I can't give a precise evaluation, but from the images I inspected, the new version (5) results were slightly better.
I will run more tests when #515 is working, but so far upgrading seems to be a great idea, as it will clearly reduce processing time.

@lfcnassif
Member Author

I don't have the ground truth for these documents, so I can't give a precise evaluation, but from the images I inspected, the new version (5) results were slightly better.

It's not ideal, but what I have done is run some searches for common words with and without diacritics (para com de em até você nós não you they are to of) and compare the number of docs returned for each Tesseract version.
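
To illustrate this kind of comparison, a minimal sketch (not the actual IPED search, which goes through its index; the one-text-file-per-document layout is an assumption, and the words are taken from the list above) that counts how many documents contain each word, so the totals from two Tesseract versions can be compared:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordHitCounter {

    public static void main(String[] args) throws IOException {
        // Hypothetical folder with one extracted-text file per document,
        // produced by a given Tesseract version.
        Path ocrOutputDir = Paths.get(args[0]);
        List<String> words = Arrays.asList("para", "com", "até", "você", "nós", "não",
                "you", "they", "are");

        // Whole-word, case-insensitive matching (Unicode-aware for accented words).
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        for (String w : words) {
            patterns.put(w, Pattern.compile("\\b" + Pattern.quote(w) + "\\b",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS));
        }

        Map<String, Integer> docHits = new LinkedHashMap<>();
        words.forEach(w -> docHits.put(w, 0));

        List<Path> docs;
        try (Stream<Path> stream = Files.walk(ocrOutputDir)) {
            docs = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path f : docs) {
            String text = new String(Files.readAllBytes(f), StandardCharsets.UTF_8);
            for (String w : words) {
                // Count documents (not occurrences) that contain the word at least once.
                if (patterns.get(w).matcher(text).find()) {
                    docHits.merge(w, 1, Integer::sum);
                }
            }
        }
        docHits.forEach((w, n) -> System.out.println(w + "\t" + n));
    }
}
```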

@wladimirleite
Member

Tesseract 5.0.0 compiled for Windows.
tesseract.zip

@lfcnassif
Member Author

Thank you @tc-wleite! What models did you use in your tests, tessdata_fast?

@wladimirleite
Member

Thank you @tc-wleite! What models did you use in your tests, tessdata_fast?

Yes, I used the same models we are already using, which are tessdata_fast, right?
As these models were released a few years ago, I did check whether there were newer models available, but the ones we are using are still the latest.
Regarding "fast", I saw that there are regular models ("tessdata") and larger ones ("tessdata_best").
I will repeat the last test I ran (with ~50K images/PDFs) with these models and see how it goes.
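
A minimal sketch of how such a comparison could be scripted, pointing Tesseract at each model set via its --tessdata-dir option; the directory names and image folder are assumptions, not IPED's actual harness:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ModelBenchmark {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical locations of the three model sets and of the test images.
        Map<String, String> modelDirs = new LinkedHashMap<>();
        modelDirs.put("fast", "models/tessdata_fast");
        modelDirs.put("regular", "models/tessdata");
        modelDirs.put("best", "models/tessdata_best");

        List<Path> images;
        try (Stream<Path> stream = Files.list(Paths.get("test-images"))) {
            images = stream.collect(Collectors.toList());
        }

        for (Map.Entry<String, String> e : modelDirs.entrySet()) {
            Path outDir = Files.createDirectories(Paths.get("out-" + e.getKey()));
            long start = System.currentTimeMillis();
            for (Path img : images) {
                String outBase = outDir.resolve(img.getFileName().toString()).toString();
                // --tessdata-dir selects which model set is loaded; -l por selects the language.
                ProcessBuilder pb = new ProcessBuilder("tesseract", img.toString(), outBase,
                        "--tessdata-dir", e.getValue(), "-l", "por");
                pb.inheritIO();
                pb.start().waitFor();
            }
            System.out.printf("%s: %.1f s%n", e.getKey(),
                    (System.currentTimeMillis() - start) / 1000.0);
        }
    }
}
```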

@lfcnassif
Member Author

If I remember correctly, they are tessdata_fast.

@wladimirleite
Member

I downloaded the other models to run the test, and checked that we are indeed using the latest version from tessdata_fast.

@wladimirleite
Member

wladimirleite commented Jun 11, 2021

Time spent on OCR for each kind of Tesseract model:

           FAST  REGULAR   BEST
--------  -----  -------  -----
Time (s)   2560     3022   6292

EDIT: The word comparison data was wrong, as some of the counts were made with the duplicate filter active and others weren't. Correct data is below, including new words suggested by @lfcnassif.

The "best" model (from tessdata_best) is way too slow, it doesn't seem an option to be considered.
The "regular" (from tessdata) increased the OCR time by ~20% (comparing to "fast"), but it is not clear if the results are better.
I would keep the current models ("fast") for now.

@lfcnassif
Member Author

Thanks @tc-wleite! Maybe 2-letter words are not good choices for this comparison, sorry for suggesting them... If you have time, could you remove them and add a few more words to this last comparison? I thought about:

este, esse, seu, uma, que, são, tem, mas
the, that, this, from, and, with, was, not

@wladimirleite
Member

        FAST REGULAR  BEST
------ ----- ------- -----
para    4398    4338  4437
com     5413    5330  5447
até     1345    1293  1340
você     660     652   665
nós     1308    1263  1319
não     2683    2617  2710
you      180     180   187
they      17      16    17
are      129      95   136
este     916     881   924
esse     231     192   241
seu     1110    1092  1108
uma      974     947   978
que     2666    2627  2643
são     2736    2669  2771
tem      607     577   612
mas      239     208   258
the      378     380   388
that      77      77    79
this     142     137   144
from     125     124   128
and      548     535   553
with     113     108   119
was       29      24    30
not      309     306   316
------ ----- ------- -----
TOTAL  27333   26668 27550

@lfcnassif
Member Author

Thanks @tc-wleite! Let's keep fast models then.

@wladimirleite
Member

@lfcnassif, I found a critical issue with the version I compiled.
It depends on a Visual Studio runtime DLL, which is not always present on machines and cannot be statically linked.
I will try to figure out a solution.

@lfcnassif
Member Author

Thanks for the warning. Does it belong to the MS Visual C++ Redistributable Package? We include the 2015 version DLLs in tools/tsk/x64; maybe a similar approach could be used...

@wladimirleite
Member

Yes, it is a DLL from the MS Visual C++ Redistributable Package, but the 2019 version.
I guess the same approach should work.
I will try to isolate which files are necessary.

When I built this new version, I noticed that this issue could happen, but I ended up forgetting to check it.
Today I tried to run it on another machine, for other reasons, and got the error.

@wladimirleite
Member

Using a tool that recursively inspects Windows executables to find dependencies, I found out that at least these DLLs (part of the MS Visual C++ Redistributable Package 2019) are used by the tesseract.exe I am using:

  • msvcp140.dll
  • vcruntime140.dll
  • vcruntime140_1.dll

vs2019-tesseract-dlls.zip

Testing on the machine where I detected the problem, putting these DLLs in the same folder as tesseract.exe was enough.
I will try to test on other machines to make sure nothing else is missing.
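
For reference, a minimal defensive check (not IPED's actual code; the tesseract.exe path is hypothetical) that verifies the three DLLs listed above are present next to the executable before OCR starts. Note it ignores the case where the runtime is already installed system-wide:

```java
import java.nio.file.*;

public class RuntimeDllCheck {

    // DLLs from the MS Visual C++ Redistributable 2019 required by this tesseract.exe build.
    private static final String[] REQUIRED_DLLS = {
            "msvcp140.dll", "vcruntime140.dll", "vcruntime140_1.dll"
    };

    public static void checkRuntimeDlls(Path tesseractExe) {
        Path dir = tesseractExe.getParent();
        for (String dll : REQUIRED_DLLS) {
            if (!Files.exists(dir.resolve(dll))) {
                throw new IllegalStateException("Missing runtime library next to tesseract.exe: " + dll);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical path; adjust to the actual tool layout.
        checkRuntimeDlls(Paths.get("tools", "tesseract", "tesseract.exe"));
        System.out.println("All required runtime DLLs found.");
    }
}
```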

@lfcnassif
Member Author

Thanks @tc-wleite. If you find any issue, please let me know.

@wladimirleite
Member

wladimirleite commented Jun 22, 2021

Thanks @tc-wleite. If you find any issue, please let me know.

Sure!
You included the DLLs, right?

By the way, I found an option in the makefile to statically link the MSVC runtime libraries.
But it has an "if condition" that enables it only for older versions. When I tried to override this option, it seemed like it would work, but it eventually failed when linking against the image libraries (which were built without this option, and mixing is not allowed).
I had to give up and stick with the previous build, putting the 3 DLLs in the same folder.

@lfcnassif
Member Author

You included the DLLs, right?

yes.
