move to AlternativeImage feature selectors in OCR-D/core#294: #75
Conversation
- all: use second output position as fileGrp USE to produce AlternativeImage
- all: get rid of MetadataItem/Labels-related FIXME: with the updated PAGE model, we can now use @externalModel and @externalId
- all: use OcrdExif.resolution instead of xResolution
- all: create images with monotonically growing @comments (features)
- crop: use ocrd_utils.crop_image instead of PIL.Image.crop
- crop: fix bug when trying to access page_image if there are already region coordinates that we are ignoring
- crop: filter out images already deskewed and cropped (we must crop ourselves, and deskewing cannot happen until afterwards)
- deskew: fix bugs in configuration-dependent corner cases related to whether deskewing has already been applied (on the page or region level):
  - for the page image, never use images already rotated (both for page-level and region-level processing; but for the region level, do rotate images ad hoc if @orientation is present on the page level)
  - for the region image, never use images already rotated (except for our own page-level rotation)
- segment-region: add the missing "cropped" feature when producing cropped images
Also fixes #61.
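To illustrate the "monotonically growing @comments (features)" mechanism above, here is a minimal, self-contained sketch (not the actual OCR-D core API) of how an AlternativeImage can be selected by required and forbidden features. Since feature lists only grow, the last matching image is the most derived one. All names here are hypothetical:

```python
# Hypothetical sketch of AlternativeImage feature selection:
# pick the most derived image whose @comments contain every required
# feature and none of the forbidden ones.

def select_alternative_image(images, feature_selector="", feature_filter=""):
    """images: list of (filename, comments) pairs, where comments is a
    comma-separated feature string such as "binarized,deskewed"."""
    required = set(f for f in feature_selector.split(",") if f)
    forbidden = set(f for f in feature_filter.split(",") if f)
    best = None
    for filename, comments in images:
        features = set(f for f in comments.split(",") if f)
        if required <= features and not (forbidden & features):
            best = filename  # later images supersede earlier ones
    return best

images = [
    ("page.png", ""),
    ("page_bin.png", "binarized"),
    ("page_bin_deskewed.png", "binarized,deskewed"),
]
# binarized but not yet deskewed (e.g. what a deskewer would ask for):
print(select_alternative_image(images, feature_selector="binarized",
                               feature_filter="deskewed"))
# → page_bin.png
```

The real implementation additionally walks up from segment to page level and can derive missing images on the fly, but the selection principle is the same.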
Codecov Report

```
@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
- Coverage   47.81%   46.99%   -0.83%
==========================================
  Files           8        8
  Lines         688      715      +27
  Branches      130      134       +4
==========================================
+ Hits          329      336       +7
- Misses        326      346      +20
  Partials       33       33
```

Continue to review the full report at Codecov.
Only the typo fixes are a must-have. I am not sure about the multiple-fileGrp logistics.
Technically, the changes proposed here carry over to OCRopus and `anybaseocr`. Does it make sense to add abstract wrappers to core for the single processing steps (i.e. a `ProcessorCrop`) from which the module project implementations could derive?
Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the
```diff
@@ -16,7 +16,7 @@
 setup(
     name='ocrd_tesserocr',
-    version='0.4.0',
+    version='0.4.1',
```
Better to do that in explicit version bumping commits.
Understood. @kba, should I revert this in the PR? Do you want to do the merge and the versioning commit along with the new release yourself?
Yeah, that would be tricky without changing the API and breaking existing code. It's a neat idea though, and I wish we had a wrapper around the (e.g. requiring processors to implement a protected
This reverts commit 55d1d87. (Current Tesseract LSTM models all expect to see binarized images, because they were trained on such data. This may, however, change in the future.)
The last 2 commits explained: I noticed that Tesseract internally uses 8-bit grayscale as input for the LSTM models instead of its own Otsu binarization. So I gathered its binarization is only needed for the non-LSTM models and for layout analysis, and therefore added

Anyway, the revert is necessary to meet the expectations of current models, but the original commit could be re-activated once we have a different training procedure!
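For reference, Tesseract's internal thresholding is Otsu's method: pick the gray level that maximizes the between-class variance of the histogram. A minimal pure-Python sketch (not Tesseract's actual implementation, which works per image tile) of that idea:

```python
# Minimal Otsu-threshold sketch on a list of 8-bit grayscale pixels:
# choose the threshold t that maximizes between-class variance.

def otsu_threshold(pixels):
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0.0
    for t in range(256):
        w_bg += hist[t]          # background weight: pixels <= t
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # foreground weight: pixels > t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (total_sum - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# bimodal toy image: dark text pixels around 30, light background around 220
pixels = [30] * 40 + [220] * 60
t = otsu_threshold(pixels)
binarized = [0 if p <= t else 255 for p in pixels]
```

The point of the revert is that this thresholding step is skipped for the LSTM feed, which instead sees the grayscale values directly.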
Here are my CER measurements (in percent) on 2 GT bags with textual annotation, in a workflow configuration similar to this (i.e. including Olena Wolf or Ocropy nlbin-nrm binarization, Ocropy page-level deskewing, clipping and resegmentation, and dewarping).
On
Results are similar in tendency for
So Tesseract gets perplexed …
It is striking that both the stock and the GT4HistOCR model perform so poorly! A CER between 7 and 9% is (a) simply not good enough and (b) way below the numbers @stweil reported at the OCR-D developer workshop.
True! That was also one of my messages at the workshop. Generally, I am quite certain this is due to the relatively bad quality of our GT:
Despite all the preprocessing and resegmentation efforts, we are not able to squeeze less than 11% CER out of the whole dataset with the stock models. And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT. But maybe @stweil wants to disprove that?

GT4HistOCR also looks much cleaner. If really all they did for preprocessing was running

So, I think the situation demands:
@bertsky @stweil So our impression that GT4HistOCR is suboptimal for OCR training is real.
I am with you. But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in
@stweil The above link should be workable on the current versions. But it's probably best to use the PRs I have used: OCR-D/core#311 and #75 and cisocrgroup/ocrd_cis#16 and master cor-asv-ann (for evaluation). CER measurement in cor-asv-ann-evaluate works as documented: vanilla Levenshtein, no normalization. This is best for comparability; results might look better (and fairer) with different metrics and normalizations. The package offers some, but results look quite similar with other settings, even
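The metric described here (vanilla Levenshtein, no normalization) can be sketched as follows; this is an illustrative reimplementation, not the cor-asv-ann-evaluate code:

```python
# CER as plain character-level Levenshtein distance divided by the
# ground-truth length (no normalization), per the comment above.

def levenshtein(a, b):
    # classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(gt, ocr):
    return levenshtein(gt, ocr) / len(gt)

print(round(100 * cer("Beispiel", "Beispìel"), 2))  # one substitution in 8 chars
# → 12.5
```

Note that with this definition, Unicode normalization (e.g. NFC vs NFD of combining accents) can shift the numbers, which is one reason normalized variants exist.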
I agree, both degradation and binarization should be employed to make GT4HistOCR models robust.
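As a toy illustration of the degradation idea, here is a seeded salt-and-pepper noise sketch on a flat pixel list; a real augmentation pipeline would of course operate on images with a dedicated tool (e.g. ocrodeg), so all names here are hypothetical:

```python
import random

# Hypothetical degradation step for training-data augmentation:
# flip a fraction of pixels to black (pepper) or white (salt).

def salt_and_pepper(pixels, amount=0.05, rng=None):
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = list(pixels)
    for i in range(len(out)):
        r = rng.random()
        if r < amount / 2:
            out[i] = 0        # pepper
        elif r < amount:
            out[i] = 255      # salt
    return out

noisy = salt_and_pepper([128] * 1000, amount=0.1)
```

Training on such degraded (and separately binarized) variants of the same lines is one way to make the resulting models robust to both kinds of input.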
@bertsky Is there a PR in core which blocks merging here?
@stweil Is tesseract-ocr/tesstrain#73 the right place for this? Or better open a new issue strictly about binarization/augmentation (not specific to GT4HistOCR)? BTW, according to my measurements,
No, not really. OCR-D/core#311 is related, but not a dependency. Thanks, I will merge this for now.