Feature request: rearrange paragraphs for minimum difference #32

jbarth-ubhd · 2020-10-06T12:05:34Z

It would be nice if dinglehopper could try to rearrange paragraphs so that a wrong segmentation order (perhaps not so important for full-text search) could be ignored.

mikegerber · 2020-10-08T13:00:31Z

I'm leaning towards providing the UWER (unordered word error rate) in dinglehopper to resolve this.

Thoughts:

I don't think a layout analysis feature, which is what reordering the paragraphs amounts to, would be appropriate here in an evaluation tool. If there's a simple algorithm that solves most issues, there should be a separate tool to do this in the OCR-D community.
Just trying all permutations of paragraphs is IMHO no good, as this would be on the order of O(m!) for m paragraphs.
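
To make the O(m!) concern concrete, here is a minimal Python sketch of the brute-force approach. It is not part of dinglehopper; `levenshtein` and `best_order_brute_force` are hypothetical helpers written only for this illustration, and paragraphs are assumed to be joined with a newline.

```python
from itertools import permutations
from math import factorial


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete a character from a
                curr[j - 1] + 1,            # insert a character into a
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def best_order_brute_force(gt: str, ocr_paragraphs: list) -> tuple:
    """Try every paragraph order and keep the one closest to the ground truth.

    This evaluates m! candidate orders for m paragraphs.
    """
    candidates = (
        (levenshtein(gt, "\n".join(order)), order)
        for order in permutations(ocr_paragraphs)
    )
    return min(candidates, key=lambda pair: pair[0])


gt = "first\nsecond\nthird"
distance, order = best_order_brute_force(gt, ["third", "first", "second"])
print(distance, order)

# Number of orders that would have to be scored for m paragraphs.
for m in (5, 10, 15):
    print(m, factorial(m))
```

Three paragraphs need only 6 comparisons, but each additional paragraph multiplies the work by the new paragraph count, and every comparison is itself a full edit-distance computation.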

cneud · 2020-10-28T23:26:26Z

I must agree: calculating reliable accuracy rates when the segmentation order is wrong is beyond the possibilities of dinglehopper. The sheer number of possible segmentation classes/errors escalates way too quickly!

As always when it comes to the topic of evaluation, the PRImA group have some good publications about this, e.g. "The Significance of Reading Order in Document Recognition and its Evaluation" and "Scenario Driven In-Depth Performance Evaluation of Document Layout Analysis Methods".

The typical solution for this adopted in other evaluation tools is to include the Bag-of-words (BOW) metric, which is easy to compute and could probably be supported by dinglehopper too.
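
As a rough illustration of why a bag-of-words comparison ignores reading order, here is a minimal Python sketch. It is an assumption-laden example, not dinglehopper's or any other tool's implementation: `bag_of_words_diff` is a hypothetical helper, words are split on whitespace only, and the final rate is just one plausible way to normalize the counts.

```python
from collections import Counter


def bag_of_words_diff(gt: str, ocr: str) -> tuple:
    """Compare two texts as multisets of words, ignoring word order.

    Returns (missing, spurious): GT words not found in the OCR text,
    and OCR words not found in the GT.
    """
    gt_bag = Counter(gt.split())
    ocr_bag = Counter(ocr.split())
    # Counter subtraction keeps only positive counts.
    missing = sum((gt_bag - ocr_bag).values())
    spurious = sum((ocr_bag - gt_bag).values())
    return missing, spurious


# Columns read in the wrong order, plus one real recognition error ("test").
gt = "left column text right column text"
ocr = "right column text left column test"

missing, spurious = bag_of_words_diff(gt, ocr)
# Assumption: treat a missing/spurious pair as one error, relative to GT length.
rate = max(missing, spurious) / len(gt.split())
print(missing, spurious, rate)
```

The swapped columns contribute nothing to the result; only the misrecognized word is counted.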

An interesting more recent addition in the scientific community is the Flexible character accuracy measure.

b2m · 2020-11-04T16:26:26Z

> An interesting more recent addition in the scientific community is the Flexible character accuracy measure.

As I have a similar problem and need a solution, I will try to integrate the Flexible Character Accuracy as an option for dinglehopper.

jbarth-ubhd · 2021-02-17T12:50:40Z

Flew through this paper. Does compare strings of GT with substrings of OCR (in case of erroneously joined columns).

(I assume the "equal-length distance" editDist(..., substr(..., t2.length)) is because of runtime considerations, but in theory this does not need to be the same length; I would suggest word boundaries.)

I still think such a flexible comparison is essential, before running OCR-D in production, to verify the workflows in use.

b2m · 2021-02-17T13:21:27Z

@jbarth-ubhd

> Flew through this paper. Does compare strings of GT with substrings of OCR (in case of erroneously joined columns).

Simplified explanation: FCA compares a line from GT with all lines from OCR and either finds a satisfying match or splits the GT line into smaller fragments based on the best match found. There are more steps, and some implementation details are only visible in the Java implementation.
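
The "match or split" idea from the simplified explanation can be loosely sketched in Python as follows. This is not the FCA algorithm from the paper or the Java implementation: it swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure, and `match_or_split` and its `threshold` parameter are hypothetical names invented for this illustration. It assumes at least one OCR line is given.

```python
from difflib import SequenceMatcher


def match_or_split(gt_line: str, ocr_lines: list, threshold: float = 0.9) -> tuple:
    """Find the OCR line most similar to gt_line.

    If the similarity reaches the (hypothetical) threshold, the GT line is
    accepted as a whole. Otherwise it is split around the longest block it
    shares with the best OCR line, yielding smaller fragments that could be
    matched again in a later round.
    """
    best = max(ocr_lines, key=lambda line: SequenceMatcher(None, gt_line, line).ratio())
    matcher = SequenceMatcher(None, gt_line, best)
    if matcher.ratio() >= threshold:
        return best, [gt_line]

    block = matcher.find_longest_match(0, len(gt_line), 0, len(best))
    pieces = (
        gt_line[:block.a],                      # text before the shared block
        gt_line[block.a:block.a + block.size],  # the shared block itself
        gt_line[block.a + block.size:],         # text after the shared block
    )
    fragments = [piece.strip() for piece in pieces if piece.strip()]
    return best, fragments


# A GT line spanning two columns that the OCR emitted as separate lines.
best, fragments = match_or_split("left column right column", ["left column", "right column"])
print(best, fragments)
```

In this toy case the GT line has no satisfying match as a whole, so it is split into one fragment per column, and each fragment would then find its own OCR line.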

> (I assume the "equal-length distance" editDist(..., substr(..., t2.length)) is because of runtime considerations, but in theory this does not need to be the same length; I would suggest word boundaries.)

I am confused by your mention of the "equal-length distance". Maybe you are confusing it with the splitting of lines into smaller fragments?

mikegerber self-assigned this Oct 8, 2020

mikegerber added the enhancement New feature or request label Oct 8, 2020

b2m mentioned this issue Nov 11, 2020

Add flexible character accuracy #47

Closed
