Feature request: rearrange paragraphs for minimum difference #32
Comments
I'm leaning towards providing the UWER (unordered word error rate) in dinglehopper to resolve this. Thoughts:
I must agree, calculating reliable accuracy rates when the segmentation order is wrong is beyond the possibilities of […].

As always when it comes to evaluation, the PRImA group have some good publications on this, e.g. "The Significance of Reading Order in Document Recognition and its Evaluation" and "Scenario Driven In-Depth Performance Evaluation of Document Layout Analysis Methods".

The typical solution adopted in other evaluation tools is to include the bag-of-words (BOW) metric, which is easy to compute and could probably be supported by […].

An interesting, more recent addition from the scientific community is the Flexible Character Accuracy measure.
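To make the bag-of-words idea concrete, here is a minimal sketch of an order-independent word error rate. The function name `bag_of_words_error_rate`, the whitespace tokenisation, and the choice to count a missing/spurious pair as one substitution are all assumptions made for illustration; other evaluation tools may define the metric differently, and this is not dinglehopper's API.

```python
from collections import Counter


def bag_of_words_error_rate(gt: str, ocr: str) -> float:
    """Order-independent word error rate (illustrative definition).

    Words are compared as multisets, so reading order is ignored.
    """
    gt_words = Counter(gt.split())    # naive whitespace tokenisation (assumption)
    ocr_words = Counter(ocr.split())
    n_gt = sum(gt_words.values())
    if n_gt == 0:
        return 0.0 if not ocr_words else float("inf")

    missing = sum((gt_words - ocr_words).values())   # GT words not found in OCR
    spurious = sum((ocr_words - gt_words).values())  # OCR words not found in GT
    # A missing word paired with a spurious word counts as one substitution.
    errors = max(missing, spurious)
    return errors / n_gt


if __name__ == "__main__":
    # Same words, different paragraph order: no errors are reported.
    print(bag_of_words_error_rate("left column right column",
                                  "right column left column"))
```

Because only word counts are compared, swapped paragraphs or columns do not affect the result, but neither do genuinely scrambled words within a line, which is the usual trade-off of this metric.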
As I have a similar problem and need a solution, I will try to integrate Flexible Character Accuracy as an option in dinglehopper.
Flew through this paper. It does compare strings of GT with substrings of OCR (in case of erroneously joined columns). (I assume the "equal-length distance" […]) I still think such a flexible comparison is essential, before running OCR-D in production, to verify the workflows in use.
Simplified explanation: FCA compares a line from GT with all lines from OCR and either finds a satisfactory match or splits the GT line into smaller fragments based on the best match found. There are more steps, and some implementation details are only visible in the Java implementation.
I am confused by your mention of the "equal-length distance"... maybe you are confusing it with the splitting of lines into smaller fragments?
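A rough sketch of just the matching half of that explanation, for illustration: each GT line is compared with all remaining OCR lines and greedily paired with the closest one. It deliberately leaves out the splitting of GT lines into fragments and everything else in the paper and the Java implementation, so it is not FCA itself. The names `levenshtein` and `order_independent_edits` are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def order_independent_edits(gt_lines: list[str], ocr_lines: list[str]) -> int:
    """Greedily pair each GT line with its closest unused OCR line.

    Simplification: whole lines only, no fragment splitting.
    """
    remaining = list(ocr_lines)
    total = 0
    # Handle the longest GT lines first so they get first pick.
    for gt in sorted(gt_lines, key=len, reverse=True):
        if not remaining:
            total += len(gt)          # GT line with no OCR counterpart
            continue
        best = min(remaining, key=lambda ocr: levenshtein(gt, ocr))
        total += levenshtein(gt, best)
        remaining.remove(best)
    # Leftover OCR lines count fully as insertions.
    total += sum(len(ocr) for ocr in remaining)
    return total


if __name__ == "__main__":
    gt = ["first column", "second column"]
    ocr = ["second colunn", "first column"]   # swapped order, one typo
    print(order_independent_edits(gt, ocr))
```

With whole-line matching only, erroneously joined columns would still be penalised heavily; handling that case is what the fragment splitting is for.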
It would be nice if dinglehopper could try to arrange paragraphs so that wrong segmentation order (perhaps not so important for full text search) could be ignored.