QA Spec #225
LGTM!
Co-authored-by: mweidling <13831557+mweidling@users.noreply.github.com>
I've merged all your proposals AFAICT.
For the future, please modify only the YAML files; the JSON files are generated from them. I can also generate the JSON in a non-pretty-printed format to reduce confusion.
Yeah, I noticed that too – too late. 🤪 Will do in the future!
Just to make it as clear as possible: for the purposes of these definitions, is a character a glyph, i.e. something printable and visual, a graphical representation of a character? If so, then no special whitespace code point (space, tab, zero-width space, invisible times, ...) counts as a character for OCR-D QA? IMHO this is quite reasonable. This doesn't apply to word-based metrics. But since structured GT is usually the backbone for evaluation, word boundaries, or the words themselves, are already present in the data if it is available at least at word level. Since this implies stripping all spaces beforehand for character-based textual evaluation, it should be made clear at which level (line with spaces, or finer) both GT and the corresponding candidate data are available. If GT is, for whatever reason, only available at line level, I assume these spaces are normalized too, or were even inserted by some legacy tooling, so there is no reason to keep these code points either.
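To make the question concrete, here is a minimal sketch of what "strip all whitespace-like code points, then compare characters" could look like. It is only one possible reading of the comment above, not the spec's definition: the function names are made up for illustration, the choice of Unicode categories to drop (`Cc`, `Cf`, `Zs`, `Zl`, `Zp`) is an assumption, and so is the convention for an empty GT.

```python
import unicodedata

# Assumption: these general categories are treated as "not a glyph".
# Cc = control (e.g. tab), Cf = format (e.g. zero-width space, invisible times),
# Zs/Zl/Zp = space, line and paragraph separators.
NON_GLYPH_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def strip_non_glyphs(text: str) -> str:
    """Drop whitespace, separator, control and format code points."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace()
        and unicodedata.category(ch) not in NON_GLYPH_CATEGORIES
    )


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer_without_spaces(gt: str, candidate: str) -> float:
    """Edit distance over glyph-only text, divided by the glyph-only GT length."""
    gt_chars = strip_non_glyphs(gt)
    cand_chars = strip_non_glyphs(candidate)
    if not gt_chars:
        # Assumed convention for an empty GT; the spec may define this differently.
        return 0.0 if not cand_chars else 1.0
    return levenshtein(gt_chars, cand_chars) / len(gt_chars)
```

With this reading, `"foo bar"` and `"foobar"` count as identical at character level. One caveat: dropping all `Cf` code points also removes joiners such as ZWJ/ZWNJ, which may not be acceptable for every script.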
Due to ongoing changes in the ocrd_eval schema, I think we should omit those changes from this PR in order to separate the different issues that this PR is trying to solve: defining a first draft of metrics definitions and creating a JSON schema for the Quiver API. Since the API is still at a very early stage, I'm not quite sure whether creating a spec right now would generate that much value for us.
I second that. Since the requirements for the UI are not completely clear yet, we should move the JSON schema for the data to be delivered by the back end to a separate branch.
Fine with me.
Schema changes now in #236
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
devil in the details...
I can live with the PR as it is now.
If you want to elaborate on the definition of BoW metrics, fine. If BWE should be replaced with BoW-Precision and BoW-Recall, also fine – as long as you give a concrete definition (ideally also mentioning which existing implementations provide which precise measure).
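For illustration, one common way to make BoW-Precision and BoW-Recall concrete is via multiset (bag) intersection of the tokens. The sketch below is an assumption and not the definition adopted in the spec, and it does not claim to match what any existing implementation computes; the function name and the whitespace tokenization are hypothetical choices.

```python
from collections import Counter


def bow_precision_recall(gt: str, candidate: str) -> tuple[float, float]:
    """Bag-of-words precision and recall over whitespace-separated tokens."""
    gt_bag = Counter(gt.split())
    cand_bag = Counter(candidate.split())

    # Multiset intersection: each word counts min(count in GT, count in candidate).
    overlap = sum((gt_bag & cand_bag).values())
    gt_total = sum(gt_bag.values())
    cand_total = sum(cand_bag.values())

    # Assumed convention: an empty bag yields 0.0 instead of a division error.
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / gt_total if gt_total else 0.0
    return precision, recall
```

For the GT `"the cat sat on the mat"` and the candidate `"the cat sat on mat"`, all five candidate words are found in the GT (precision 1.0), while one of the six GT words is missing (recall 5/6). Word order is ignored by construction.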
More references you might want to include:
- Leifert & Labahn 2019: End-to-End Measure for Text Recognition (on CER and derived metrics, analysis of reading order, segmentation and geometry influences)
- Zhang et al 2021: Rethinking Semantic Segmentation Evaluation for Explainability and Model Selection (semantic segmentation metrics like IoU discussed specifically with regard to over-segmentation and under-segmentation, proposes new metrics too)
- Rice 1996: Measuring the Accuracy of Page-Reading Systems (on distance algorithms, rates, derived metrics)
- Kanai & Rice 1995: Automated Evaluation of OCR Zoning (on layout evaluation, with metrics like Move Counting)
- Alberti et al 2017: Open Evaluation Tool for Layout Analysis of Document Images (basic metrics around layout evaluation)
- Clausner et al 2011: Scenario Driven In-Depth Performance Evaluation of Document Layout Analysis Methods (original PRImA Layout Performance Score and discussion)
Merged and will release it later. There are still open questions and "postponed" metrics, but it is an excellent first version we can and will build upon.
If I missed an unresolved discussion or some aspect that should be tracked in a dedicated issue, please let me know and/or open an issue.
The only open discussion is the one about BoW metrics. It's still somewhat valuable because it shows which implementations use which definitions. (Or should we add this to our Evaluation Wiki page?)
This pull request offers our first draft for the QA Specs. It consists of two main parts:
- ocrd_eval.md (which is equal to https://pad.gwdg.de/rLDBVhmYQ8CwOd67KcYHwQ#)
- the schema for the file format we want to use, e.g. for the benchmarking, cf. QA Spec - Schema #236
  - ocrd_eval.sample.yml