Use SCHEMATRON for additional quality checks #87

Open
cipriandinu opened this issue Jul 12, 2024 · 0 comments

cipriandinu commented Jul 12, 2024

Following #62, we concluded that it would make sense to have a generic way to validate the content of ALTO files in more detail. Several ideas were discussed at board meetings:

  1. Use XSD 1.1 and assertions to implement consistency checks, such as requiring a TextBlock box to be fully contained in the page box (no negative coordinates and no coordinates larger than the page width/height). There are two main concerns with this approach. First, there are far fewer open source validation tools for XSD 1.1 than for 1.0. Second, if we add these checks to the XSD validation, they become mandatory, and that level of restriction would create a lot of trouble for both ALTO creators and consumers.
  2. Use a separate SCHEMATRON schema (https://en.wikipedia.org/wiki/Schematron) as an add-on to the default XSD validation. This new schema can optionally be used in an ALTO validation pipeline by users who want more restrictive checks (quality checks rather than structural checks).

Based on the board discussions, we should continue with option 2.

On this topic, we would like to collect as many ideas as possible for SCHEMATRON validation in order to create a list of checks to be implemented. For each proposed test, please also propose a severity level. For the moment I would propose ERROR, WARNING and INFO as possible levels, just as a starting point.
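To make the severity proposal more concrete, here is a minimal Python sketch of how a validation pipeline might consume findings tagged with the three proposed levels. The names `Severity`, `Finding`, `filter_findings` and `has_errors` are hypothetical and only for illustration; they are not part of ALTO or of any SCHEMATRON tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """The three levels proposed above, ordered from least to most severe."""
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Finding:
    """One result of one check (hypothetical structure)."""
    check_id: str
    severity: Severity
    message: str


def filter_findings(findings, minimum=Severity.WARNING):
    """Keep only the findings at or above the requested severity."""
    return [f for f in findings if f.severity >= minimum]


def has_errors(findings):
    """True when at least one finding is an ERROR."""
    return any(f.severity is Severity.ERROR for f in findings)
```

With something like this, a strict pipeline could reject a file as soon as `has_errors` is true, while a more lenient one could only log what `filter_findings` returns.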

So far, the following tests/categories of tests have been proposed:

  1. Coordinate checks, starting from #62 ("Restrict float attribute values where possible to allow for better xml-validation") and extended to all boundaries: not only positive coordinates, but all values or combinations of values (like VPOS + HEIGHT < PAGE HEIGHT) should be inside the page/PrintSpace/margin boundaries.
  2. Overlap checks: even though ALTO does not require zero overlaps, overlapping elements might indicate issues.
  3. Parent elements without children (for example, a TextLine without any String inside).
  4. Any string encoding issues.
  5. Meaningful usage of optional information: for example, even though VPOS and HPOS are optional in the schema, it might be a good idea to flag when any of them are missing, as errors or at least as warnings.
  6. Language-specific checks (for example, in Chinese each glyph should usually be encoded as its own word, and two Chinese glyphs in the same word are considered incorrect by some ALTO processors).
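To illustrate what categories 1, 3 and 5 could check, here is a rough sketch in plain Python. It is not SCHEMATRON and not a proposed implementation, just the logic the rules would have to express. The function names and the severity assigned to each finding are assumptions for discussion. The sketch also assumes that coordinates and the Page WIDTH/HEIGHT use the same measurement unit, and it treats a box that ends exactly on the page edge as valid.

```python
import xml.etree.ElementTree as ET

# Elements whose position is checked in this sketch (an assumption, not a full list).
POSITIONED = {"TextBlock", "TextLine", "String"}


def local_name(tag):
    """Strip any '{namespace}' prefix so the sketch works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def to_float(value):
    """Return the value as a float, or None when it is missing or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def check_alto(xml_text):
    """Return a list of (severity, element, id, message) tuples."""
    findings = []
    root = ET.fromstring(xml_text)
    pages = [el for el in root.iter() if local_name(el.tag) == "Page"]
    for page in pages:
        page_w = to_float(page.get("WIDTH"))
        page_h = to_float(page.get("HEIGHT"))
        for el in page.iter():
            name = local_name(el.tag)
            if name not in POSITIONED:
                continue
            el_id = el.get("ID")

            # Category 3: parent element without children.
            if name == "TextLine" and not any(
                local_name(child.tag) == "String" for child in el
            ):
                findings.append(("WARNING", name, el_id, "TextLine has no String"))

            hpos, vpos = to_float(el.get("HPOS")), to_float(el.get("VPOS"))
            width, height = to_float(el.get("WIDTH")), to_float(el.get("HEIGHT"))

            # Category 5: optional position attributes are missing.
            if hpos is None or vpos is None:
                findings.append(("WARNING", name, el_id, "HPOS or VPOS missing"))
                continue

            # Category 1: the box must stay inside the page box.
            if hpos < 0 or vpos < 0:
                findings.append(("ERROR", name, el_id, "negative coordinate"))
            if page_w is not None and width is not None and hpos + width > page_w:
                findings.append(("ERROR", name, el_id, "HPOS + WIDTH exceeds page WIDTH"))
            if page_h is not None and height is not None and vpos + height > page_h:
                findings.append(("ERROR", name, el_id, "VPOS + HEIGHT exceeds page HEIGHT"))
    return findings
```

Calling `check_alto` on the text of an ALTO file returns the findings in document order. The same conditions would map to SCHEMATRON assertions with a role or severity attached, and checks against PrintSpace or margins would follow the same pattern with a different reference box.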

Please add your own ideas and elaborate on the test categories listed above, so that we can end up with a list of tests to be implemented and their verbosity level. The SCHEMATRON schema would be optional, but it should serve as a guideline of good practices for creating ALTO files.
