ALTO - PAGE xml: Object mapping and possible transformation generation #48

Open
Jo-CCS opened this issue May 2, 2018 · 7 comments

@Jo-CCS
Member

Jo-CCS commented May 2, 2018

At the face-to-face meeting in Vienna, the idea came up to define a conversion between PAGE and ALTO as a best-practice mapping between the objects of the two standards.
If feasible, the transformation could be provided as XSLT.

The idea is to create a mapping between the latest ALTO version (4) and the upcoming PAGE version due in June, and from there to work backwards through earlier versions as far as this makes sense.

The goal is a common solution for the mapping, especially for objects where no exact match is possible and workarounds or compromises need to be defined.

@chris1010010

chris1010010 commented May 2, 2018

Document with list of features here:
Doc

@chris1010010

chris1010010 commented Nov 6, 2019

I made a start here:
prima-core-libs (Java)
(XmlPageWriter_Alto.java)
It can already convert the main things such as blocks, text lines, strings and glyphs with shapes. But there are many ToDos.

Some issues that need discussing:

  • Margins (LeftMargin, TopMargin etc.). How much are those used in practice? We could approximate by using bounding boxes.
  • SP element. How is that used typically?
  • HYP element. Difficult to do. We could look for hyphens in the text content. But is every hyphen at the end of a text line a HYP?
  • Text CONTENT. At the moment I assume top-to-bottom text line order and left-to-right word/glyph order to determine string/glyph content (needed if text is stored in regions or text lines in the PAGE file).
  • GraphicalElement type. According to the schema documentation this is for separating lines and rectangles. So for now I only map PAGE Separator to this. Any other non-text region is mapped to Illustration. Regions that have child regions are mapped to ComposedBlock.
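
To make the last two points concrete, here is a minimal Python sketch of the rules as described above. It is not taken from XmlPageWriter_Alto.java: the `Region` class and both function names are hypothetical, and giving the child-region rule precedence over the type rules is an assumption, since the comment does not state an order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Region:
    """Hypothetical, simplified stand-in for a PAGE region."""
    page_type: str  # PAGE element name, e.g. "TextRegion", "SeparatorRegion"
    children: List["Region"] = field(default_factory=list)


def alto_block_type(region: Region) -> str:
    """Pick an ALTO block element name for a PAGE region."""
    # Assumption: a region with child regions is always a ComposedBlock.
    if region.children:
        return "ComposedBlock"
    if region.page_type == "TextRegion":
        return "TextBlock"
    # Only separators map to GraphicalElement for now.
    if region.page_type == "SeparatorRegion":
        return "GraphicalElement"
    # Every other non-text region falls back to Illustration.
    return "Illustration"


def assign_word_content(line_text: str, word_lefts: List[int]) -> List[Tuple[int, str]]:
    """Distribute the text of a line over its words, left to right.

    word_lefts holds the left x coordinate of each word's bounding box.
    Assumes left-to-right script and whitespace-separated words.
    """
    tokens = line_text.split()
    ordered = sorted(word_lefts)
    if len(tokens) != len(ordered):
        raise ValueError("token count does not match word count")
    return list(zip(ordered, tokens))
```

The `ValueError` marks the case where the left-to-right assumption cannot be applied safely, for example when the line text and the word segmentation disagree; a real converter would need a fallback there.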

The idea is to extend the JPageConverter to accept ALTO as a target format. This has already been added but not tested:
https://github.com/PRImA-Research-Lab/prima-page-converter

@cneud
Member

cneud commented Nov 6, 2019

@chris1010010 This is great for a head start, many thanks! I will also circulate this within the @OCR-D community for comments and contributions.

@chris1010010

@cneud
Happy to discuss priorities and sharing of work to keep the momentum. Thorough testing is a big chunk of work that can be easily distributed.

@chris1010010

I made some progress in the Java converter. Open issues: SP, HYP, margins
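
For margins and HYP, the approaches floated earlier in the thread (bounding boxes, line-final hyphens) could look roughly like the Python sketch below. It is an illustration only, not code from the Java converter. The function names are hypothetical, the way the four margins tile the page is an assumption, and the hyphen check only yields candidates, because not every line-final hyphen is necessarily a HYP.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), read as HPOS, VPOS, WIDTH, HEIGHT

# Hyphen-minus, soft hyphen, and Unicode hyphen.
HYPHEN_CHARS = ("-", "\u00ad", "\u2010")


def approximate_margins(page_w: int, page_h: int, boxes: List[Box]) -> Dict[str, Box]:
    """Approximate PrintSpace and margins from the bounding box of all regions."""
    left = min(x for x, _, _, _ in boxes)
    top = min(y for _, y, _, _ in boxes)
    right = max(x + w for x, _, w, _ in boxes)
    bottom = max(y + h for _, y, _, h in boxes)
    return {
        "PrintSpace": (left, top, right - left, bottom - top),
        # Assumption: top and bottom margins span the full page width,
        # left and right margins only the height of the print space.
        "TopMargin": (0, 0, page_w, top),
        "LeftMargin": (0, top, left, bottom - top),
        "RightMargin": (right, top, page_w - right, bottom - top),
        "BottomMargin": (0, bottom, page_w, page_h - bottom),
    }


def hyp_candidates(lines: List[str]) -> List[int]:
    """Return indices of lines that end in a hyphen and have a following line."""
    return [
        i for i, line in enumerate(lines[:-1])
        if line.rstrip().endswith(HYPHEN_CHARS)
    ]
```

The last line of a region is never reported, since there is no following line for the word to continue on. Whether a candidate should really become a HYP element remains the open question from above.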

@cneud
Member

cneud commented Feb 14, 2020

FYI there is also ongoing work in the German OCR SIG to complete what Christian started, cf. https://github.com/maxnth/page-alto-ressources and https://github.com/maxnth/prima-core-libs/branches

@artunit
Member

artunit commented May 5, 2021

As per the 2021-04-29 Board Meeting, I am linking the ocrd-page-to-alto TODO list here, which gives a nice summary of missing equivalencies. Kudos to everyone who has worked on this.

@cipriandinu removed the "high priority" label (Identified as high priority by Board) on Aug 22, 2023
6 participants