Length of main glyph and variants #44

Open
splet opened this issue Mar 2, 2017 · 5 comments

splet commented Mar 2, 2017

Separated from #26 (comment)
Glyph variants: The main glyphs are restricted to length 1, but variants to length 3. This could be inconvenient when dealing with OCR results. Say FineReader returns 5 options, some of length 1 and some longer. What happens if the first one is not of length 1? Does the ALTO exporter tool then check whether one of the other options has length 1 and change the order? And why three? For Latin script that would probably cover most cases, but other scripts might need longer ones.
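
To make the reordering question concrete, here is a minimal sketch of one possible behaviour. It is purely hypothetical: `pick_main_glyph` and the sample candidates are illustrative names, not part of FineReader or of any ALTO exporter, and length is assumed to be counted in Unicode code points.

```python
def pick_main_glyph(candidates, max_variant_len=3):
    """Split OCR candidates into (main, variants).

    Hypothetical policy: the first candidate of length 1 becomes the main
    glyph; the remaining candidates of at most max_variant_len code points
    keep their original order as variants. If no candidate has length 1,
    main is None.
    """
    main = next((c for c in candidates if len(c) == 1), None)
    variants = []
    main_skipped = False
    for c in candidates:
        # Skip only the first occurrence of the candidate chosen as main.
        if c == main and not main_skipped:
            main_skipped = True
            continue
        if 1 <= len(c) <= max_variant_len:
            variants.append(c)
    return main, variants


# Five OCR options where the top-ranked one is longer than 1.
main, variants = pick_main_glyph(["rn", "m", "nn", "in", "n"])
print(main, variants)
```

Under this policy the OCR engine's ranking is lost for the promoted candidate, which is exactly the inconvenience raised above.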

@splet splet self-assigned this Mar 2, 2017
Member

Jo-CCS commented Mar 3, 2017

This was discussed in the technical sessions, and I think it is also explained by Jean-Philip's statement that the main glyph should be a single sign and limited to length 1. The aim is to prevent misuse or misinterpretation: with multiple characters bound to one glyph, all kinds of combinations would become possible for the alternatives.
See #26 (comment)

Member

artunit commented Sep 28, 2019

In an effort to keep ahead of schema issues, ones without a direct schema implication will be closed if deemed to be no longer active or if the discussion has gone full circle. They can be reopened if requested.

@artunit artunit closed this as completed Sep 28, 2019
Contributor

bertsky commented Feb 15, 2021

The change proposed by @Jo-CCS and adopted into 4.0-4.2 includes this restriction on "character" length, which seems overly restrictive to me, not just with respect to OCR results but on principle: in some languages and scripts, not all relevant characters can be represented by a single Unicode code point (not to be confused with a glyph or grapheme cluster), yet that is what the schema enforces:

<xsd:length fixed="true" value="1"/>

Scripts like Arabic, Hebrew, Devanagari and Bengali rely heavily on combining mark sequences, and even for European languages (especially in historic texts) a precomposed code point is not always available. For example, the German umlauts äöü can be decomposed not only as äöü (with combining trema) but also as aͤoͤuͤ (with combining e). The same holds for other rare diacritics. One could argue the same for fractions, where only a few like ¾ ⅔ are available precomposed; the others have to be written in decomposed form, like 3⁄4 2⁄3.
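
The point about missing precomposed code points can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch; the helper name `codepoints` is made up for this example.

```python
import unicodedata


def codepoints(text):
    """Return the Unicode names of the code points in text after NFC normalization."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFC", text)]


# a + COMBINING DIAERESIS (U+0308) composes to one precomposed code point under NFC.
print(codepoints("a\u0308"))

# a + COMBINING LATIN SMALL LETTER E (U+0364) has no precomposed form,
# so it remains two code points even after NFC normalization.
print(codepoints("a\u0364"))

# 3 + FRACTION SLASH (U+2044) + 4 is not composed into U+00BE by NFC,
# so the decomposed fraction stays three code points long.
print(len(unicodedata.normalize("NFC", "3\u20444")))
```

So normalizing to NFC before export helps for äöü with trema, but not for aͤoͤuͤ or for fractions written with a fraction slash.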

Please re-open.

@mikegerber

I agree, this should be re-opened. Some glyphs in our historic prints, such as aͤ (LATIN SMALL LETTER A + COMBINING LATIN SMALL LETTER E), cannot be represented by a single Unicode code point, and the cited XML Schema restriction prevents us from storing them in a valid ALTO document.
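
The gap between code points and what a reader perceives as one character can be illustrated with a rough sketch. The function `approx_clusters` is a hypothetical helper that only attaches combining marks to the preceding base character; real grapheme cluster segmentation follows Unicode Standard Annex #29 and covers many more cases.

```python
import unicodedata


def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Rough approximation only: it treats any character with a non-zero
    canonical combining class as part of the preceding cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


glyph = "a\u0364"  # LATIN SMALL LETTER A + COMBINING LATIN SMALL LETTER E
print(len(glyph))                   # number of code points
print(len(approx_clusters(glyph)))  # number of approximate clusters
```

A length facet of 1 is applied to the first count, while the "one sign" intended for the main glyph corresponds more closely to the second.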

@artunit artunit reopened this Feb 16, 2021
Member

artunit commented Feb 16, 2021

Thanks for the comments, this issue is reopened.
