Inline Representation: Sections by function, not class #71

Merged
merged 5 commits into from
Oct 21, 2016

Owner

kba commented Oct 21, 2016

#51

Different space widths should be indicated using HTML and `&ensp;`, `&emsp`,
`&thinsp;`, `&zwnj;`, `&zwj;`.

### Hyphenation
Hyphenation {#hyphenation}
-----------

Soft hyphens must be represented using the HTML `&shy;` entity.

The HTML <a href="https://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.2.5">`&lrm;` and
`&rlm;` entities</a> (indicating writing direction) must not be used; all
writing direction changes must be indicated with tags.
Collaborator

should be under 'Writing Direction' header

Owner Author

Right, doesn't fit in hyphenation. Maybe move this to
https://kba.github.io/hocr-spec/1.2/#font-lang, replace "with tags" with "dir= attribute" and reference https://kba.github.io/hocr-spec/1.2/#valdef-ocr-capabilities-ocrp_dir?

Collaborator

+1

<li><a href="#sub-sup"><span class="secno">6.4</span> <span class="content">Superscript and Subscript</span></a>
<li><a href="#whitespace"><span class="secno">6.5</span> <span class="content">Whitespace</span></a>
<li><a href="#hyphenation"><span class="secno">6.6</span> <span class="content">Hyphenation</span></a>
<li><a href="#ruby"><span class="secno">6.7</span> <span class="content">Ruby characters</span></a>
Collaborator

You could also combine some of these sections under one new section, i.e.

HTML entities and Unicode

  • Non-breaking spaces must be represented using the HTML &nbsp; entity.
  • Different space widths should be indicated using HTML and &ensp;, &emsp;, &thinsp;, &zwnj;, &zwj;.
  • Soft hyphens must be represented using the HTML &shy; entity.
  • The HTML &lrm; and &rlm; entities (indicating writing direction) must not be used; all writing direction changes must be indicated with tags.
  • Furigana and similar constructs must be represented using their correct Unicode encoding.
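The entity rules in the list above can be checked mechanically. The following is a minimal Python sketch, not part of the spec: the function name `check_inline_entities` is hypothetical, and treating a literal (unescaped) character as a violation is an assumption about how "must be represented using the HTML entity" should be read.

```python
# Entities the quoted draft forbids; writing direction must be marked up instead.
FORBIDDEN_ENTITIES = ("&lrm;", "&rlm;")

# Raw characters mapped to the named entity the draft asks for.
# Flagging the raw character is an assumption of this sketch.
RAW_TO_ENTITY = {
    "\u00a0": "&nbsp;",
    "\u00ad": "&shy;",
    "\u2002": "&ensp;",
    "\u2003": "&emsp;",
    "\u2009": "&thinsp;",
    "\u200c": "&zwnj;",
    "\u200d": "&zwj;",
}


def check_inline_entities(source: str) -> list[str]:
    """Return human-readable problems found in a piece of hOCR source text."""
    problems = []
    for entity in FORBIDDEN_ENTITIES:
        if entity in source:
            problems.append(f"forbidden entity {entity}")
    for char, entity in RAW_TO_ENTITY.items():
        if char in source:
            problems.append(f"raw U+{ord(char):04X} should be written as {entity}")
    return problems
```

A check like this works on the unparsed source only; once an HTML parser has decoded the entities, the distinction between `&nbsp;` and a literal U+00A0 is gone.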

Owner Author

I think these aspects are worth their own section. e.g. for Whitespace: explain whether repeated whitespace is meaningful, if non-tabular aligned text should use tabs. For hyphenation, whether that's the only encoding (e.g. altoxml/schema#41).

Collaborator

Hyphens are also mentioned in http://kba.github.io/hocr-spec/1.2/#hardbreak

Besides ruby, the article also mentions other special entities:

For example, HTML and CSS provide support for representing fonts, styles,
hyphenation, flexible spacing, justification, kashida (flexible Arabic
characters), Urdu ligatures, Japanese ruby, mixed horizontal/vertical layout,
inline changes in writing direction, and many others.

However, I am also fine with more subsections.

Owner Author

I've moved the paragraph there before and probably will again once I get to the fonts/language section :)

Owner Author

#60


Non-breaking spaces must be represented using the HTML `&nbsp;` entity.

### Non-default spaces

Different space widths should be indicated using HTML and `&ensp;`, `&emsp`,
Collaborator

Semicolon missing in &emsp;

Owner Author

Fixed.
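A slip like the missing semicolon can be caught with a small check over the spec source. This is a rough Python sketch; the name `find_unterminated_entities` is made up for illustration, and the pattern will also flag a bare ampersand followed by letters (for example in "AT&T"), so results need a manual look.

```python
import re

# "&name" that is followed neither by ";" nor by further name characters.
UNTERMINATED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(?![A-Za-z0-9;])")


def find_unterminated_entities(text: str) -> list[str]:
    """Return the names of entity references that lack the closing semicolon."""
    return UNTERMINATED_ENTITY.findall(text)
```

Run against the diff line under review, it reports `emsp` and leaves the correctly terminated `&ensp;` alone.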


Superscripts and subscripts, when not in <{ocr_math}> or <{ocr_chem}> formulas,
must be represented using the HTML `<sup>` and `<sub>` tags, even if special
must be represented using the HTML <{sup}> and <{sub}> tags, even if special
Collaborator

These links are not working and I am not sure there is anything we can link to...


kba merged commit 251d66e into master Oct 21, 2016
kba deleted the inline-repr branch October 21, 2016 20:22
Collaborator

amitdo commented Oct 23, 2016

<sub> and <sup> for html 4.01:
https://www.w3.org/TR/html401/struct/text.html#h-9.2.3

Owner Author

kba commented Oct 23, 2016

I just find the HTML5 standard way better. I know we have

all tags should be used for the intended purpose (and only for the intended purpose) as defined in the [HTML40] spec.

in there, but we should rather change that than link to an old spec with bad examples:

      H<sub>2</sub>O
      E = mc<sup>2</sup>
      <SPAN lang="fr">M<sup>lle</sup> Dupont</SPAN>

The first two should not use sub/sup at all. None of the tags should be upper-case.
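The upper-case point is easy to test for automatically. Below is a small Python sketch under the assumption that a regular expression over the raw markup is good enough for spec examples; `uppercase_tags` is a hypothetical helper, not an existing tool. A regex is used on purpose here, because Python's `html.parser` lower-cases tag names before handing them to the caller.

```python
import re

# Captures the tag name of opening and closing tags in raw markup.
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def uppercase_tags(markup: str) -> list[str]:
    """Return every tag name that contains at least one upper-case letter."""
    return [name for name in TAG_NAME.findall(markup) if name != name.lower()]
```

Applied to the three HTML 4.01 examples quoted above, only the `SPAN` line is flagged; whether `<sub>`/`<sup>` are appropriate in the first two is a judgment call that a check like this cannot make.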
