Expand schema documentation for PointsType #49

Closed
artunit opened this issue Jun 24, 2018 · 29 comments

@artunit
Member

artunit commented Jun 24, 2018

PointsType in ALTO v4 has very basic documentation:

<xsd:documentation>A list of points</xsd:documentation>

It would seem clearer to explicitly surface PointsType as a list of coordinate pairs, particularly for complex shapes and polylines. For example, using the Polygon syntax from issue 22:

<Shape>
  <Polygon POINTS="752.2 1239.1 752 1672 805 1672 805 1239"/>
</Shape>

This is arguably clearer as a list of coordinate pairs by using commas:

<Shape>
  <Polygon POINTS="752.2,1239.1 752,1672 805,1672 805,1239"/>
</Shape>

Or perhaps:

<Shape>
  <Polygon POINTS="(752.2,1239.1) (752,1672) (805,1672) (805,1239)"/>
</Shape>

The documentation might be a variation of what is used for MeasurementUnitType:

<xsd:documentation>
A list of coordinate-pairs that are absolute to the upper-left corner of a page. The upper 
left corner of the page is defined as coordinate (0,0).
</xsd:documentation>

This would seem to reduce the possibility of missing a coordinate and be more friendly to software interpretation without breaking backwards compatibility.

@bertsky
Contributor

bertsky commented Jul 12, 2019

Interesting. So this is even more liberal here than the specification in PAGE-XML, which uses a unified, polygon-based representation for all its coordinates. There, polygons are restricted via a regexp to contain a comma-separated, non-negative list with at least two point pairs. So the syntax is well standardized. (This was introduced in 2013.) But there is no specification (or even comment/recommendation) regarding semantics. Not even a description of the absolute x-y pixel coordinate system based on the upper left corner (like in ALTO's MeasurementUnitType).

IMO the following aspects should be addressed in the schema:

  1. exact coordinate syntax, enforceable via validators – missing in ALTO, present in PAGE
  2. coordinate system – present in ALTO, missing in PAGE
    • relative to (possibly rotated) parent element vs.
    • absolute for page image with origin and unit
  3. topology – missing
    • unordered points vs.
    • single open path vs.
    • multiple closed paths, orientation inside vs. outside (on left/right of path)
  4. constraints (or comments?) like
    • are paths allowed to exceed/leave the element's bounding box or even the page's bounding box (i.e. become negative), and if not: must they be closed along the bbox or may they stay open?
    • are paths required to be planar (i.e. have no cross-sections), and if not: how does the area compute,
      • by union vs.
      • by difference vs.
      • by orientation (left-of-path or right-of-path)?

Care should be taken to stay as compatible with existing implementations as is consistently possible.
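To make the open question about orientation and non-planar paths more concrete, here is a minimal sketch of the shoelace formula for a closed path. It is an illustration only: the function name `signed_area` is hypothetical, and nothing in ALTO or PAGE prescribes this computation. It assumes the page coordinate system described above, with the origin in the upper-left corner and y growing downwards.

```python
def signed_area(points: list[tuple[float, float]]) -> float:
    """Signed area of the closed path through `points` (shoelace formula).

    With the y axis pointing down (image coordinates), a positive result
    means the path runs clockwise as seen on the page, a negative result
    counter-clockwise.
    """
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        # Pair each vertex with its successor, wrapping around at the end.
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return total / 2.0


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
bow_tie = [(0, 0), (1, 1), (1, 0), (0, 1)]  # self-intersecting path
print(signed_area(square), signed_area(square[::-1]), signed_area(bow_tie))
```

For the self-intersecting path the two lobes cancel each other, which is one reason the schema would need to say whether such paths are allowed and how their area is meant to be interpreted.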

@cipriandinu
Member

As discussed in the last meeting, we should keep backwards compatibility and not change the schema. Nevertheless, before closing the topic it would be good, as the original post suggested, to give a guideline for the string format. I will add a proposal for the documentation and then put the topic up for voting.

@stweil
Contributor

stweil commented Sep 6, 2022

I just found this issue because of the different handling of POINTS in ABBYY Finereader and kraken:

# ABBYY Finereader
<Polygon POINTS="159,837 2414,837 2414,1038 159,1038 159,837"/></Shape>

## kraken
<Polygon POINTS="154 828 155 965 155 965 154 828"/>

The current kraken code does not understand the ABBYY variant.
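A reader that has to cope with both variants could treat commas, parentheses and whitespace alike as separators. The following is a minimal sketch under that assumption; `parse_points` is a hypothetical helper and not the code of kraken, ABBYY FineReader or any other tool mentioned here.

```python
import re


def parse_points(points: str) -> list[tuple[float, float]]:
    """Parse an ALTO POINTS string into a list of (x, y) pairs.

    Accepts "x1 y1 x2 y2 ...", "x1,y1 x2,y2 ..." and the parenthesized
    variants by splitting on any run of whitespace, commas or parentheses.
    """
    tokens = [t for t in re.split(r"[\s,()]+", points) if t]
    if len(tokens) % 2 != 0:
        raise ValueError(f"odd number of coordinates in POINTS: {points!r}")
    values = [float(t) for t in tokens]
    # Even positions are x values, odd positions are y values.
    return list(zip(values[0::2], values[1::2]))


abbyy = parse_points("159,837 2414,837 2414,1038 159,1038 159,837")
kraken = parse_points("154 828 155 965 155 965 154 828")
print(abbyy[0], kraken[0])
```

The price of this tolerance is that the pairing information carried by the commas is thrown away: a missing coordinate is only noticed when the total count is odd.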

@cipriandinu
Member

The description was updated for clarity, with some guidelines added. Changes are here: 4a301be
Please review and vote or propose changes.

@bertsky
Contributor

bertsky commented Sep 16, 2022

schema/v4/alto-4-4.xsd

Lines 702 to 712 in 4a301be

<xsd:simpleType name="PointsType">
<xsd:annotation>
<xsd:documentation>A list of coordinate-pairs that are absolute to the upper-left corner of a page.</xsd:documentation>
<xsd:documentation>The upper left corner of the page is defined as coordinate (0,0)</xsd:documentation>
<xsd:documentation>Even there are no rules to enforce a particular format for a points list recommended formats are:</xsd:documentation>
<xsd:documentation>"x1 y1 x2 y2 ... xn yn"</xsd:documentation>
<xsd:documentation>"x1,y1 x2,y2 ... xn,yn"</xsd:documentation>
<xsd:documentation>"(x1 y1) (x2 y2) ... (xn yn)"</xsd:documentation>
<xsd:documentation>"(x1,y1) (x2,y2) ... (xn,yn)"</xsd:documentation>
</xsd:annotation>
<xsd:restriction base="xsd:string"/>

So it's going to be laissez-faire all the way then? This will make it hard for implementors, though. (Conversely, users will have to pay the price of that design decision by ending up with incompatibilities.)

IMHO adding a few format restrictions retroactively (beginning with 4.4) – still allowing what has been in use overwhelmingly, but precluding new/rare formats – would be a better choice. For example:

<xsd:restriction base="xsd:string">
  <xsd:pattern value="([0-9.]+ [0-9.]+ )+([0-9.]+ [0-9.]+)"/>
  <xsd:pattern value="([0-9.]+,[0-9.]+ )+([0-9.]+,[0-9.]+)"/>
</xsd:restriction>

(One could debate whether the decimal point must be allowed as well, and whether it must be at least 1 point or at least 3, which can be expressed by {3,} instead of +.)

@cipriandinu
Member

@bertsky, it might be a good idea to do this in multiple steps: the change you propose would break backwards compatibility, so if we decide to implement it, it will have to be part of a major release (5.0) anyway. I agree that it is better to have something enforced; the only concern at the last meeting was about compatibility and which is preferred (keeping it even though the solution is not the best, or breaking it for the sake of future advantages). In 4.4 we could release only the usage recommendation, so that users have the chance to comply with it until we add the enforced rule in 5.0. Regarding the regular expression itself, it should be a bit different, since [0-9.]+ will also match something like 89...989..97.99, which is wrong. Maybe something like [0-9]+.?[0-9]* for a coordinate: it will not match .89, but if we used * instead of + when building the expression for a pair, that pattern could easily match any single space.

@mittagessen
Contributor

mittagessen commented Sep 16, 2022

@bertsky Your pattern is insufficient to capture valid floating point values like 1.2344e5 and +2.54.

Personally, I'd prefer limiting it to one valid representation and then bumping up the major version even though it breaks backwards compatibility.
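The two objections to the [0-9.]+ coordinate pattern can be checked mechanically. The sketch below uses Python's `re.fullmatch` as a stand-in for the implicitly anchored XSD pattern facet. `STRICT` is a hypothetical coordinate pattern written for this illustration (optional sign, digits, optional fraction, optional exponent); it is not a pattern agreed on in this thread, and whether signs or exponents should be allowed at all is exactly what is under discussion.

```python
import re

# Coordinate pattern from the proposal above: any run of digits and dots.
LOOSE = r"[0-9.]+"
# Hypothetical stricter coordinate (an assumption, not an agreed rule).
STRICT = r"[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?"


def pairs_pattern(coord: str, sep: str) -> re.Pattern:
    """Build "(pair )+pair" as in the proposal, i.e. at least two pairs."""
    pair = f"{coord}{sep}{coord}"
    return re.compile(f"({pair} )+{pair}")


loose = pairs_pattern(LOOSE, ",")
strict = pairs_pattern(STRICT, ",")

for sample in ("89...989..97.99,1 2,3", "1.2344e5,+2.54 3,4"):
    print(sample, bool(loose.fullmatch(sample)), bool(strict.fullmatch(sample)))
```

The loose pattern accepts the malformed first sample and rejects the second one, which consists of valid floating point values; the stricter sketch behaves the other way round.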

@bertsky
Contributor

bertsky commented Sep 16, 2022

@cipriandinu @mittagessen I agree with your assessments.

@cipriandinu
Member

Set back to discussion for next meeting - new proposal:

  1. Update documentation as proposed in 4.4 (documentation should reflect exactly the intended changes for 5.0)
  2. Add proper restrictions into 5.0 since those changes could break back-compatibility

@cipriandinu
Member

According to the results of the last meeting, we will split the topic into two parts as proposed before:

  1. Update documentation as proposed in 4.4 (documentation should reflect exactly the intended changes for 5.0) - we will keep this thread for this
  2. Add proper restrictions into 5.0 since those changes could break back-compatibility - new topic Restrict PointsType to a well defined format #80

@cipriandinu
Member

To decide how restrictive the rule for PointsType should be, we should vote on three options:

  1. High restriction level: we will allow at most two options (not necessarily the ones below, but only 2):
    <xsd:documentation>"x1 y1 x2 y2 ... xn yn"</xsd:documentation>
    <xsd:documentation>"x1,y1 x2,y2 ... xn,yn"</xsd:documentation>
  2. Medium restriction level: we will allow four options:
    <xsd:documentation>"x1 y1 x2 y2 ... xn yn"</xsd:documentation>
    <xsd:documentation>"x1,y1 x2,y2 ... xn,yn"</xsd:documentation>
    <xsd:documentation>"(x1 y1) (x2 y2) ... (xn yn)"</xsd:documentation>
    <xsd:documentation>"(x1,y1) (x2,y2) ... (xn,yn)"</xsd:documentation>
  3. Low restriction level: we have a restriction, but allow many options (6-8 or even more). The examples below are just an idea, not necessarily a complete or recommended list:
    <xsd:documentation>"x1 y1 x2 y2 ... xn yn"</xsd:documentation>
    <xsd:documentation>"x1,y1 x2,y2 ... xn,yn"</xsd:documentation>
    <xsd:documentation>"(x1 y1) (x2 y2) ... (xn yn)"</xsd:documentation>
    <xsd:documentation>"(x1,y1) (x2,y2) ... (xn,yn)"</xsd:documentation>
    <xsd:documentation>"[x1 y1] [x2 y2] ... [xn yn]"</xsd:documentation>
    <xsd:documentation>"[x1,y1] [x2,y2] ... [xn,yn]"</xsd:documentation>
    <xsd:documentation>"{x1 y1} {x2 y2} ... {xn yn}"</xsd:documentation>
    <xsd:documentation>"{x1,y1} {x2,y2} ... {xn,yn}"</xsd:documentation>
    ... any other ideas, etc.

To vote on this topic, post a comment with the simple text "Option 1", "Option 2" or "Option 3".

@cipriandinu
Member

Option 2

@ntra00
Member

ntra00 commented Oct 14, 2022

Option 2

@stweil
Contributor

stweil commented Nov 10, 2022

Option 2 with additional remark:

From the documentation: "The upper left corner of the page is defined as coordinate (0,0)".

Is it possible to say that "(x1,y1) (x2,y2) ... (xn,yn)" is the preferred variant? That variant fits best to (0,0).
And also say that "x1 y1 x2 y2 ... xn yn" should be avoided because it can become difficult for humans to identify a pair somewhere in the middle of a lengthy list of points?

@cowboyMontana
Member

cowboyMontana commented Dec 15, 2022

Option 2. Changed to Option 1 with comma-separated coordinates on February 16, 2023.

@c-sebastien

Option 2, agreeing with the remark of @stweil

@bertsky
Contributor

bertsky commented Jan 10, 2023

Option 1 IMHO

Has anyone actually seen existing implementations already using parenthesis or brackets? (If not, let's not encourage this new paradigm!)

@cipriandinu
Member

@stweil - would it be better to have a more neutral comment like "The upper left corner of the page is defined as x=0 and y=0", so that we do not give any hint about what is preferable and what is not? I hope I understood your comment properly. Or would you like a clearer indication of what is preferred and what is not?

@stweil
Contributor

stweil commented Jan 27, 2023

If Option 1 is chosen, I'd suggest marking the variant with commas as the preferred one. Citing @artunit: "This is arguably clearer as a list of coordinate pairs by using commas".

@cipriandinu cipriandinu added this to the v4.4 milestone Jan 27, 2023
@cneud
Member

cneud commented Feb 14, 2023

My vote would be for Option 1 and also a recommendation on the use of commas to aid with readability actually.

@callylaw
Member

I would vote for option 2

@rajubln

rajubln commented Feb 16, 2023

I would vote Option 2

@cipriandinu
Member

Citing @bertsky: "Option 1 IMHO. Has anyone actually seen existing implementations already using parenthesis or brackets? (If not, let's not encourage this new paradigm!)"

Even though I voted for Option 2, this is a good point. I agree we should not encourage this if indeed nobody has used brackets in the past. Maybe we should go with Option 1 and see if there is any reaction on the ALTO mailing list when we announce the proposal for 4.4 (before officially launching the version).

@Haighton

I agree with @stweil: Option 1 with the recommendation to use commas for readability.

@cowboyMontana
Member

cowboyMontana commented Feb 16, 2023

option 1 with comma separated coordinates

@JLoitzenbauer-CRKN

Option 1 with comma looks good.

@cipriandinu
Member

Based on your votes and the last ALTO Board discussions, Option 1 was selected. Here is the documentation proposal:

<xsd:simpleType name="PointsType">
<xsd:annotation>
<xsd:documentation>A list of coordinate-pairs that are absolute to the upper-left corner of a page.</xsd:documentation>
<xsd:documentation>The upper left corner of the page is defined as x=0 and y=0</xsd:documentation>
<xsd:documentation>Currently there are no rules to enforce a particular format for a points list but in future versions is planned to restrict it to following options:</xsd:documentation>
<xsd:documentation>"x1,y1 x2,y2 ... xn,yn" - highly recommended as widely used and easy to read by both human and machine</xsd:documentation>
<xsd:documentation>"x1 y1 x2 y2 ... xn yn" - kept for back compatibility, since currently there are tools using this format</xsd:documentation>
</xsd:annotation>
<xsd:restriction base="xsd:string"/>
</xsd:simpleType>
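For tools that still write the space-only form, converting to the recommended comma form is a small step. The following is a minimal sketch; `to_recommended` is a hypothetical helper name, and the code assumes the input already follows one of the two formats listed in the documentation above.

```python
def to_recommended(points: str) -> str:
    """Rewrite "x1 y1 x2 y2 ..." as "x1,y1 x2,y2 ...".

    Input that is already in the comma form is returned in the same form.
    The coordinate tokens are kept as text, so the number formatting of
    the source is preserved.
    """
    tokens = points.replace(",", " ").split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"odd number of coordinates in POINTS: {points!r}")
    return " ".join(f"{x},{y}" for x, y in zip(tokens[0::2], tokens[1::2]))


print(to_recommended("154 828 155 965 155 965 154 828"))
```

Keeping the tokens as strings rather than parsing them as numbers avoids changing values such as 752.2 through a round trip to floating point.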

@cipriandinu
Member

<xsd:simpleType name="PointsType"> <xsd:annotation> <xsd:documentation>A list of coordinate-pairs that are absolute to the upper-left corner of a page.</xsd:documentation> <xsd:documentation>The upper left corner of the page is defined as x=0 and y=0</xsd:documentation> <xsd:documentation>Currently there are no rules to enforce a particular format for a points list but in future versions is planned to restrict it to following options:</xsd:documentation> <xsd:documentation>"x1,y1 x2,y2 ... xn,yn" - highly recommended as widely used and easy to read by both human and machine</xsd:documentation> <xsd:documentation>"x1 y1 x2 y2 ... xn yn" - kept for back compatibility, since currently there are tools using this format</xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"/> </xsd:simpleType>

@cipriandinu
Member

4.4 released
