Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ALTO support for encoding OCR segmentation ambiguity #63

Open
artunit opened this issue Nov 5, 2019 · 6 comments
Open

ALTO support for encoding OCR segmentation ambiguity #63

artunit opened this issue Nov 5, 2019 · 6 comments

Comments

@artunit
Copy link
Member

artunit commented Nov 5, 2019

At the 2019-05-07 ALTO face-to-face Board Meeting there was support for identifying a a standard and interoperable lattice structure for encoding OCR uncertainty and alternative hypotheses, and the discussion since then has largely been attached to issue 57 - Allow a String to contain alternative Glyph segmentation hypotheses. Significant progress has been made, including an extensive proposal from @bertsky, and with momentum from the 2019-11-03 f2f meeting, it makes sense to move the discussion to a new issue. Further updates will be tracked here.

@artunit artunit changed the title ALTO support for encoding OCR uncertainty ALTO support for encoding OCR uncertainty and alternative hypotheses Nov 5, 2019
@bertsky
Copy link
Contributor

bertsky commented Nov 7, 2019

Note: Above link was merely the example, the actual proposal was a few comments before that.

In my view, the most significant progress over that was the aspect of backwards compatibility ("lattice opt-in") which had been brought up by @artunit and for which @cipriandinu delivered a viable solution:

To not let Lattice alternate with String and SP in the sequence that constitutes a TextLine (representing both local or global ambiguity obligatorily), but to have it in a separate position after HYP (thus representing global ambiguity optionally). The semantic relation between the Lattice and the String and SP sequence would then be that the latter becomes the single-best path of the former (or, if StringVariant where allowed in String as well, the single-best chain / confusion network / sausage graph). Moreover, Lattice would also be allowed inside String just below the Glyph sequence (representing local ambiguity optionally). Formally,

<xsd:element name="TextLine" maxOccurs="unbounded">
  ... <!-- documentation -->
  <xsd:complexType>
    <xsd:sequence>
      ... <!-- Shape sequence -->
      <!-- old representation: -->
      <xsd:sequence maxOccurs="unbounded">
        <xsd:sequence>
          <xsd:element name="String" type="StringType"/>
          <xsd:element name="SP" type="SPType" minOccurs="0"/>
        </xsd:sequence>
        <xsd:element name="HYP" minOccurs="0">
          ... <!-- HYP type -->
        </xsd:element>
        <!-- new representation: -->
        <xsd:element name="Lattice" type="LatticeType" minOccurs="0"/>
      </xsd:sequence>
    ... <!-- attributes etc -->
  </xsd:complexType>
...
</xsd:element>

for global (word segmentation and glyph segmentation) ambiguity and

<xsd:complexType name="StringType" mixed="false">
  ... <!-- documentation -->
  <xsd:sequence minOccurs="0">
    <!-- old representation: -->
    <xsd:element name="Shape" type="ShapeType" minOccurs="0" maxOccurs="1"/>
    <xsd:element name="ALTERNATIVE" type="ALTERNATIVEType" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="Glyph" type="GlyphType" minOccurs="0" maxOccurs="unbounded"/>
    <!-- new representation: -->
    <xsd:element name="Lattice" type="LatticeType" minOccurs="0"/>
  </xsd:sequence>
  ... <!-- attributes etc -->
</xsd:complexType>

for local (glyph segmentation) ambiguity.

The actual Lattice definition stays the same:

<xsd:complexType name="LatticeType">
  <xsd:sequence>
    <!-- re-use GlyphType for nodes (but allow it to be used for white space as well): -->
    <xsd:element name="Glyph" type="GlyphType" maxOccurs="unbounded"/>
    <!-- introduce a simple edge type: -->
    <xsd:element name="Span" type="SpanType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
  <xsd:attribute name="ID" type="xs:ID" use="optional"/>
  <!-- for convenience, summarize initial nodes (yes, plural here): -->
  <xsd:attribute name="begin" type="xs:IDREFS" use="required"/>
  <!-- for convenience, summarize terminal nodes (yes, plural here): -->
  <xsd:attribute name="end" type="xs:IDREFS" use="required"/>
</xsd:complexType>
<xsd:complexType name="SpanType">
  <!-- ID of incoming Glyph: -->
  <xsd:attribute name="begin" type="xs:IDREF" use="required"/>
  <!-- ID of outgoing Glyph: -->
  <xsd:attribute name="end" type="xs:IDREF" use="required"/>
</xsd:complexType>

Practically, since IDREFs have document-wide scope, the Span edges in the lattice could either re-use existing Glyph elements outside the Lattice element, or point to additional Glyph elements inside of it.

Regarding the fear of "exploding" file size, lattices can always be pruned to reduce the size (and precision). This is a trade-off which users will have to make based on their use-case. (And it's a trade-off that n-best lists offer, too, but extremely inefficient because they don't avoid the combinatorial explosion.)

However, all this is also related to #54, because:

  • it makes most sense to encode white space explicitly with Glyph within Lattice anyway, so allowing Glyph or at least @CONTENT in SPType would harmonize both perspectives; but
  • if the existing standard was to be re-interpreted as allowing StringType to be out of order (and overlapping) under TextLine instead of a path/sequence, then the need for a lattice extension would become much less urgent: Under that interpretation, word segmentation ambiguity can already be represented by a set of String in the TextLine, leaving only glyph segmentation ambiguity to be represented by a Lattice in String.

So we should wait for that issue to be resolved on its own, and then continue here.

BTW, I liked the old title better. Now that the StringVariantType proposal will live on independently in #57, we have the paradoxical situation that the lattice solution to be further discussed here (being the only one which can truly represent segmentation ambiguity, whereas StringVariantType is, in essence, n-best lists) does not actually run with the term alternative segmentation in the title. In contrast, the existing standard already covers OCR uncertainty (VC) and alternative hypotheses (VariantType) to a large degree (namely minus segmentation ambiguity). So how about support for encoding OCR segmentation ambiguity?

@artunit artunit changed the title ALTO support for encoding OCR uncertainty and alternative hypotheses ALTO support for encoding OCR segmentation ambiguity Nov 7, 2019
@artunit
Copy link
Member Author

artunit commented Nov 7, 2019

Looks good! I have changed the title.

@cipriandinu
Copy link
Member

An additional idea might be to give some probabilities for each Span:

<xsd:complexType name="SpanType">

<xsd:attribute name="begin" type="xs:IDREF" use="required"/>

<xsd:attribute name="end" type="xs:IDREF" use="required"/>

<xsd:attribute name="probability" ref="probability_attr" use="optional"
</xsd:complexType>

<xs:attribute name="probability_attr">
xs:simpleType
<xs:restriction base="xs:decimal">
<xs:minExclusive value="0.0"/>
<xs:maxInclusive value="1.0"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>

The highest probabilities product should lead to the best option encoded as string/textline. On the other hand these probabilities represent the initial assessment made by recognition engine and an ALTO processor may ignore these and apply its own probabilities that may lead to a different best option (real benefit is that in time, based on same Lattice, better models can be applied in order to rebuild the probabilities and generate a different and maybe more accurate best result)

@bertsky
Copy link
Contributor

bertsky commented Dec 13, 2019

Thanks @artunit!

To complement the discussion on a possible lattice extension to fully represent OCR ambiguity with a perspective more inclined to an extension based on the confidence/confusion matrix (henceforth, confmat), here are some arguments provided by @gundramleifert, grouped by different aspects:

Technological dependency

  • lattice: generic/agnostic. Works for all kinds of OCR approaches known so far (segmentation+feature-based classification, HMM, RNN+CTC, RNN-Seq2Seq)
  • confmat: specific to HMM and RNN+CTC

Still unclear:

  • will new technologies support confmat?
  • will RNN-Seq2Seq surpass RNN+CTC?

Generality

A lattice can be used to represent a confmat, too: by adding or prolonging a character edge for each output channel at each position. The null/not-a-character channel must be a special or gap symbol. So it's possible, but extremely inefficient – yet again, one could apply pruning to the desired degree.

Interdependence

With RNN+CTC, lattices are generated from the confmat via CTC approximation. Lattices often implicitly rely on a language model (LM) for pruning, whereas confmats don't. Once you have pruned a lattice with some LM, you lose some information (which even a larger/better LM could not recover). With confmats on the other hand, decoding and LM scoring are always defered to the runtime (so you can still decide on the LM later without loosing information).

Decoding the confmat might require knowledge of the specific encoding of the original model at runtime. (Channels are not necessarily idempotent to characterset, especially for large scripts.) And the runtime cost of CTC approximation for confmats will usually be larger than the cost of Viterbi search for lattices.

Memory efficiency

Without information loss

  • lattice: impossible (grows exponentially with line length)
  • confmat: normal (grows linearly with line length)

With information loss

  • lattice: pruning as trade-off between space and completeness; possibly deviating objectives for different use-cases (transcription vs search)
  • confmat: can be compressed and/or cut-off as well, but trade-off is always better (cf. space complexity above)

@artunit
Copy link
Member Author

artunit commented Dec 16, 2019

Thanks so much for this analysis! At the 2019-12-13 ALTO Board Meeting, the idea was put forward that we do a single-topic meeting on this discussion in January 2020 since it is high priority and also connects to the important confidence value issue. I will follow up with you and @gundramleifert to see if there is opportune timing in the last 2 weeks of January (Jan. 20-) for us to put together a virtual gathering. If we can make the arrangements, I will post the zoom link here.

@artunit artunit added the high priority Identified as high priority by Board label Dec 16, 2019
@urieli
Copy link

urieli commented Feb 10, 2020

@bertsky Can you give a link to the confusion matrix discussion? And/or show how a confusion matrix would be encoded in an XSD/sample XML?

@cipriandinu cipriandinu removed the high priority Identified as high priority by Board label Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants