incorrect extraction of deep text from a document with corrections #49

kosloot · 2022-10-06T17:33:11Z

The text() extraction function fails to extract the correct text from a sentence whose last Word is a Correction when that sentence is followed by another sentence.
This came up in: LanguageMachines/foliautils#66

When the last element is truly a Word, a space separator is added, and everything is fine. But in the case of a Correction the space is omitted, gluing the text of the 2 sentences together.
Example (rather braindead, but it proves the point):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="Walter" generator="libfolia-v2.12" version="2.5.1">
  <metadata type="native">
    <annotations>
      <token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
        <annotator processor="FoLiA-correct.1"/>
        <annotator processor="ucto.1"/>
      </token-annotation>
      <paragraph-annotation>
        <annotator processor="ucto.1"/>
      </paragraph-annotation>
      <sentence-annotation>
        <annotator processor="ucto.1"/>
      </sentence-annotation>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation set="Ticcl-set">
        <annotator processor="FoLiA-correct.1"/>
      </correction-annotation>
    </annotations>
    <provenance>
      <processor xml:id="ucto.1" begindatetime="2022-10-06T12:10:53" command="ucto -X -L deu --textredundancy=full --id Walter bug.in bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
        <processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
        <processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
          <processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
        </processor>
      </processor>
      <processor xml:id="FoLiA-correct.1" begindatetime="2022-10-06T12:11:06" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
        <processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
      </processor>
    </provenance>
    <meta id="language">deu</meta>
  </metadata>
  <text xml:id="Walter.text">
    <p xml:id="Walter.p.1">
      <t>chat... Von</t>
      <s xml:id="Walter.p.1.s.1">
        <t>chat...</t>
        <w xml:id="Walter.p.1.s.1.w.1" class="WORD" processor="ucto.1" space="no">
          <t>chat</t>
        </w>
        <correction xml:id="Walter.p.1.s.1.correction.1">
          <new>
            <w xml:id="Walter.p.1.s.1.w.3.edit.1" processor="FoLiA-correct.1">
              <t>...</t>
            </w>
          </new>
          <original auth="no">
            <w xml:id="Walter.p.1.s.1.w.3" class="PUNCTUATION-MULTI" processor="ucto.1">
              <t>...</t>
            </w>
          </original>
        </correction>
      </s>
      <s xml:id="Walter.p.1.s.2">
        <t>Von</t>
        <w xml:id="Walter.p.1.s.2.w.1" class="WORD" processor="ucto.1">
          <t>Von</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

When parsing this file with folialint:

bug.xml failed: inconsistent text: node p(Walter.p.1) has a mismatch for the text in set:current
the element text ='chat... Von'
 the deeper text ='chat...Von'
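
The separator behaviour described above can be reproduced in a few lines. This is only a minimal sketch in plain Python (standard library), not libfolia's actual code: `deep_text`, `delimiter` and the `buggy` flag are hypothetical names, and the document is a stripped-down version of the example.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Stripped-down version of the example: sentence 1 ends in a correction.
DOC = """<p xmlns="http://ilk.uvt.nl/folia">
  <s>
    <w space="no"><t>chat</t></w>
    <correction>
      <new><w><t>...</t></w></new>
      <original auth="no"><w><t>...</t></w></original>
    </correction>
  </s>
  <s>
    <w><t>Von</t></w>
  </s>
</p>"""


def delimiter(elem, buggy):
    """Separator to emit after elem (hypothetical model of the behaviour)."""
    if elem.tag == NS + "w":
        return "" if elem.get("space") == "no" else " "
    if elem.tag == NS + "correction":
        if buggy:
            return ""  # reported behaviour: no separator after a correction
        new = elem.find(NS + "new")
        if new is not None and len(new):
            return delimiter(new[-1], buggy)  # defer to the last word in <new>
        return ""
    # Structural element (s, p): defer to its last non-text child.
    kids = [c for c in elem if c.tag != NS + "t"]
    return delimiter(kids[-1], buggy) if kids else ""


def deep_text(elem, buggy=False):
    """Text assembled from the words below elem, following <new> in corrections."""
    if elem.tag == NS + "w":
        return elem.findtext(NS + "t", default="")
    if elem.tag == NS + "correction":
        new = elem.find(NS + "new")
        return deep_text(new, buggy) if new is not None else ""
    parts = []
    for child in elem:
        if child.tag == NS + "t":
            continue  # skip the element's own declared text
        parts.append(deep_text(child, buggy) + delimiter(child, buggy))
    return "".join(parts).strip()


root = ET.fromstring(DOC)
print(repr(deep_text(root, buggy=True)))   # 'chat...Von'
print(repr(deep_text(root, buggy=False)))  # 'chat... Von'
```

With `buggy=True` the two sentences are glued together, as in the folialint message; letting the correction inherit the delimiter of the last word in `<new>` restores the space.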


proycon · 2022-10-06T17:37:22Z

I indeed get something similar with foliapy:

$ foliavalidator issue49.folia.xml
VALIDATION ERROR on full parse by library (stage 2/3), in issue49.folia.xml
ParseError: FoLiA exception in handling of <p> @ line 35 (in parent <text> @ parent line 34) : [InconsistentText] Text for <Sentence at 140195206213216 id=Walter.p.1.s.1 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
chat
****> BUT FOUND (strict text after normalization) ****>
chat...
******* DEVIATION POINT: <*HERE*>chat...
(also checked against older rules prior to FoLiA v2.4.1)

proycon · 2022-10-06T17:40:12Z

The foliapy error is correct though:

the text of sentence Walter.p.1.s.1 ends in an ellipsis
but the correction in fact removes the ellipsis (that means the text on the higher level shouldn't have it either)

Seems a bit different from the error in libfolia.
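
The consistency check behind that message can be sketched as follows. This is an illustration only, not foliapy's implementation: `words` and `check` are hypothetical helpers, and the input assumes that the first version of the example had no text inside `<new>`, so the correction effectively deletes the ellipsis.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Assumed shape of the first version of the example: <new> carries no text.
SENTENCE = """<s xmlns="http://ilk.uvt.nl/folia">
  <t>chat...</t>
  <w space="no"><t>chat</t></w>
  <correction>
    <new/>
    <original auth="no"><w><t>...</t></w></original>
  </correction>
</s>"""


def words(elem):
    """Yield <w> elements in order, following <new> and skipping <original>."""
    for child in elem:
        if child.tag == NS + "w":
            yield child
        elif child.tag == NS + "correction":
            new = child.find(NS + "new")
            if new is not None:
                yield from words(new)


def check(sentence):
    """Return (declared text, deep text, whether they agree)."""
    declared = sentence.findtext(NS + "t", default="")
    pieces = []
    for w in words(sentence):
        sep = "" if w.get("space") == "no" else " "
        pieces.append(w.findtext(NS + "t", default="") + sep)
    deep = "".join(pieces).strip()
    return declared, deep, declared == deep


sentence = ET.fromstring(SENTENCE)
print(check(sentence))  # ('chat...', 'chat', False)
```

The sentence declares `chat...` while the words that survive the correction only add up to `chat`, which is the mismatch the validator reports.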

kosloot · 2022-10-06T17:45:15Z

Ok, my bad.
I corrected the example to have the same ellipsis in <new> (silly, but ok).
folialint still gives the same error (but foliavalidator validates it).

kosloot · 2022-10-07T16:10:35Z

A somewhat shaky solution has been committed now. It needs testing.

kosloot · 2022-10-08T07:28:32Z

This fix enables @martinreynaert to run his corrections, but it also shows, AGAIN, a difference of opinion between libfolia and foliapy.

Running FoLiA-correct on only the first part of the title of the text file already reveals this. The produced FoLiA is rejected by foliavalidator, but folialint is satisfied. The latter is wrong, imnsho.

The test file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="bug" generator="libfolia-v2.12" version="2.5.1">
  <metadata type="native">
    <annotations>
      <token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
        <annotator processor="FoLiA-correct.1"/>
        <annotator processor="ucto.1"/>
      </token-annotation>
      <paragraph-annotation>
        <annotator processor="ucto.1"/>
      </paragraph-annotation>
      <sentence-annotation>
        <annotator processor="ucto.1"/>
      </sentence-annotation>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation set="Ticcl-set">
        <annotator processor="FoLiA-correct.1"/>
      </correction-annotation>
    </annotations>
    <provenance>
      <processor xml:id="ucto.1" begindatetime="2022-10-08T08:48:33" command="ucto -X -L deu --textredundancy=full --id bug bug.txt bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
        <processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
        <processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
          <processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
        </processor>
      </processor>
      <processor xml:id="FoLiA-correct.1" begindatetime="2022-10-08T08:49:07" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
        <processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
      </processor>
    </provenance>
    <meta id="language">deu</meta>
  </metadata>
  <text xml:id="bug.text">
    <p xml:id="bug.p.1">
      <t>Walter Muschg Freud</t>
      <t class="Ticcl">Walter musch Freud</t>
      <s xml:id="bug.p.1.s.1">
        <t>Walter Muschg Freud</t>
        <t class="Ticcl">Walter musch Freud</t>
        <w xml:id="bug.p.1.s.1.w.1" class="WORD" processor="ucto.1">
          <t>Walter</t>
          <t class="Ticcl" offset="0">Walter</t>
        </w>
        <correction xml:id="bug.p.1.s.1.correction.1">
          <new>
            <w xml:id="bug.p.1.s.1.w.2.edit.1" processor="FoLiA-correct.1">
              <t class="Ticcl" offset="7">musch</t>
            </w>
          </new>
          <original auth="no">
            <w xml:id="bug.p.1.s.1.w.2" class="WORD" processor="ucto.1">
              <t>Muschg</t>
            </w>
          </original>
        </correction>
        <w xml:id="bug.p.1.s.1.w.3" class="WORD" processor="ucto.1">
          <t>Freud</t>
          <t class="Ticcl" offset="13">Freud</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

$ folialint --nooutput bug.ticcl.folia.xml
Validated successfully: bug.ticcl.folia.xml

$ foliavalidator bug.ticcl.folia.xml
VALIDATION ERROR on full parse by library (stage 2/3), in bug.ticcl.folia.xml
ParseError: FoLiA exception in handling of <p> @ line 35 (in parent <text> @ parent line 34) : [InconsistentText] Text for
<Sentence at 140336671748544 id=bug.p.1.s.1 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
Walter Freud
****> BUT FOUND (strict text after normalization) ****>
Walter Muschg Freud
******* DEVIATION POINT: Walter <HERE>Muschg Fre
(also checked against older rules prior to FoLiA v2.4.1)
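
One reading of the foliavalidator message is that deep text is collected per text class, and that the new word has no text in the default class. The sketch below illustrates that reading with the standard library only; it is not how either library is implemented, `text_of`, `words` and `deep_text` are hypothetical helpers, and treating a missing class attribute as "current" follows the "set:current" wording in the folialint output earlier in this thread.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# The sentence from the test file, reduced to its text content.
SENTENCE = """<s xmlns="http://ilk.uvt.nl/folia">
  <t>Walter Muschg Freud</t>
  <t class="Ticcl">Walter musch Freud</t>
  <w><t>Walter</t><t class="Ticcl" offset="0">Walter</t></w>
  <correction>
    <new><w><t class="Ticcl" offset="7">musch</t></w></new>
    <original auth="no"><w><t>Muschg</t></w></original>
  </correction>
  <w><t>Freud</t><t class="Ticcl" offset="13">Freud</t></w>
</s>"""


def text_of(elem, cls):
    """Direct <t> child of elem in text class cls, or None if absent."""
    for t in elem.findall(NS + "t"):
        if t.get("class", "current") == cls:
            return t.text
    return None


def words(elem):
    """Yield <w> elements in order, following <new> and skipping <original>."""
    for child in elem:
        if child.tag == NS + "w":
            yield child
        elif child.tag == NS + "correction":
            new = child.find(NS + "new")
            if new is not None:
                yield from words(new)


def deep_text(elem, cls):
    """Join the text of all words that have text in class cls."""
    texts = (text_of(w, cls) for w in words(elem))
    return " ".join(t for t in texts if t is not None)


sentence = ET.fromstring(SENTENCE)
for cls in ("current", "Ticcl"):
    print(cls, "| declared:", text_of(sentence, cls), "| deep:", deep_text(sentence, cls))
```

In the "Ticcl" class the declared and deep text agree. In the default class the word inside `<new>` contributes nothing, so the deep text is "Walter Freud" while the sentence declares "Walter Muschg Freud".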

@proycon I remember that issues like this have been discussed before, e.g. in proycon/folia#98 and proycon/folia#75.

But the argument has not been settled, it seems. And I agree that it is a difficult problem.

kosloot · 2022-10-08T22:32:31Z

Note: this is also related to proycon/folia#100, which is unfortunately deemed Low Priority.

kosloot added the bug label Oct 6, 2022

kosloot assigned proycon and kosloot Oct 6, 2022

kosloot added a commit that referenced this issue Oct 7, 2022

attempted fix for #49. Uses new select_set() function

46f8f31

kosloot added a commit that referenced this issue Oct 7, 2022

better fix for #49

8d9614e
