incorrect extraction of deep text from a document with corrections #49

Open
kosloot opened this issue Oct 6, 2022 · 6 comments

@kosloot
Contributor

kosloot commented Oct 6, 2022

The text() extraction function fails to extract the correct text from a sentence whose last Word is a Correction when that sentence is followed by another sentence.
This came up in: LanguageMachines/foliautils#66

When the last Word is truly a Word, a space separator is added and everything is fine. But in the case of a Correction the space is omitted, gluing the text of the two sentences together.
Example (rather braindead, but it proves the point):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="Walter" generator="libfolia-v2.12" version="2.5.1">
  <metadata type="native">
    <annotations>
      <token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
        <annotator processor="FoLiA-correct.1"/>
        <annotator processor="ucto.1"/>
      </token-annotation>
      <paragraph-annotation>
        <annotator processor="ucto.1"/>
      </paragraph-annotation>
      <sentence-annotation>
        <annotator processor="ucto.1"/>
      </sentence-annotation>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation set="Ticcl-set">
        <annotator processor="FoLiA-correct.1"/>
      </correction-annotation>
    </annotations>
    <provenance>
      <processor xml:id="ucto.1" begindatetime="2022-10-06T12:10:53" command="ucto -X -L deu --textredundancy=full --id Walter bug.in bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
        <processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
        <processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
          <processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
        </processor>
      </processor>
      <processor xml:id="FoLiA-correct.1" begindatetime="2022-10-06T12:11:06" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
        <processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
      </processor>
    </provenance>
    <meta id="language">deu</meta>
  </metadata>
  <text xml:id="Walter.text">
    <p xml:id="Walter.p.1">
      <t>chat... Von</t>
      <s xml:id="Walter.p.1.s.1">
        <t>chat...</t>
        <w xml:id="Walter.p.1.s.1.w.1" class="WORD" processor="ucto.1" space="no">
          <t>chat</t>
        </w>
        <correction xml:id="Walter.p.1.s.1.correction.1">
          <new>
            <w xml:id="Walter.p.1.s.1.w.3.edit.1" processor="FoLiA-correct.1">
              <t>...</t>
            </w>
          </new>
          <original auth="no">
            <w xml:id="Walter.p.1.s.1.w.3" class="PUNCTUATION-MULTI" processor="ucto.1">
              <t>...</t>
            </w>
          </original>
        </correction>
      </s>
      <s xml:id="Walter.p.1.s.2">
        <t>Von</t>
        <w xml:id="Walter.p.1.s.2.w.1" class="WORD" processor="ucto.1">
          <t>Von</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

When parsing this file with folialint:

bug.xml failed: inconsistent text: node p(Walter.p.1) has a mismatch for the text in set:current
the element text ='chat... Von'
 the deeper text ='chat...Von'
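
For reference, the separator behaviour expected here can be sketched in a few lines of Python. This is only an illustration using the standard library's ElementTree on a cut-down copy of the paragraph above; it is not libfolia's implementation, and the helper names `tokens` and `deep_text` are made up for this sketch. It assumes that a word is followed by a space unless it carries space="no", and that a correction contributes the words inside its <new> branch.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Cut-down copy of paragraph Walter.p.1 (ids, metadata and the <t> on <p>/<s> omitted).
SAMPLE = """
<p xmlns="http://ilk.uvt.nl/folia">
  <s>
    <w space="no"><t>chat</t></w>
    <correction>
      <new><w><t>...</t></w></new>
      <original auth="no"><w><t>...</t></w></original>
    </correction>
  </s>
  <s>
    <w><t>Von</t></w>
  </s>
</p>
"""


def tokens(elem):
    """Yield (text, space_after) for every authoritative word below elem."""
    for child in elem:
        if child.tag == NS + "w":
            t = child.find(NS + "t")
            if t is not None and t.text:
                yield t.text, child.get("space", "yes") != "no"
        elif child.tag == NS + "correction":
            # Only the <new> branch counts; <original> is not authoritative.
            new = child.find(NS + "new")
            if new is not None:
                yield from tokens(new)
        elif child.tag in (NS + "s", NS + "p"):
            yield from tokens(child)


def deep_text(elem):
    """Join tokens, putting a space after each word unless it has space="no"."""
    parts = []
    space_pending = False
    for text, space_after in tokens(elem):
        if parts and space_pending:
            parts.append(" ")
        parts.append(text)
        space_pending = space_after
    return "".join(parts)


print(deep_text(ET.fromstring(SAMPLE)))  # prints: chat... Von
```

Because the word inside <new> is treated like any other word, it gets its trailing space and the result matches the text stated on the paragraph.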
@proycon
Member

proycon commented Oct 6, 2022

I indeed get something similar with foliapy:

$ foliavalidator issue49.folia.xml
VALIDATION ERROR on full parse by library (stage 2/3), in issue49.folia.xml
ParseError: FoLiA exception in handling of <p> @ line 35 (in parent <text> @ parent line 34) : [InconsistentText] Text for <Sentence at 140195206213216 id=Walter.p.1.s.1 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
chat
****> BUT FOUND (strict text after normalization) ****>
chat...
******* DEVIATION POINT: <*HERE*>chat...
(also checked against older rules prior to FoLiA v2.4.1)

@proycon
Member

proycon commented Oct 6, 2022

The foliapy error is correct though:

  • the text of sentence Walter.p.1.s.1 ends in an ellipsis
  • but the correction in fact removes the ellipsis (that means the text on the higher level shouldn't have it either)

Seems a bit different from the error in libfolia.
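
The two points above can be made concrete with a small Python sketch. It is not FoLiApy code: it uses the standard library's ElementTree, the helper names `word_texts` and `check` are invented here, and the sentence is a hypothetical reconstruction in which <new> is assumed to be empty (the first version of the example is not preserved in this thread).

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Hypothetical reconstruction of sentence Walter.p.1.s.1 as first posted:
# <new> is ASSUMED to be empty, so the correction deletes the ellipsis.
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia">
  <t>chat...</t>
  <w space="no"><t>chat</t></w>
  <correction>
    <new/>
    <original auth="no"><w><t>...</t></w></original>
  </correction>
</s>
"""


def word_texts(elem):
    """Yield the text of each authoritative word, following <new> only."""
    for child in elem:
        if child.tag == NS + "w":
            t = child.find(NS + "t")
            if t is not None and t.text:
                yield t.text
        elif child.tag == NS + "correction":
            new = child.find(NS + "new")
            if new is not None:
                yield from word_texts(new)


def check(sentence):
    """Return (stated text, deep text, whether they agree)."""
    stated = sentence.find(NS + "t").text
    # Spacing is ignored in this sketch; the sentence yields a single token.
    deep = "".join(word_texts(sentence))
    return stated, deep, stated == deep


print(check(ET.fromstring(SENTENCE)))  # prints: ('chat...', 'chat', False)
```

Under that assumption the deep text is "chat" while the sentence states "chat...", which is the same pair the validator reports.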

@kosloot
Contributor Author

kosloot commented Oct 6, 2022

OK, my bad.
I corrected the example to have the same ellipsis in <new> (silly, but OK).
folialint still gives the same error (but foliavalidator validates it).

kosloot added a commit that referenced this issue Oct 7, 2022
@kosloot
Contributor Author

kosloot commented Oct 7, 2022

A somewhat shaky solution is committed now. It needs testing.

@kosloot
Contributor Author

kosloot commented Oct 8, 2022

This fix enables @martinreynaert to run his corrections, but it also AGAIN shows a difference of opinion between libfolia and FoLiApy.

Running FoLiA-correct on only the first part of the title of the text file already reveals this. The produced FoLiA is rejected by foliavalidator, but folialint is satisfied. The latter is wrong, imnsho.

The test file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="bug" generator="libfolia-v2.12" version="2.5.1">
  <metadata type="native">
    <annotations>
      <token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
        <annotator processor="FoLiA-correct.1"/>
        <annotator processor="ucto.1"/>
      </token-annotation>
      <paragraph-annotation>
        <annotator processor="ucto.1"/>
      </paragraph-annotation>
      <sentence-annotation>
        <annotator processor="ucto.1"/>
      </sentence-annotation>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation set="Ticcl-set">
        <annotator processor="FoLiA-correct.1"/>
      </correction-annotation>
    </annotations>
    <provenance>
      <processor xml:id="ucto.1" begindatetime="2022-10-08T08:48:33" command="ucto -X -L deu --textredundancy=full --id bug bug.txt bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
        <processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
        <processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
          <processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
        </processor>
      </processor>
      <processor xml:id="FoLiA-correct.1" begindatetime="2022-10-08T08:49:07" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
        <processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
      </processor>
    </provenance>
    <meta id="language">deu</meta>
  </metadata>
  <text xml:id="bug.text">
    <p xml:id="bug.p.1">
      <t>Walter Muschg Freud</t>
      <t class="Ticcl">Walter musch Freud</t>
      <s xml:id="bug.p.1.s.1">
        <t>Walter Muschg Freud</t>
        <t class="Ticcl">Walter musch Freud</t>
        <w xml:id="bug.p.1.s.1.w.1" class="WORD" processor="ucto.1">
          <t>Walter</t>
          <t class="Ticcl" offset="0">Walter</t>
        </w>
        <correction xml:id="bug.p.1.s.1.correction.1">
          <new>
            <w xml:id="bug.p.1.s.1.w.2.edit.1" processor="FoLiA-correct.1">
              <t class="Ticcl" offset="7">musch</t>
            </w>
          </new>
          <original auth="no">
            <w xml:id="bug.p.1.s.1.w.2" class="WORD" processor="ucto.1">
              <t>Muschg</t>
            </w>
          </original>
        </correction>
        <w xml:id="bug.p.1.s.1.w.3" class="WORD" processor="ucto.1">
          <t>Freud</t>
          <t class="Ticcl" offset="13">Freud</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

$ folialint --nooutput bug.ticcl.folia.xml
Validated successfully: bug.ticcl.folia.xml

$ foliavalidator bug.ticcl.folia.xml
VALIDATION ERROR on full parse by library (stage 2/3), in bug.ticcl.folia.xml
ParseError: FoLiA exception in handling of <p> @ line 35 (in parent <text> @ parent line 34) : [InconsistentText] Text for <Sentence at 140336671748544 id=bug.p.1.s.1 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
Walter Freud
****> BUT FOUND (strict text after normalization) ****>
Walter Muschg Freud
******* DEVIATION POINT: Walter <*HERE*>Muschg Fre
(also checked against older rules prior to FoLiA v2.4.1)

@proycon I remember that issues like this have been discussed before, for example in proycon/folia#98 and proycon/folia#75.

But the argument has not been settled, it seems. And I agree that it is a difficult problem.
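
To make the disagreement easier to see, here is a rough Python sketch that compares the stated text with the deep text per text class for the sentence above. It uses only the standard library's ElementTree and is not how either library works internally; `text_of`, `words` and `report` are names invented for this sketch. It assumes a <t> without a class attribute belongs to class "current", that corrections are read through <new> only, and it joins words with single spaces.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Cut-down copy of sentence bug.p.1.s.1 (ids and processors omitted).
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia">
  <t>Walter Muschg Freud</t>
  <t class="Ticcl">Walter musch Freud</t>
  <w><t>Walter</t><t class="Ticcl" offset="0">Walter</t></w>
  <correction>
    <new><w><t class="Ticcl" offset="7">musch</t></w></new>
    <original auth="no"><w><t>Muschg</t></w></original>
  </correction>
  <w><t>Freud</t><t class="Ticcl" offset="13">Freud</t></w>
</s>
"""


def text_of(elem, cls):
    """Return the text of elem's direct <t> child in class cls, or None."""
    for t in elem.findall(NS + "t"):
        if t.get("class", "current") == cls:
            return t.text
    return None


def words(elem):
    """Yield authoritative <w> elements, following <new> inside corrections."""
    for child in elem:
        if child.tag == NS + "w":
            yield child
        elif child.tag == NS + "correction":
            new = child.find(NS + "new")
            if new is not None:
                yield from words(new)


def report(sentence, classes=("current", "Ticcl")):
    """Map each text class to (stated text, deep text)."""
    result = {}
    for cls in classes:
        found = [text_of(w, cls) for w in words(sentence)]
        deep = " ".join(txt for txt in found if txt)
        result[cls] = (text_of(sentence, cls), deep)
    return result


for cls, (stated, deep) in report(ET.fromstring(SENTENCE)).items():
    print(cls, "| stated:", stated, "| deep:", deep)
# prints:
# current | stated: Walter Muschg Freud | deep: Walter Freud
# Ticcl | stated: Walter musch Freud | deep: Walter musch Freud
```

Read this way, the "Ticcl" class is consistent, while the "current" class loses "Muschg" because the word in <new> carries no text in that class. That matches the foliavalidator message; whether the <original> text should be used as a fallback there is the open question.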

@kosloot
Contributor Author

kosloot commented Oct 8, 2022

Note: this is also related to proycon/folia#100, which is unfortunately deemed Low Priority.
