incorrect extraction of deep text from a document with corrections #49
Comments
I indeed get something similar with foliapy:
The foliapy error is correct though:
Seems a bit different from the error in libfolia.
Ok, my bad.
A somewhat shaky solution is committed now. Needs testing.
This fix enables @martinreynaert to run his corrections, but it also AGAIN shows a difference of opinion between libfolia and foliapy. Running FoLiA-correct on only the first part of the title of the text file already reveals this. The produced FoLiA is rejected by
The test file:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="bug" generator="libfolia-v2.12" version="2.5.1">
<metadata type="native">
<annotations>
<token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
<annotator processor="FoLiA-correct.1"/>
<annotator processor="ucto.1"/>
</token-annotation>
<paragraph-annotation>
<annotator processor="ucto.1"/>
</paragraph-annotation>
<sentence-annotation>
<annotator processor="ucto.1"/>
</sentence-annotation>
<text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
<correction-annotation set="Ticcl-set">
<annotator processor="FoLiA-correct.1"/>
</correction-annotation>
</annotations>
<provenance>
<processor xml:id="ucto.1" begindatetime="2022-10-08T08:48:33" command="ucto -X -L deu --textredundancy=full --id bug bug.txt bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
<processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
<processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
<processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
</processor>
</processor>
<processor xml:id="FoLiA-correct.1" begindatetime="2022-10-08T08:49:07" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
<processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
</processor>
</provenance>
<meta id="language">deu</meta>
</metadata>
<text xml:id="bug.text">
<p xml:id="bug.p.1">
<t>Walter Muschg Freud</t>
<t class="Ticcl">Walter musch Freud</t>
<s xml:id="bug.p.1.s.1">
<t>Walter Muschg Freud</t>
<t class="Ticcl">Walter musch Freud</t>
<w xml:id="bug.p.1.s.1.w.1" class="WORD" processor="ucto.1">
<t>Walter</t>
<t class="Ticcl" offset="0">Walter</t>
</w>
<correction xml:id="bug.p.1.s.1.correction.1">
<new>
<w xml:id="bug.p.1.s.1.w.2.edit.1" processor="FoLiA-correct.1">
<t class="Ticcl" offset="7">musch</t>
</w>
</new>
<original auth="no">
<w xml:id="bug.p.1.s.1.w.2" class="WORD" processor="ucto.1">
<t>Muschg</t>
</w>
</original>
</correction>
<w xml:id="bug.p.1.s.1.w.3" class="WORD" processor="ucto.1">
<t>Freud</t>
<t class="Ticcl" offset="13">Freud</t>
</w>
</s>
</p>
</text>
</FoLiA>
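As a quick sanity check on the test file above, the following stdlib-only Python sketch verifies that every `offset` on a `Ticcl`-class `<t>` points at the matching substring of the sentence-level `Ticcl` text. It uses neither libfolia nor foliapy and does not reproduce the logic of either validator. The helper name `offset_mismatches` is made up for illustration, and the sketch assumes that offsets are relative to the enclosing sentence's text of the same class.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
NS = {"f": FOLIA_NS}

# Trimmed copy of the sentence from the test file: Ticcl-class text only,
# plus the original word inside the correction.
FRAGMENT = """<s xmlns="http://ilk.uvt.nl/folia" xml:id="bug.p.1.s.1">
  <t class="Ticcl">Walter musch Freud</t>
  <w xml:id="bug.p.1.s.1.w.1"><t class="Ticcl" offset="0">Walter</t></w>
  <correction xml:id="bug.p.1.s.1.correction.1">
    <new><w xml:id="bug.p.1.s.1.w.2.edit.1"><t class="Ticcl" offset="7">musch</t></w></new>
    <original auth="no"><w xml:id="bug.p.1.s.1.w.2"><t>Muschg</t></w></original>
  </correction>
  <w xml:id="bug.p.1.s.1.w.3"><t class="Ticcl" offset="13">Freud</t></w>
</s>"""


def offset_mismatches(sentence, cls):
    """Return (offset, expected, found) for each <t> whose offset is wrong."""
    # Assumption: the reference text is the sentence's own <t> of this class.
    parent_text = sentence.find(f"f:t[@class='{cls}']", NS).text
    problems = []
    for t in sentence.iter(f"{{{FOLIA_NS}}}t"):
        if t.get("class") != cls or t.get("offset") is None:
            continue  # other text class, or the sentence-level text itself
        start = int(t.get("offset"))
        found = parent_text[start:start + len(t.text)]
        if found != t.text:
            problems.append((start, t.text, found))
    return problems


sentence = ET.fromstring(FRAGMENT)
print(offset_mismatches(sentence, "Ticcl"))
```

For this fragment the list is empty: the offsets 0, 7 and 13 all line up with "Walter musch Freud", so under the stated assumption the `Ticcl` offsets themselves are consistent.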
@proycon I remember that issues like this have been discussed before, but the argument has not been settled, it seems. And I agree that it is a difficult problem.
Note: this is also related to proycon/folia#100, which is unfortunately deemed low priority.
The text() extraction function fails to extract the correct text from a sentence where the last Word is a Correction and the sentence is followed by another sentence.
This came up in: LanguageMachines/foliautils#66
When the last Word is truly a Word, a space separator is added, and everything is fine. But in the case of a Correction, the space is omitted, gluing the text of the two sentences together.
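Separately from the FoLiA example file, the effect described above can be modeled in a few lines of Python. This is a toy model only: the classes `Word`, `Correction` and `Sentence` and the functions `text_glued` and `text_separated` are hypothetical stand-ins, not libfolia or foliapy code, and the sample words are made up. It assumes the trailing separator is taken from the last element of a sentence, and only when that element is a plain word.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Word:
    text: str


@dataclass
class Correction:
    new: Word  # the corrected word that replaces the original


@dataclass
class Sentence:
    items: List[Union[Word, Correction]]


def item_text(item):
    # A correction contributes the text of its new word.
    return item.text if isinstance(item, Word) else item.new.text


def sentence_text(sentence):
    return " ".join(item_text(i) for i in sentence.items)


def text_glued(sentences):
    """Model of the reported behavior: separator only after a plain Word."""
    out = ""
    for s in sentences:
        out += sentence_text(s)
        if isinstance(s.items[-1], Word):
            out += " "
    return out.rstrip()


def text_separated(sentences):
    """Expected behavior: always separate consecutive sentences."""
    return " ".join(sentence_text(s) for s in sentences)


doc = [
    Sentence([Word("Walter"), Correction(Word("musch"))]),
    Sentence([Word("Freud"), Word("writes")]),
]
print(text_glued(doc))      # Walter muschFreud writes
print(text_separated(doc))  # Walter musch Freud writes
```

The first line of output shows the two sentences glued together because the first one ends in a correction; the second shows the text one would expect.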
Example (rather braindead, but it proves the point)
When parsing this file with folialint: