Ok, I examined this problem, and it boils down to a problem in the file tokconfig-ita in the uctodata package.
The string Mozart's is split by the PREFIX rule into 4 parts:
Moza
<none>
t'
s
Granted, t' IS a prefix, but I don't think that holds when it is directly preceded by Mozar.
So I assume there is some trouble here.
This config file for Italian was created by @proycon, so I hope he is able to fix this.
NOTE: I expect the same kind of problems with tokconfig-fra.
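To illustrate the kind of problem described above, here is a minimal Python sketch. It does NOT use the actual PREFIX rule from tokconfig-ita: the prefix list and the `split_prefix` helper are hypothetical stand-ins. It only shows how an elision-prefix pattern that is not anchored to a word start can fire in the middle of Mozart's, and how a word-boundary anchor would avoid that. It does not reproduce the loss of the r.

```python
import re

# Hypothetical list of elided prefixes, for illustration only.
# This is NOT the rule set shipped in tokconfig-ita.
PREFIXES = ["l", "t", "dell"]
alternation = "|".join(sorted(PREFIXES, key=len, reverse=True))

# Unanchored: may match anywhere, including inside a longer word.
unanchored = re.compile(rf"(?:{alternation})'", re.IGNORECASE)
# Anchored: the prefix must start at a word boundary.
anchored = re.compile(rf"\b(?:{alternation})'", re.IGNORECASE)


def split_prefix(pattern, word):
    """Split word around the first match of pattern, dropping empty parts."""
    m = pattern.search(word)
    if m is None:
        return [word]
    parts = [word[:m.start()], m.group(0), word[m.end():]]
    return [p for p in parts if p]


print(split_prefix(unanchored, "Mozart's"))  # ['Mozar', "t'", 's']
print(split_prefix(anchored, "Mozart's"))    # ["Mozart's"]
print(split_prefix(anchored, "t'amo"))       # ["t'", 'amo']
```

Whether a boundary anchor is the right fix for the real rule is an open question; the sketch only shows why "t' directly preceded by Mozar" should not count as a prefix.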
kosloot added a commit to LanguageMachines/uctodata that referenced this issue on Jun 12, 2022
I updated the PREFIX rule in tokconfig-ita to avoid losing the r in Mozart's.
The inconsistent text error is gone now.
But the tokenization is still wrong, since the PREFIX rule still matches, which is incorrect.
Hi,
Running this command I get the following result:
(LMdev) reynaert@violet:MARXENGELS$ ucto --textredundancy=full --detectlanguages=eng,fra,deu,ita,spa MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_2.folia.xml > MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_3.folia.xml 2>>FoliaUcto.20220329.stderr
(LMdev) reynaert@violet:MARXENGELS$ cat FoliaUcto.20220329.stderr
ucto: inputfile = MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_2.folia.xml
ucto: outputfile =
ucto: textcat configured from: /home/reynaert/LMdev/share/ucto/textcat.cfg
ucto: configured for languages: [eng,fra,deu,ita,spa]
ucto:ucto: --filter=NO is automatically set. inputclass equals outputclass!
ucto: inconsistent text: adding text (class=current) from node: t() with value
'Leporello — a character from Mozat's opera Don Giovanni: Don Juan's servant.'
to element: p(MarxEngels-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19.text.div.3.div.1.div.2.div.2.div.25.p.10) which already has text in that class and value:
'Leporello — a character from Mozart's opera Don Giovanni: Don Juan's servant. — 43'
Please note the loss of a single character in "Mozat's". I seem to have seen this error before.
I have 47 files in this batch. This happened for 25 of them.
This does not occur when I run Ucto with '-L eng'. I will however want to have language recognition on this corpus.
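For anyone who wants to locate such a loss automatically, here is a small sketch using Python's standard `difflib`. The `differences` helper is a hypothetical name, not part of ucto; the two strings are copied from the error message above.

```python
import difflib


def differences(new_text, existing_text):
    """Return the non-equal opcodes between the two text values."""
    matcher = difflib.SequenceMatcher(None, new_text, existing_text, autojunk=False)
    return [
        (tag, new_text[i1:i2], existing_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Values taken from the "inconsistent text" message in the log.
new_text = "Leporello — a character from Mozat's opera Don Giovanni: Don Juan's servant."
existing_text = "Leporello — a character from Mozart's opera Don Giovanni: Don Juan's servant. — 43"

for tag, new_part, existing_part in differences(new_text, existing_text):
    print(tag, repr(new_part), repr(existing_part))
```

The first reported difference is the r missing from Mozart's; the trailing " — 43" also shows up, since it is present only in the existing paragraph text.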
I attach the input file.
TEST.zip
Thank you!