Ucto with 'detectlanguages' : failure #87

martinreynaert · 2022-03-29T19:03:50Z

Hi,

Running this command I get the following result:

(LMdev) reynaert@violet:MARXENGELS$ ucto --textredundancy=full --detectlanguages=eng,fra,deu,ita,spa MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_2.folia.xml > MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_3.folia.xml 2>>FoliaUcto.20220329.stderr

(LMdev) reynaert@violet:MARXENGELS$ cat FoliaUcto.20220329.stderr
ucto: inputfile = MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_2.folia.xml
ucto: outputfile =
ucto: textcat configured from: /home/reynaert/LMdev/share/ucto/textcat.cfg
ucto: configured for languages: [eng,fra,deu,ita,spa]
ucto:ucto: --filter=NO is automatically set. inputclass equals outputclass!
ucto: inconsistent text: adding text (class=current) from node: t() with value
'Leporello — a character from Mozat's opera Don Giovanni: Don Juan's servant.'
to element: p(MarxEngels-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19.text.div.3.div.1.div.2.div.2.div.25.p.10) which already has text in that class and value:
'Leporello — a character from Mozart's opera Don Giovanni: Don Juan's servant. — 43'

Please note the loss of a single character in "Mozat's". I seem to have seen this error before.

I have 47 files in this batch. This happened for 25 of them.

This does not occur when I run Ucto with '-L eng'. However, I do want language recognition on this corpus.

I attach the input file.

TEST.zip

Thank you!

kosloot · 2022-06-12T19:27:33Z

OK, I examined this, and it boils down to a problem in the file tokconfig-ita in the uctodata package.
The string Mozart's is split by the PREFIX rule into 4 parts:

Moza
<none>
t'
s

Granted, t' IS a prefix, but I don't think it should be treated as one when directly preceded by Mozar.
So I assume something is wrong in that rule.

This config file for Italian was created by @proycon, so I hope he is able to fix this.

NOTE: I expect the same kind of problems with tokconfig-fra
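To illustrate how a prefix rule can both split inside a word and drop a character, here is a minimal Python sketch. The prefix list and both patterns are hypothetical and are not copied from tokconfig-ita; the "buggy" pattern simply consumes one context character outside any capture group, which is one way to end up with the Moza / t' / s result shown above. The "anchored" variant requires a word boundary before the prefix, so t' no longer matches in the middle of Mozart's.

```python
import re

# Hypothetical elision prefixes, for illustration only (not the real tokconfig-ita list).
PREFIXES = r"(?:l|d|t|un|dell)"

# Assumed buggy shape: the character before the prefix is matched by "." but
# sits outside every capture group, so it vanishes when the groups are reused.
BUGGY = re.compile(rf"^(.*?)(?:^|.)({PREFIXES}')(.*)$", re.IGNORECASE)

# Anchored shape: \b demands a word boundary before the prefix and consumes nothing.
ANCHORED = re.compile(rf"^(.*?)\b({PREFIXES}')(.*)$", re.IGNORECASE)


def split_prefix(token, pattern):
    """Split a token with the given pattern; return it unchanged if nothing matches."""
    match = pattern.match(token)
    if match is None:
        return [token]
    # Keep only the non-empty captured pieces.
    return [part for part in match.groups() if part]


if __name__ == "__main__":
    for name, pattern in (("buggy", BUGGY), ("anchored", ANCHORED)):
        for token in ("Mozart's", "l'opera"):
            print(name, token, split_prefix(token, pattern))
```

With these assumed patterns, the buggy variant rebuilds Mozart's as Mozat's, while the anchored variant leaves the token alone and still splits l'opera into l' and opera.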


kosloot · 2022-06-12T22:07:07Z

I updated the PREFIX rule in tokconfig-ita to avoid losing the r in Mozart's.
The inconsistent text error is gone now.
But the tokenization is still wrong, because the PREFIX rule still matches inside Mozart's, which is incorrect.

kosloot · 2022-06-13T09:51:14Z

OK, after consulting @proycon I implemented a simple fix for Italian and French.
It seems to work OK, but we don't have many tests for it.
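Since test coverage is thin, one cheap regression check is a round-trip test: joining the tokens again should reproduce the input text once whitespace is ignored. The sketch below is only an assumption about how such a test could look. The two tokenizers are stand-ins written for this example; a real test would call ucto with the Italian or French configuration instead.

```python
from typing import Callable, Iterable, List


def loses_characters(tokenize: Callable[[str], Iterable[str]], text: str) -> bool:
    """Return True if the joined tokens differ from the text, ignoring whitespace."""
    original = "".join(text.split())
    rebuilt = "".join("".join(token.split()) for token in tokenize(text))
    return original != rebuilt


# Stand-in tokenizers, hypothetical and for illustration only.
def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lossy_tokenizer(text: str) -> List[str]:
    # Deliberately drops the "r" in "Mozart's" to mimic the reported symptom.
    return text.replace("Mozart's", "Mozat's").split()


SAMPLE = "a character from Mozart's opera Don Giovanni"

if __name__ == "__main__":
    print("whitespace tokenizer loses text:", loses_characters(whitespace_tokenizer, SAMPLE))
    print("lossy tokenizer loses text:", loses_characters(lossy_tokenizer, SAMPLE))
```

A check like this only catches lost or altered characters. It would not flag the unwanted split itself, so expected token lists for a few apostrophe cases would still be needed.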

martinreynaert assigned proycon and kosloot Mar 30, 2022

kosloot added a commit to LanguageMachines/uctodata that referenced this issue Jun 12, 2022

small fix to prevent loosing a character in the PREFIX rule. (see LanguageMachines/ucto#87 ) This doesn't fix the unwanted splits though.

b1ba20d

kosloot closed this as completed Jun 13, 2022
