Ucto with 'detectlanguages' : failure #87

Closed
martinreynaert opened this issue Mar 29, 2022 · 3 comments
@martinreynaert

Hi,

Running this command I get the following result:

(LMdev) reynaert@violet:MARXENGELS$ ucto --textredundancy=full --detectlanguages=eng,fra,deu,ita,spa MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_2.folia.xml > MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_3.folia.xml 2>>FoliaUcto.20220329.stderr

(LMdev) reynaert@violet:MARXENGELS$ cat FoliaUcto.20220329.stderr
ucto: inputfile = MARXENGELS-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19-V0_2.folia.xml
ucto: outputfile =
ucto: textcat configured from: /home/reynaert/LMdev/share/ucto/textcat.cfg
ucto: configured for languages: [eng,fra,deu,ita,spa]
ucto:ucto: --filter=NO is automatically set. inputclass equals outputclass!
ucto: inconsistent text: adding text (class=current) from node: t() with value
'Leporello — a character from Mozat's opera Don Giovanni: Don Juan's servant.'
to element: p(MarxEngels-A-2003_19-The_Collected_Works_of_Karl_Marx_and_Frederick_Engels_General_Works_1844_1895_Volume_19.text.div.3.div.1.div.2.div.2.div.25.p.10) which already has text in that class and value:
'Leporello — a character from Mozart's opera Don Giovanni: Don Juan's servant. — 43'

Please note that a single character is lost: "Mozart's" becomes "Mozat's". I believe I have seen this error before.
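To pin down exactly which character went missing between the two text values in such a message, a small diff helps. The following is only a sketch using Python's standard `difflib`; the helper name `lost_characters` is made up for this illustration and is not part of ucto.

```python
from difflib import SequenceMatcher


def lost_characters(expected, actual):
    """Return (index, text) for every run of `expected` that is missing from `actual`."""
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [
        (i1, expected[i1:i2])
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("delete", "replace")
    ]


# The two spellings taken from the error message above.
print(lost_characters("Mozart's", "Mozat's"))
```

The index refers to the position in the expected string, which makes it easy to locate the spot in longer paragraphs.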

I have 47 files in this batch. This happened for 25 of them.

This does not occur when I run Ucto with '-L eng'. However, I do need language detection on this corpus.

I attach the input file.

TEST.zip

Thank you!

@kosloot
Contributor

kosloot commented Jun 12, 2022

OK, I examined this problem, and it comes down to an issue in the file tokconfig-ita in the uctodata package.
The string Mozart's is split by the PREFIX rule into 4 parts:

Moza
<none>
t'
s

Granted, t' IS a prefix, but I don't think it should be treated as one when it is directly preceded by Mozar.
So I assume something is wrong here.
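To illustrate how a split like this can drop a character, here is a minimal Python sketch. Both patterns are hypothetical and written only for this example; they are NOT the PREFIX rule from tokconfig-ita, and Python's `re` stands in for the regex engine ucto actually uses. The idea is that any character matched outside a capture group disappears when the groups are reassembled, and that a zero-width lookbehind can guard the prefix without consuming anything.

```python
import re

# Hypothetical pattern for illustration only; this is NOT the PREFIX rule
# shipped in tokconfig-ita. The lone "." lies outside every capture group,
# so the character it matches never reaches the output tokens.
LOSSY_RULE = re.compile(r"^(.*?)(\s)?.([dlt]')(.*)$")

# Same shape, but the "." is replaced by a zero-width lookbehind that only
# accepts the prefix when no word character directly precedes it.
GUARDED_RULE = re.compile(r"^(.*?)(\s)?(?<!\w)([dlt]')(.*)$")


def split_by_groups(rule, text):
    """Return the capture groups of a match, or [text] if the rule does not match."""
    match = rule.match(text)
    if match is None:
        return [text]
    return list(match.groups())


def preserves_text(parts, text):
    """True if concatenating the parts (skipping unmatched groups) gives back the input."""
    return "".join(p for p in parts if p is not None) == text


for name, rule in (("lossy", LOSSY_RULE), ("guarded", GUARDED_RULE)):
    parts = split_by_groups(rule, "Mozart's")
    print(name, parts, "text preserved:", preserves_text(parts, "Mozart's"))
```

With the hypothetical lossy pattern, Mozart's comes out as the same four parts listed above, with the r gone. The guarded variant does not match inside Mozart's at all, while it still splits a word such as l'opera at the prefix. A check like `preserves_text` is a cheap regression test for any tokenizer rule, since the parts of a split should always concatenate back to the input.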

This config-file for Italian is created by @proycon so I hope he is able to fix this.

NOTE: I expect the same kind of problem with tokconfig-fra.

kosloot added a commit to LanguageMachines/uctodata that referenced this issue Jun 12, 2022
@kosloot
Contributor

kosloot commented Jun 12, 2022

I updated the PREFIX rule in tokconfig-ita to avoid losing the r in Mozart's.
The inconsistent text error is gone now.
But the tokenization is still wrong, since the PREFIX rule still matches here, which is incorrect.

@kosloot
Contributor

kosloot commented Jun 13, 2022

OK, after consulting @proycon I implemented a simple fix for Italian and French.
It seems to work OK, but we don't have a lot of tests.

kosloot closed this as completed Jun 13, 2022