improve lemmatisation #63

lukavdplas · 2022-11-11T10:48:03Z

There are some noticeable inaccuracies in the output from the frog lemmatiser (such as *heden not being lemmatised to *heid); perhaps we can improve the lemmatisation.

One option is to add a different lemmatisation service that can be used instead of frog. We should investigate if there is a lemmatiser for Dutch with significantly better results.

Another option is to use some combination of the frog and alpino output for the final lemmatisation. Suggestion from @oktaal:

Interestingly, the Alpino parse does seem to find "gedweeheid" as the lemma in this case, but that information is not used in T-Scan.

What I could do is use the lemma attribute from the Alpino parse if (1) Frog's lemma is identical to the word (so no lemmatisation took place) and (2) Alpino's lemma does differ from the word. If both have a lemma that differs from the word, Frog takes precedence. If Frog correctly determines that the lemma is the same as the word and Alpino has nevertheless turned it into something else, this does introduce an error.

I wonder to what extent this could create new problems; ideally you would want to be able to evaluate this. Perhaps this should become an option (lemma information: Frog only (the current behaviour), Alpino only, Frog with Alpino fallback, Alpino with Frog fallback).
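The fallback rule and the four proposed options can be sketched roughly as follows. This is only an illustration of the logic described above, not T-Scan code: the names `LemmaSource` and `choose_lemma` are made up for this sketch, and comparing the lemma to the word with plain string equality is an assumption.

```python
from enum import Enum


class LemmaSource(Enum):
    """The four options proposed above (names are hypothetical)."""
    FROG_ONLY = "frog"
    ALPINO_ONLY = "alpino"
    FROG_WITH_ALPINO_FALLBACK = "frog+alpino"
    ALPINO_WITH_FROG_FALLBACK = "alpino+frog"


def choose_lemma(word, frog_lemma, alpino_lemma,
                 mode=LemmaSource.FROG_WITH_ALPINO_FALLBACK):
    """Pick a lemma for `word` from the Frog and Alpino candidates."""
    if mode is LemmaSource.FROG_ONLY:
        return frog_lemma
    if mode is LemmaSource.ALPINO_ONLY:
        return alpino_lemma
    if mode is LemmaSource.FROG_WITH_ALPINO_FALLBACK:
        primary, fallback = frog_lemma, alpino_lemma
    else:
        primary, fallback = alpino_lemma, frog_lemma
    # The primary source wins, unless it left the word unchanged
    # while the fallback source did produce a different lemma.
    if primary == word and fallback != word:
        return fallback
    return primary
```

With the default mode, `choose_lemma("gedweeheden", "gedweeheden", "gedweeheid")` picks the Alpino lemma, while `choose_lemma("huizen", "huis", "huizen")` keeps Frog's. The known weakness stays: when Frog is right to leave a word unchanged and Alpino is not, the fallback introduces an error.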


kosloot · 2022-11-14T18:39:06Z

Some remarks:
We all know Frog isn't perfect, but it already knows about 880 different *heden to *heid lemmas, which isn't that bad.
But the Frog lemmatizer does miss some (not all) forms outside those 880.
It isn't that hard to train extra lemmas into Frog's data files. All you need is a list of
word <tab> lemma <tab> POStag
cases.
That way you could improve things rather quickly, imho.
Background: the lemmatizer is trained on sentences from CGN, with additions from an extra list of known lemmas,
which can easily be expanded with other missing lemmas too.
Training can be done using froggen. Not a very difficult task, once you are used to it :P
If you have a list in the right format, I am willing to do this task, IFF we may use this data to add to the Frog project as a whole.
(see also the froggen manual)
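For reference, a minimal sketch of reading such a list and checking that every line has exactly the three tab-separated fields. The names `LemmaEntry` and `parse_lemma_list` are hypothetical, and this only checks the shape of the file; it does not verify that the tags belong to the CGN set or reproduce whatever validation froggen does itself.

```python
from typing import NamedTuple


class LemmaEntry(NamedTuple):
    word: str
    lemma: str
    pos_tag: str


def parse_lemma_list(text):
    """Parse lines of the form word<TAB>lemma<TAB>POStag, skipping blank lines."""
    entries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 3 or not all(fields):
            raise ValueError(
                f"line {line_number}: expected word<TAB>lemma<TAB>POStag, got {line!r}"
            )
        entries.append(LemmaEntry(*fields))
    return entries
```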

lukavdplas · 2022-11-15T09:09:21Z

Thanks! I'm not very familiar with frog myself, so I did not know that it could be retrained. This may be the best way forward for us. What do you think, @oktaal?

oktaal · 2022-11-29T15:40:25Z

Definitely. Thank you! We are going to collect a new list for Frog. You will see it some time in the future @kosloot

kosloot · 2022-11-29T18:57:46Z

Looking forward to it. In the meantime I improved froggen a bit, to make everybody's life more comfortable.

NB: the POS tags must be from the CGN set.

oktaal · 2024-01-11T13:48:09Z

@kosloot hi! It's been a bit over a year, but here we finally have a list: https://github.com/CentreForDigitalHumanities/dutch-plurals/blob/main/output.tsv (I also have some time to look into it now again, it's been sitting there for a while). I'm not quite sure if I also need to add rows just containing singulars, so e.g. it only has rows like this:

EK's	EK	N(eigen,meervoud,basis)

I could easily add rows such as:

EK	EK	N(eigen,enkelvoud,basis)

if needed.

Words which cannot be pluralized such as "adrenaline", "marmer", "toedoen", etc only have a row such as:

adrenaline	adrenaline	N(basis,enkelvoud,basis)

I'm not sure if these need to be marked in some other way.

kosloot · 2024-01-17T08:14:24Z

Thanks. I hope to be able to look into this "rsn".
Some remarks after skimming the data:

werkenelektro-encefalograaf N(basis,meervoud,basis)

Frog is already trained with:

elektro-encefalografen elektro-encefalograaf N(soort,mv,basis)

so your entry

- will lose the 'soort' tag,
- uses 'basis' twice, which might make the software barf (not tested).
- You use 'meervoud' where CGN uses 'mv' (based on the CGN tags as defined by Van Eynde, 2004).

So maybe it is wise to reconsider this list a bit.
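A rough sketch of how two of these points could be checked automatically: renaming the long feature names to the CGN abbreviations and reporting features that occur more than once. The mapping 'enkelvoud' to 'ev' is an assumption by analogy with 'meervoud' to 'mv', and `normalise_tag` is a made-up helper. It deliberately does not guess which 'basis' should have been 'soort'.

```python
import re

# Long feature names used in the list -> CGN abbreviations.
# 'meervoud' -> 'mv' is from the remark above; 'enkelvoud' -> 'ev' is assumed.
FEATURE_RENAMES = {"enkelvoud": "ev", "meervoud": "mv"}

TAG_PATTERN = re.compile(r"^([A-Z]+)\((.*)\)$")


def normalise_tag(tag):
    """Return (normalised_tag, duplicated_features) for a tag like N(a,b,c)."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a HEAD(feature,...) tag: {tag!r}")
    head, body = match.groups()
    features = [FEATURE_RENAMES.get(f.strip(), f.strip()) for f in body.split(",")]
    duplicates = sorted({f for f in features if features.count(f) > 1})
    return f"{head}({','.join(features)})", duplicates
```

For example, `normalise_tag("N(basis,meervoud,basis)")` gives the renamed tag together with `["basis"]` as the duplicated feature, so such rows can be reviewed by hand.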

kosloot · 2024-01-17T13:35:25Z

Additional remark:
One of the lines reads:

Puerto Rico Puerto Rico N(eigen,enkelvoud,basis)

But multiword entries are NOT supported, so this entry will be skipped
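Such rows can be separated out before training. A small sketch, assuming the rows have already been split on tabs into (word, lemma, tag) tuples; `split_multiword` is a hypothetical helper name.

```python
def split_multiword(rows):
    """Split (word, lemma, tag) rows into single-word rows and multiword rows."""
    kept, skipped = [], []
    for row in rows:
        word, lemma, _tag = row
        # Any whitespace inside the word or the lemma makes it a multiword entry.
        if any(ch.isspace() for ch in word) or any(ch.isspace() for ch in lemma):
            skipped.append(row)
        else:
            kept.append(row)
    return kept, skipped
```

Returning the skipped rows as well makes it easy to see exactly which entries would be dropped.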

oktaal · 2024-03-01T12:34:19Z

It's been a while again, but I've updated the list to no longer have multiword entries (there were just two) and modified the tags to be compliant with Van Eynde's format. I wonder if I also need to add word gender?

kosloot · 2024-03-02T10:23:22Z

Ok, we are getting close.
But there are still a few problems:

A lot of words are tagged as N(soort,ev,basis), but Van Eynde uses more fine-grained tags:
- [T101] N(soort,ev,basis,zijd,stan) die stoel, deze muziek, de filter
- [T102] N(soort,ev,basis,onz,stan) het kind, ons huis, het filter
- [T104] N(soort,ev,basis,gen) 's avonds, de heer des huizes
- [T106] N(soort,ev,basis,dat) ter plaatse, heden ten dage
- [U117] N(soort,ev,basis,genus,stan) een riool, geen filter

I could probably modify Frog to do 'fuzzy matching', where N(soort,ev,basis) matches N(soort,ev,basis,onz,stan),
but that is a lot of work and might have an unknown impact.

A lot of proper names are tagged as N(eigen,ev,basis).
Here too, Van Eynde is more specific:
- [T109] N(eigen,ev,basis,zijd,stan) de Noordzee, de Kemmelberg, Karel
- [T110] N(eigen,ev,basis,onz,stan) het Hageland, het Nederlands
- [T112] N(eigen,ev,basis,gen) des Heren, Hagelands trots
- [T114] N(eigen,ev,basis,dat) wat den Here toekomt
- [U118] N(eigen,ev,basis,genus,stan) Linux, Esselte

BUT the Frog tagger is trained on data where all proper names are tagged as SPEC(deeleigen),
so this is a nice shortcut I would advise.
To be clear: whenever the tagger tags a word as SPEC(deeleigen), the lemmatizer takes a shortcut
and uses the word as the lemma, effectively ignoring the lemma assigned in the lemmata data.

One entry is: ietsje ietsje N(soort,ev,dim)
This tag is also not known, but probably N(soort,ev,dim,onz,stan) will do?

That's all folks
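To find the underspecified tags in the list, something like the following could help. It is only a sketch: `QUOTED_TAGS` contains just the ten tags quoted in this comment, not the full CGN tag set, and `more_specific_candidates` is a made-up helper, not part of Frog.

```python
# Only the tags quoted in this comment; the real CGN tag set is much larger.
QUOTED_TAGS = {
    "N(soort,ev,basis,zijd,stan)",
    "N(soort,ev,basis,onz,stan)",
    "N(soort,ev,basis,gen)",
    "N(soort,ev,basis,dat)",
    "N(soort,ev,basis,genus,stan)",
    "N(eigen,ev,basis,zijd,stan)",
    "N(eigen,ev,basis,onz,stan)",
    "N(eigen,ev,basis,gen)",
    "N(eigen,ev,basis,dat)",
    "N(eigen,ev,basis,genus,stan)",
}


def more_specific_candidates(tag, known_tags=QUOTED_TAGS):
    """List known tags that extend `tag` with extra features.

    Returns an empty list when `tag` is itself a known tag.
    """
    if tag in known_tags or not tag.endswith(")"):
        return []
    prefix = tag[:-1] + ","  # "N(soort,ev,basis)" -> "N(soort,ev,basis,"
    return sorted(t for t in known_tags if t.startswith(prefix))
```

Rows for which this returns a non-empty list still need a gender or case decision; the sketch cannot make that choice, it only shows which fuller tags are possible.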

kosloot · 2024-03-09T10:24:04Z

Ok, it is even more complex than I thought. I took a better look at the second case of N(eigen,mv,basis) words.
And not ALL of them should be (exclusively) tagged as SPEC(deeleigen).
There is also a range of N(soort,mv,basis) tags among them: words like Zuid-Molukkers and Zwitsers.

And others are to be seen as an ADJ, not as an N.
These are cases like:
Zuid-Nederlandse Zuid-Nederlands ADJ(prenom,basis,met-e,stan)
Zuid-Amerikaans Zuid-Amerikaans ADJ(prenom,basis,zonder)
Zuid-Amerikaanse Zuid-Amerikaans ADJ(prenom,basis,met-e,stan)

To cite Van Eynde (translated from Dutch):

Nominally (or substantively) used adjectives are not treated as nouns, but as adjectives.

(page 20 of my copy)

BUT: some of these cases are ambiguous, so we will also need:
Zuid-Amerikaanse Zuid-Amerikaanse SPEC(deeleigen) or
Zuid-Chinese Zuid-Chinese SPEC(deeleigen)

This is quite clumsy, but that's the historic way.
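To illustrate why the same word needs more than one row: the lemma depends on the tag the tagger assigns. The sketch below is not Frog's implementation, just a toy lookup keyed on (word, tag) with the SPEC(deeleigen) shortcut described earlier in this thread. `LEMMATA` and `lookup_lemma` are made-up names, and falling back to the word itself for unknown pairs is an assumption.

```python
DEELEIGEN = "SPEC(deeleigen)"

# (word, tag) -> lemma, using rows quoted in this comment.
LEMMATA = {
    ("Zuid-Nederlandse", "ADJ(prenom,basis,met-e,stan)"): "Zuid-Nederlands",
    ("Zuid-Amerikaans", "ADJ(prenom,basis,zonder)"): "Zuid-Amerikaans",
    ("Zuid-Amerikaanse", "ADJ(prenom,basis,met-e,stan)"): "Zuid-Amerikaans",
}


def lookup_lemma(word, tag, lemmata=LEMMATA):
    """Return the lemma for a word given the tag assigned by the tagger."""
    if tag == DEELEIGEN:
        # Shortcut: a word tagged as part of a proper name is its own lemma.
        return word
    # Assumption for this sketch: unknown (word, tag) pairs keep the word.
    return lemmata.get((word, tag), word)
```

So "Zuid-Amerikaanse" comes out as "Zuid-Amerikaans" when tagged as an adjective, and stays "Zuid-Amerikaanse" when tagged SPEC(deeleigen).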

oktaal transferred this issue from oktaal/tscan Dec 13, 2022
