improve lemmatisation #63

Open
lukavdplas opened this issue Nov 11, 2022 · 10 comments


@lukavdplas
Contributor

There are some noticeable inaccuracies in the output from the Frog lemmatiser (such as *heden not being lemmatised to *heid); perhaps we can improve the lemmatisation.

One option is to add a different lemmatisation service that can be used instead of frog. We should investigate if there is a lemmatiser for Dutch with significantly better results.

Another option is to use some combination of the Frog and Alpino output for the final lemmatisation. Suggestion from @oktaal:

Interestingly, the Alpino parse does seem to find "gedweeheid" as the lemma in this case, but that information is not used in T-Scan.

What I could do is use the lemma attribute of the Alpino parse when (1) Frog's lemma is identical to the word (so no lemmatisation took place) and (2) Alpino's lemma does differ from the word. If both produce a lemma that differs from the word, Frog takes precedence. If Frog correctly determines that the lemma is the same as the word and Alpino has nevertheless turned it into something else, this does introduce an error.

I wonder to what extent this could create new problems; ideally you would want to be able to evaluate this. Perhaps this should become an option (lemma information: Frog only (the current behaviour), Alpino only, Frog with Alpino fallback, Alpino with Frog fallback).
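The four options could be sketched roughly as below. This is only an illustration of the rule described above, not code from T-Scan: the function name `choose_lemma`, the mode names, and the example word form "gedweeheden" are assumptions made for the sketch.

```python
def choose_lemma(word, frog_lemma, alpino_lemma, mode="frog_alpino_fallback"):
    """Pick a lemma from the Frog and Alpino output (hypothetical helper).

    A lemmatiser counts as having "done nothing" when its lemma is
    identical to the word form itself.
    """
    if mode == "frog":
        return frog_lemma
    if mode == "alpino":
        return alpino_lemma
    if mode == "frog_alpino_fallback":
        primary, secondary = frog_lemma, alpino_lemma
    elif mode == "alpino_frog_fallback":
        primary, secondary = alpino_lemma, frog_lemma
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Fall back only when the primary lemmatiser left the word unchanged
    # and the secondary one did change it.
    if primary == word and secondary != word:
        return secondary
    # In every other case the primary lemmatiser is leading.
    return primary


# "gedweeheden" is an assumed input word, used for illustration only.
print(choose_lemma("gedweeheden", "gedweeheden", "gedweeheid"))
```

Note that the fallback cannot tell "Frog missed a lemma" apart from "Frog was right to leave the word alone", which is exactly the risk mentioned above; an evaluation set would be needed to see how often that happens.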

@kosloot
Collaborator

kosloot commented Nov 14, 2022

Some remarks:
We all know Frog isn't perfect, but it already knows about 880 different *heden to *heid lemmas, which isn't that bad.
But the Frog lemmatizer does miss some (not all) forms outside those 880.
It isn't that hard to train extra lemmas into Frog's data files. All you need is a list of
word <tab> lemma <tab> POStag
cases.
That way you could improve things rather quickly, imho.
Background: the lemmatizer is trained on sentences from CGN, with additions from an extra list of known lemmas,
which can easily be expanded, with other missing lemmas too.
Training can be done using froggen. Not a very difficult task, once you are used to it :P
If you have a list in the right format, I am willing to do this task, IFF we may use this data to add to the Frog project as a whole.
(see also the froggen manual)
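For anyone preparing such a list, a minimal sketch of reading and writing the three-column format could look like this. The helper names are hypothetical, the example row is purely illustrative, and nothing here is taken from froggen itself; it only checks the shape "word, tab, lemma, tab, POS tag".

```python
def parse_lemma_line(line):
    """Split one 'word<TAB>lemma<TAB>POStag' line into its three fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3 or any(not field.strip() for field in fields):
        raise ValueError(f"expected 3 non-empty tab-separated fields: {line!r}")
    word, lemma, tag = fields
    return word, lemma, tag


def format_lemma_line(word, lemma, tag):
    """Write one entry back in the same tab-separated format."""
    return f"{word}\t{lemma}\t{tag}"


# Illustrative row; the tag must come from the CGN tag set.
row = parse_lemma_line("gedweeheden\tgedweeheid\tN(soort,mv,basis)\n")
print(row)
print(format_lemma_line(*row))
```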

@lukavdplas
Contributor Author

Thanks! I'm not very familiar with Frog myself, so I did not know that it could be retrained. This may be the best way forward for us. What do you think, @oktaal?

@oktaal
Contributor

oktaal commented Nov 29, 2022

Definitely. Thank you! We are going to collect a new list for Frog. You will see it some time in the future @kosloot

@kosloot
Collaborator

kosloot commented Nov 29, 2022

Looking forward to it. In the meantime I improved froggen a bit, to make everybody's life more comfortable.

NB: the POS tags must be from the CGN set.

@oktaal oktaal transferred this issue from oktaal/tscan Dec 13, 2022
@oktaal
Contributor

oktaal commented Jan 11, 2024

@kosloot hi! It's been a bit over a year, but here we finally have a list: https://github.com/CentreForDigitalHumanities/dutch-plurals/blob/main/output.tsv (I also have some time to look into it again now; it's been sitting there for a while). I'm not quite sure whether I also need to add rows containing just singulars. At the moment it only has rows like this:

EK's	EK	N(eigen,meervoud,basis)

I could easily add rows such as:

EK	EK	N(eigen,enkelvoud,basis)

if needed.

Words which cannot be pluralized such as "adrenaline", "marmer", "toedoen", etc only have a row such as:

adrenaline	adrenaline	N(basis,enkelvoud,basis)

I'm not sure if these need to be marked in some other way.

@kosloot
Collaborator

kosloot commented Jan 17, 2024

Thanks. I hope to be able to look into this "rsn".
Some remarks after skimming the data:

werkenelektro-encefalograaf N(basis,meervoud,basis)

Frog is already trained with:

elektro-encefalografen elektro-encefalograaf N(soort,mv,basis)

so your entry

  1. will lose the 'soort' tag,
  2. uses 'basis' twice, which might make the software barf (not tested),
  3. uses 'meervoud' where CGN uses 'mv' (based on the CGN tags as defined by Van Eynde, 2004).

So maybe it is wise to reconsider this list a bit
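The three problems above are easy to check for mechanically before sending a list. The following is a rough sketch under stated assumptions: `tag_problems` is a hypothetical helper, the `enkelvoud`/`meervoud` to `ev`/`mv` mapping only covers the two values seen in this thread, and it does not validate against the full CGN tag set.

```python
import re

# Matches tags of the form HEAD(feature,feature,...), e.g. N(soort,mv,basis)
TAG_RE = re.compile(r"^(\w+)\((.*)\)$")

# Spelled-out values seen in the list, with the CGN abbreviation
NON_CGN_VALUES = {"enkelvoud": "ev", "meervoud": "mv"}


def tag_problems(tag):
    """Return a list of human-readable problems found in one POS tag."""
    match = TAG_RE.match(tag)
    if match is None:
        return ["tag is not of the form HEAD(feature,...)"]
    head, body = match.groups()
    features = body.split(",") if body else []

    problems = []
    for feature in features:
        if feature in NON_CGN_VALUES:
            problems.append(f"'{feature}' should be '{NON_CGN_VALUES[feature]}'")
    for feature in sorted({f for f in features if features.count(f) > 1}):
        problems.append(f"feature '{feature}' occurs more than once")
    if head == "N" and features and features[0] not in ("soort", "eigen"):
        problems.append("first feature of N should be 'soort' or 'eigen'")
    return problems


for problem in tag_problems("N(basis,meervoud,basis)"):
    print(problem)
```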

@kosloot
Collaborator

kosloot commented Jan 17, 2024

Additional remark:
One of the lines reads:

Puerto Rico Puerto Rico N(eigen,enkelvoud,basis)

But multiword entries are NOT supported, so this entry will be skipped
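Such entries can be filtered out up front so it is clear which rows will be ignored. A small sketch, assuming the entries have already been read as (word, lemma, tag) tuples; `split_multiword` is a hypothetical name.

```python
def split_multiword(entries):
    """Separate single-token entries from entries containing whitespace.

    `entries` is an iterable of (word, lemma, tag) tuples.
    Returns (kept, skipped).
    """
    kept, skipped = [], []
    for word, lemma, tag in entries:
        # An entry is multiword when the word or the lemma has more than one token.
        if len(word.split()) > 1 or len(lemma.split()) > 1:
            skipped.append((word, lemma, tag))
        else:
            kept.append((word, lemma, tag))
    return kept, skipped


entries = [
    ("Puerto Rico", "Puerto Rico", "N(eigen,enkelvoud,basis)"),
    ("EK's", "EK", "N(eigen,meervoud,basis)"),
]
kept, skipped = split_multiword(entries)
print(len(kept), "kept,", len(skipped), "skipped")
```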

@oktaal
Contributor

oktaal commented Mar 1, 2024

It's been a while again, but I've updated the list so that it no longer has multiword entries (there were just two) and modified the tags to comply with Van Eynde's format. I wonder if I also need to add word gender?

@kosloot
Collaborator

kosloot commented Mar 2, 2024

Ok, we are getting close.
But there are still a few problems

  1. a lot of words are tagged as: N(soort,ev,basis), but Van Eynde uses a more fine-grained set:

    • [T101] N(soort,ev,basis,zijd,stan) die stoel, deze muziek, de filter
    • [T102] N(soort,ev,basis,onz,stan) het kind, ons huis, het filter
    • [T104] N(soort,ev,basis,gen) 's avonds, de heer des huizes
    • [T106] N(soort,ev,basis,dat) ter plaatse, heden ten dage
    • [U117] N(soort,ev,basis,genus,stan) een riool, geen filter
      I could probably modify Frog to do 'fuzzy matching', where N(soort,ev,basis) matches N(soort,ev,basis,onz,stan),
      but that is a lot of work and might have an unknown impact.
  2. a lot of proper names are tagged as: N(eigen,ev,basis)
    here too, Van Eynde is more specific:

    • [T109] N(eigen,ev,basis,zijd,stan) de Noordzee, de Kemmelberg, Karel
    • [T110] N(eigen,ev,basis,onz,stan) het Hageland, het Nederlands
    • [T112] N(eigen,ev,basis,gen) des Heren, Hagelands trots
    • [T114] N(eigen,ev,basis,dat) wat den Here toekomt
    • [U118] N(eigen,ev,basis,genus,stan) Linux, Esselte

    BUT the Frog tagger is trained on data where all proper names are tagged as SPEC(deeleigen),
    so this is a nice shortcut I would advise.
    To be clear: whenever the tagger tags a word as SPEC(deeleigen), the lemmatizer will take a shortcut
    and use the word itself as the lemma, effectively ignoring the lemma assigned in the lemmata data.

  3. one entry is: ietsje ietsje N(soort,ev,dim)
    this tag is also not known, but N(soort,ev,dim,onz,stan) will probably do?
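To make the 'fuzzy matching' idea from point 1 concrete, here is one possible reading of it: a coarse tag matches a fine-grained tag when its features are a prefix of the fine-grained ones. This is a sketch only and not how Frog works today; the function names are hypothetical and the tag list is just the five tags quoted above.

```python
def parse_tag(tag):
    """Split 'N(soort,ev,basis)' into ('N', ['soort', 'ev', 'basis'])."""
    head, _, rest = tag.partition("(")
    features = [f for f in rest.rstrip(")").split(",") if f]
    return head, features


def coarse_matches(coarse, fine):
    """True when `coarse` has the same head and a feature prefix of `fine`."""
    coarse_head, coarse_features = parse_tag(coarse)
    fine_head, fine_features = parse_tag(fine)
    return (coarse_head == fine_head
            and fine_features[:len(coarse_features)] == coarse_features)


# The fine-grained tags quoted under point 1.
FINE_TAGS = [
    "N(soort,ev,basis,zijd,stan)",
    "N(soort,ev,basis,onz,stan)",
    "N(soort,ev,basis,gen)",
    "N(soort,ev,basis,dat)",
    "N(soort,ev,basis,genus,stan)",
]

candidates = [tag for tag in FINE_TAGS if coarse_matches("N(soort,ev,basis)", tag)]
print(len(candidates), "fine-grained tags match N(soort,ev,basis)")
```

Because one coarse tag matches several fine-grained tags at once, the matching is ambiguous, which is one way to see why the impact is hard to predict and why supplying the gender and case features in the list is the safer route.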

That's all folks

@kosloot
Collaborator

kosloot commented Mar 9, 2024

Ok, it is even more complex than I thought. I took a better look at the second case, the N(eigen,mv,basis) words,
and not ALL of them should be (exclusively) tagged as SPEC(deeleigen).
There is also a range of N(soort,mv,basis) tags among them: words like Zuid-Molukkers and Zwitsers.

And others are to be seen as an ADJ, not as an N.
These are cases like
Zuid-Nederlandse Zuid-Nederlands ADJ(prenom,basis,met-e,stan)
Zuid-Amerikaans Zuid-Amerikaans ADJ(prenom,basis,zonder)
Zuid-Amerikaanse Zuid-Amerikaans ADJ(prenom,basis,met-e,stan)

To cite Van Eynde:

Adjectives used nominally (or substantively) are not treated as nouns, but as adjectives.

(page 20 of my copy)

BUT: some of these cases are ambiguous, we will also need:
Zuid-Amerikaanse Zuid-Amerikaanse SPEC(deeleigen) or
Zuid-Chinese Zuid-Chinese SPEC(deeleigen)

This is quite clumsy, but that's the historic way
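The interaction between the tag and the lemma can be pictured as a lookup keyed on both the word form and its tag. The sketch below is a deliberate simplification and not Frog's actual lemmatizer: `lemma_for` and the `LEMMATA` table are hypothetical, and unknown combinations simply return `None` here.

```python
# Hypothetical lemma table keyed on (word form, POS tag),
# filled with the ADJ rows quoted above.
LEMMATA = {
    ("Zuid-Nederlandse", "ADJ(prenom,basis,met-e,stan)"): "Zuid-Nederlands",
    ("Zuid-Amerikaans", "ADJ(prenom,basis,zonder)"): "Zuid-Amerikaans",
    ("Zuid-Amerikaanse", "ADJ(prenom,basis,met-e,stan)"): "Zuid-Amerikaans",
}


def lemma_for(word, tag, lemmata):
    """Return the lemma for a tagged word, or None when it is unknown."""
    # Shortcut described earlier in this thread: for SPEC(deeleigen)
    # the word itself is used as the lemma.
    if tag == "SPEC(deeleigen)":
        return word
    return lemmata.get((word, tag))


# The same word form gets a different lemma depending on the tag.
print(lemma_for("Zuid-Amerikaanse", "ADJ(prenom,basis,met-e,stan)", LEMMATA))
print(lemma_for("Zuid-Amerikaanse", "SPEC(deeleigen)", LEMMATA))
```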
