improve lemmatisation #63
Comments
Some remarks: |
Thanks! I'm not very familiar with Frog myself, so I did not know that it could be retrained. This may be the best way forward for us, what do you think, @oktaal? |
Definitely. Thank you! We are going to collect a new list for Frog. You will see it some time in the future @kosloot |
Looking forward to it. In the meantime I improved froggen a bit, to make everybody's life more comfortable. NB: the POS tags must be from the CGN set. |
@kosloot hi! It's been a bit over a year, but here we finally have a list: https://github.com/CentreForDigitalHumanities/dutch-plurals/blob/main/output.tsv (I also have some time to look into it now again, it's been sitting there for a while). I'm not quite sure if I also need to add rows just containing singulars, so e.g. it only has rows like this:
I could easily add rows such as:
if needed. Words which cannot be pluralized, such as "adrenaline", "marmer", "toedoen", etc., only have a row such as:
I'm not sure if these need to be marked in some other way. |
Thanks. I hope to be able to look into this "rsn".
Frog is already trained with:
so your entry
So maybe it is wise to reconsider this list a bit |
Additional remark:
But multiword entries are NOT supported, so this entry will be skipped |
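For reference, a minimal sketch of how multiword entries could be filtered out of a tab-separated list before handing it to froggen. This is an illustration only: the function name `drop_multiword_rows`, the assumption that the word form sits in the first column, and the sample rows are all hypothetical and not taken from the actual `output.tsv`.

```python
import csv
import io


def drop_multiword_rows(tsv_text, word_column=0):
    """Return the rows whose word-form cell is a single token.

    Assumption: the word form is in column `word_column` and a
    multiword entry is any cell containing internal whitespace.
    """
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    kept = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row[word_column].split()) > 1:
            continue  # multiword entry, not supported
        kept.append(row)
    return kept


# Made-up sample rows (word form, lemma); not real data from the list.
sample = "huizen\thuis\nad hoc\tad hoc\nboeken\tboek\n"
filtered = drop_multiword_rows(sample)
```

Filtering on the list side keeps the skipped entries visible to whoever maintains the list, rather than relying on them being silently dropped during training.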
It's been a while again, but I've updated the list to no longer have multiword entries (there were just two) and modified the tags to be compliant with VanEynde's format. I wonder if I also need to add word gender? |
Ok, we are getting close.
That's all folks |
Ok, it is even more complex than I thought. I took a better look at the second case of And others are to be seen as an To cite vanEynde:
(page 20 of my copy) BUT: some of these cases are ambiguous, we will also need: This is quite clumsy, but that's the historic way |
There are some noticeable inaccuracies in the output from the frog lemmatiser (such as *heden not being lemmatised to *heid), perhaps we can improve the lemmatisation.
One option is to add a different lemmatisation service that can be used instead of frog. We should investigate if there is a lemmatiser for Dutch with significantly better results.
Another option is to use some combination of the frog and alpino output for the final lemmatisation. Suggestion from @oktaal
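To make the second option more concrete, here is a rough sketch of one possible combination rule. It assumes the Frog and Alpino lemmas have already been obtained and aligned token by token (how that is done is out of scope here). The function name `combine_lemmas`, the fallback heuristic, and the example lemmas are hypothetical illustrations, not actual output of either tool.

```python
def combine_lemmas(tokens, frog_lemmas, alpino_lemmas):
    """Combine two token-aligned lemma sequences.

    Hypothetical heuristic: keep the Frog lemma, unless Frog returned
    the surface form unchanged while Alpino proposes something else.
    """
    if not (len(tokens) == len(frog_lemmas) == len(alpino_lemmas)):
        raise ValueError("inputs must be token-aligned and of equal length")

    combined = []
    for token, frog, alpino in zip(tokens, frog_lemmas, alpino_lemmas):
        frog_unchanged = frog.lower() == token.lower()
        alpino_changed = alpino.lower() != token.lower()
        if frog_unchanged and alpino_changed:
            combined.append(alpino)  # Frog may have missed this one
        else:
            combined.append(frog)
    return combined


# Made-up lemma lists illustrating the *heden / *heid case.
tokens = ["de", "mogelijkheden", "zijn"]
frog = ["de", "mogelijkheden", "zijn"]
alpino = ["de", "mogelijkheid", "zijn"]
result = combine_lemmas(tokens, frog, alpino)
```

Whether such a rule actually improves results would have to be checked against a sample of real output from both tools.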