Words that match more than one lemma #94

Open
juanjoDiaz opened this issue May 26, 2023 · 5 comments
Labels
question Further information is requested

Comments

@juanjoDiaz
Collaborator

Hi,

I noticed a problem with the behaviour for the word 'schulen' in German (as a noun vs as a verb) when using all capitals:

>>> simplemma.lemmatize("Schulen", "de")
'Schule'  # plural form lemmatized as noun "Schule"
>>> simplemma.lemmatize("schulen", "de")
'schulen'  # infinitive form lemmatized as verb "schulen"

The all-caps version appears to match the lowercase form:

>>> simplemma.lemmatize("SCHULEN", "de")
'schulen'

Of course, without context it is impossible to know which reading to prefer in this case.
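
To make the ambiguity concrete, here is a minimal sketch that looks up every case variant of a token and collects the distinct lemmas. The table `TOY_DICT` and the helper `case_variant_lemmas` are hypothetical and written only for illustration; this is not simplemma's data or its actual lookup logic.

```python
# Toy form -> lemma table. Hypothetical data, NOT simplemma's dictionary.
TOY_DICT = {
    "Schulen": "Schule",   # noun, plural
    "schulen": "schulen",  # verb, infinitive
}


def case_variant_lemmas(token, table):
    """Return the distinct lemmas found for the token's case variants, in lookup order."""
    variants = [token, token.lower(), token.capitalize()]
    lemmas = []
    for variant in variants:
        lemma = table.get(variant)
        if lemma is not None and lemma not in lemmas:
            lemmas.append(lemma)
    return lemmas


print(case_variant_lemmas("SCHULEN", TOY_DICT))  # ['schulen', 'Schule']
print(case_variant_lemmas("Schulen", TOY_DICT))  # ['Schule', 'schulen']
```

With an all-caps token the original casing carries no information, so both readings come back and only the lookup order decides which one is first.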

I also noticed that some words in the dictionaries map to more than one lemma, for example:

>>> simplemma.lemmatize("Sie", "de")
'Sie|sie'

Why are there words with multiple lemmas in the dictionaries?
Should we consider supporting this on the simplemma side? I mean changing the strategies so that they can find multiple matches and return all of them somehow.

Wdyt?

@adbar
Owner

adbar commented May 26, 2023

Tough one. This is an absolute borderline case, since multiple matches are usually not present in the word lists and, where they are, they may be annotated differently.

Concerning the "noun vs. verb" issue, this is indeed one of the main limitations of simplemma: it does not operate with syntactic information.

@adbar added the question (Further information is requested) label Jun 1, 2023
@1over137
Contributor

Personally, I think the best way to handle this is to actually capture all examples (Schulen -> [Schule, schulen]) present in the corpus, arrange them by frequency, and make a new API (let's call it simplemma.lemmatize_all), which returns a list rather than a single word. This doesn't require drastically complicating the architecture but would still be useful for many situations.
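
A rough sketch of what this could look like, under stated assumptions: `lemmatize_all` does not exist in simplemma today, and the corpus, counts, and helper `build_candidates` below are made up for illustration. Case folding the lookup key is a choice of this sketch, not something the proposal requires.

```python
from collections import Counter

# Hypothetical annotated corpus of (surface form, lemma) pairs; the counts are invented.
TOY_CORPUS = [
    ("Schulen", "Schule"),
    ("Schulen", "Schule"),
    ("Schulen", "Schule"),
    ("schulen", "schulen"),
    ("Schulen", "schulen"),  # e.g. a sentence-initial verb
]


def build_candidates(pairs):
    """Map each case-folded form to a Counter of the lemmas observed for it."""
    candidates = {}
    for form, lemma in pairs:
        candidates.setdefault(form.lower(), Counter())[lemma] += 1
    return candidates


def lemmatize_all(token, candidates):
    """Return every known lemma for the token, most frequent first.

    Falls back to the token itself when nothing is known about it.
    """
    counts = candidates.get(token.lower())
    if not counts:
        return [token]
    return [lemma for lemma, _ in counts.most_common()]


CANDIDATES = build_candidates(TOY_CORPUS)
print(lemmatize_all("SCHULEN", CANDIDATES))  # ['Schule', 'schulen']
```

A caller that wants a single lemma can take the first element, so the existing single-result behaviour stays available on top of the list.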

@adbar
Owner

adbar commented Mar 25, 2024

The approach you suggest would probably give better results, but memory is already a concern for the available dictionaries. One way or the other, there is always a tradeoff between precision, memory and processing time.

@juanjoDiaz
Collaborator Author

Regarding the issue of having multiple lemmas separated by |, this only happens in German, and only for a few words:

  • ('Sie', 'Sie|sie')
  • ('Sich', 'er|es|sie')
  • ('er|es|sie', 'er|es|sie')
  • ('sich', 'er|es|sie')

So, I think that we should just modify the training script to correct these.
Unfortunately, that script is not public yet (requested in #102) so I can't do a PR.
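
Until then, a minimal sketch of the kind of clean-up step I have in mind, run on a toy excerpt that mirrors the four entries listed above. The dict layout, the helper names, and especially the `OVERRIDES` values are assumptions made for illustration; the real training script and dictionary format are not shown in this thread.

```python
# Toy excerpt of form -> lemma entries; the layout is assumed, not simplemma's actual format.
TOY_ENTRIES = {
    "Sie": "Sie|sie",
    "Sich": "er|es|sie",
    "er|es|sie": "er|es|sie",
    "sich": "er|es|sie",
    "Schulen": "Schule",
}

# Hypothetical manual corrections; which lemma is "right" is a judgment call.
OVERRIDES = {"Sie": "Sie", "Sich": "sich", "sich": "sich"}


def find_pipe_entries(entries):
    """Return the (form, lemma) pairs in which either side contains '|'."""
    return [(f, l) for f, l in entries.items() if "|" in f or "|" in l]


def clean_entries(entries, overrides):
    """Drop entries whose form contains '|' and resolve pipe-separated lemmas."""
    cleaned = {}
    for form, lemma in entries.items():
        if "|" in form:
            continue  # the form itself is an annotation artifact, not a word
        if "|" in lemma:
            if form not in overrides:
                raise KeyError(f"no override for ambiguous entry {form!r}")
            lemma = overrides[form]
        cleaned[form] = lemma
    return cleaned


print(find_pipe_entries(TOY_ENTRIES))
print(clean_entries(TOY_ENTRIES, OVERRIDES))
```

Raising on an unresolved entry is deliberate in this sketch: a new pipe-separated lemma in a future data dump would then be noticed instead of silently shipped.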

Regarding the proposal of returning a list of potential lemmas instead of a single lemma, that is just what I proposed in the original issue. I guess that it's a matter of offering both options so that the user can control memory use.
This can be easily done with the strategies framework that I wrote.
Once again, if you could publish how the dictionaries are trained, I'm happy to give it a go and give you some numbers and a proposal in the form of a PR.

@1over137
Contributor

1over137 commented Mar 25, 2024

It would be great to publish the dictionary creation scripts. In particular, I feel that it would be nice to augment the existing data by passing some corpus through an LLM, which should be at least decent at the job. My users (https://github.com/FreeLanguageTools/vocabsieve) often complain that the lemmatizer coverage is poor for some languages, which severely hinders usability. I know this isn't very elegant and may increase disk space use, but sometimes practicality is more important.

A good first step in this direction would be to produce better eval datasets, though. I think the current eval datasets are too small to be representative, as they contain rather few unique lemmas. For this kind of primarily dictionary-based lemmatization, I don't think there is a real need to separate training and validation sets, because you are primarily just memorizing stuff anyways, not generalizing.
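
As a starting point, a small sketch of the numbers I would look at for an eval set: how many tokens it has, how many unique lemmas, and how often the lemmatizer matches the gold lemma. `eval_stats` and the toy data are hypothetical and only there for illustration; the `lemmatize` argument is any callable from token to lemma.

```python
def eval_stats(pairs, lemmatize):
    """Summarize an eval set of (token, gold_lemma) pairs for a given lemmatizer."""
    pairs = list(pairs)
    unique_lemmas = {gold for _, gold in pairs}
    correct = sum(1 for token, gold in pairs if lemmatize(token) == gold)
    return {
        "tokens": len(pairs),
        "unique_lemmas": len(unique_lemmas),
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


# Toy eval set and toy lookup table; both are invented for this example.
TOY_EVAL = [
    ("Schulen", "Schule"),
    ("Häuser", "Haus"),
    ("Schulen", "Schule"),
    ("ging", "gehen"),
]
TOY_LOOKUP = {"Schulen": "Schule", "Häuser": "Haus"}

stats = eval_stats(TOY_EVAL, lambda token: TOY_LOOKUP.get(token, token))
print(stats)  # {'tokens': 4, 'unique_lemmas': 3, 'accuracy': 0.75}
```

To measure simplemma itself, the lambda could be swapped for something like `lambda token: simplemma.lemmatize(token, "de")`, using the same call shown earlier in this thread.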


3 participants