Hi @juanjoDiaz, thanks for the feedback, that's odd indeed.
Words written in all caps currently remain untouched in case they are acronyms (e.g. BRICS). That being said, it is safe to say that a token of len > x is most probably not an acronym and can be lowercased if the language is in BETTER_LOWER. For English, long acronyms are rare, so we need to decide on a length limit. I'd say 6 or 7: https://en.wiktionary.org/wiki/Category:English_acronyms
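A minimal sketch of that heuristic, not simplemma's actual code. The function name `normalize_caps`, the limit of 6, and the contents of `BETTER_LOWER` below are placeholders chosen for illustration; the thread does not settle any of them.

```python
# Hypothetical stand-ins: the real BETTER_LOWER contents and the final
# length limit are not given in this thread.
BETTER_LOWER = {"en"}
MAX_ACRONYM_LENGTH = 6


def normalize_caps(token: str, lang: str, limit: int = MAX_ACRONYM_LENGTH) -> str:
    """Lowercase all-caps tokens longer than `limit`; leave shorter ones untouched."""
    if lang in BETTER_LOWER and token.isupper() and len(token) > limit:
        return token.lower()
    return token


print(normalize_caps("BRICS", "en"))         # BRICS (5 characters, kept as a likely acronym)
print(normalize_caps("INSTRUCTIONS", "en"))  # instructions
print(normalize_caps("TRAINING", "en"))      # training
```

The trade-off sits in the limit: a lower value lowercases more real acronyms, a higher one leaves more ordinary words in caps.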
Hi @adbar,
Using simplemma, my team has found multiple odd cases related to capitalization.
When words are fully capitalized, lemmatization doesn't seem correct.
I have no idea why TRAINING is in the dictionary but INSTRUCTIONS is not.
I don't think that TRAINING should be.
And we might want to adjust the dictionary lookup strategy to try full lowercasing the word.
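Roughly what I have in mind, as a sketch only: `TOY_DICTIONARY` and `lookup_with_lowercase` are made-up names, and the entries are toy data rather than anything taken from simplemma's dictionaries.

```python
from typing import Dict, Optional

# Toy dictionary for illustration only; not simplemma's actual data.
TOY_DICTIONARY: Dict[str, str] = {
    "instructions": "instruction",
    "BRICS": "BRICS",
}


def lookup_with_lowercase(token: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Try the token as written first, then its fully lowercased form."""
    for candidate in (token, token.lower()):
        lemma = dictionary.get(candidate)
        if lemma is not None:
            return lemma
    return None


print(lookup_with_lowercase("INSTRUCTIONS", TOY_DICTIONARY))  # instruction
print(lookup_with_lowercase("BRICS", TOY_DICTIONARY))         # BRICS
print(lookup_with_lowercase("UNKNOWN", TOY_DICTIONARY))       # None
```

Trying the original form first means an acronym that has its own entry still wins over the lowercased fallback.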
I can do a PR since the change is easy.
But I have no idea how you test whether changes in the strategies improve or worsen the accuracy of simplemma.
It would be good to get that documented so anyone working on the library can run the tests.
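To make the request concrete, here is a generic sketch of the kind of check I mean: score two strategies against the same gold (token, lemma) pairs and compare. I don't know how simplemma is actually evaluated, so everything here (`accuracy`, `baseline`, `candidate`, the `GOLD` list) is a hypothetical toy, not the project's test setup.

```python
from typing import Callable, Iterable, Tuple


def accuracy(lemmatize: Callable[[str], str], gold: Iterable[Tuple[str, str]]) -> float:
    """Share of gold (token, lemma) pairs that the lemmatizer gets exactly right."""
    pairs = list(gold)
    if not pairs:
        return 0.0
    correct = sum(1 for token, lemma in pairs if lemmatize(token) == lemma)
    return correct / len(pairs)


def baseline(token: str) -> str:
    # Toy rule standing in for a real lemmatizer: strip a trailing lowercase "s".
    return token[:-1] if token.endswith("s") else token


def candidate(token: str) -> str:
    # Same toy rule, but long all-caps tokens are lowercased first.
    if token.isupper() and len(token) > 6:
        token = token.lower()
    return baseline(token)


# Tiny hand-written gold sample, for illustration only.
GOLD = [("INSTRUCTIONS", "instruction"), ("BRICS", "BRICS"), ("cats", "cat")]

print(f"baseline:  {accuracy(baseline, GOLD):.2f}")   # baseline:  0.67
print(f"candidate: {accuracy(candidate, GOLD):.2f}")  # candidate: 1.00
```

With a real gold-standard corpus in place of `GOLD`, the same comparison would show whether a strategy change helps or hurts.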
Wdyt?