Support for Northern Sami language #17

Closed
osma opened this issue Sep 1, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@osma
Contributor

osma commented Sep 1, 2022

I propose that Simplemma could support the Northern Sami language (ISO 639-1 code se). I understood from this discussion that adding a new language would require a corpus of word + lemma pairs. My colleague @nikopartanen found at least these two corpora that could perhaps be used as raw material:

SIKOR North Saami free corpus - this is a relatively large (9M tokens) corpus. From the description:

The corpus has been automatically processed and linguistically analyzed with the Giellatekno/Divvun tools. Therefore, it may contain wrong annotations.

Another obvious corpus is the Universal Dependency treebank for Northern Sami.

Thoughts on these two corpora? What needs to be done to make this happen?

@adbar
Owner

adbar commented Sep 1, 2022

Hi again, thanks for the suggestions!

I could derive lemmatization data from the Universal Dependencies treebank. The first corpus doesn't look as good, since its automatic annotation could lead to wrong word pairs.

As I said before, concerning Sami I could use data from the Kaikki project but for now there are only about 5,000 words available. It could get better in the future but is not that usable for now.
Ideally I could combine the two resources and see if it leads anywhere.
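
As a rough illustration of what deriving word/lemma pairs from a UD treebank can look like, here is a minimal sketch that reads CoNLL-U lines (10 tab-separated columns, with FORM in column 2 and LEMMA in column 3). The function name `conllu_pairs` and the sample rows are hypothetical; this is not Simplemma's actual build script.

```python
def conllu_pairs(lines):
    """Yield (form, lemma) pairs from an iterable of CoNLL-U lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators (blank lines) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        # An underscore marks an unspecified lemma in this sketch.
        if lemma == "_":
            continue
        yield form, lemma


# Hypothetical sample in CoNLL-U layout; only the form/lemma pair
# "buorebut"/"bures" comes from this thread.
sample = [
    "# text = (hypothetical sample)",
    "1\tbuorebut\tbures\tADV\t_\t_\t0\troot\t_\t_",
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
]

pairs = list(conllu_pairs(sample))
```

Pairs collected this way could then be merged with pairs from another resource such as Kaikki, for example by taking the union of both sets.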

@adbar
Owner

adbar commented Sep 1, 2022

It's unclear to what extent the UD corpus has been manually corrected now that I further look at the description. There could be mistakes there as well, so SIKOR is probably also usable.

I lack the expertise to evaluate these resources on a qualitative level. Do you have any thoughts to share on the quality of the word/lemma pairs in these resources?

@nikopartanen

nikopartanen commented Sep 1, 2022

I can comment that as far as I know the UD corpus should be manually corrected. I think it was converted to the UD format from something else, in which the correction was probably already done.

@adbar
Owner

adbar commented Sep 1, 2022

I get many more word pairs from all the inflected forms in Kaikki than from UD (although the UD forms should be more frequent). I'll try to integrate the data soon.

@adbar
Owner

adbar commented Sep 5, 2022

It is now added (version 0.8.2, language code se). I used the opportunity to add a few other languages as well ✔️

The linguistic material I used to build the word pairs looks good but it is untested, so I'll leave the thread open. Feel free to report potential bugs here.

@osma
Contributor Author

osma commented Sep 7, 2022

This is great news! @nikopartanen and @mariguttorm are currently testing the Northern Sámi lemmatization on real world example texts.

@nikopartanen

Thank you @adbar! We made a small test file for Northern Sámi. The accuracy is around 75%, although the text also had some Finnish words and names. The file is here, with manually checked lemmas in the right-hand column:

https://gist.github.com/nikopartanen/b32f17a6e85dd8ebd02ad24968783a21

The text is from our Northern Sámi project announcement, so it may not be perfectly representative, but at least it's ours to share and work with.

One additional comment:

lemmatize("buorebut", lang=("se"))

>  'būres'

The correct lemma would be bures; ū is only used in dictionaries and similar contexts to mark the pronunciation of the long u. It shouldn't appear in lemmas in this context, but it probably pops up in the training data.
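
One possible way to clean such entries is sketched below, assuming the macron is the only pronunciation mark that needs removing. The helper name `strip_macrons` is hypothetical and this is not necessarily how the training data was actually corrected. It decomposes the text, drops only the combining macron (U+0304), and recomposes, so other diacritics such as á or č are left intact.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text):
    """Remove macrons (as in 'ū') while keeping all other diacritics."""
    # NFD splits 'ū' into 'u' + U+0304, and 'á' into 'a' + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    without_macron = decomposed.replace(COMBINING_MACRON, "")
    # NFC recombines the remaining base letters and marks.
    return unicodedata.normalize("NFC", without_macron)


cleaned = strip_macrons("būres")
```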

@nikopartanen

nikopartanen commented Sep 7, 2022

I added a version here that contains the Simplemma predictions in the third column, so it is easier to measure the accuracy and evaluate the current result.

https://gist.github.com/nikopartanen/b32f17a6e85dd8ebd02ad24968783a21
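
For reference, accuracy over such a file could be computed roughly as follows. This is a sketch that assumes a tab-separated layout of token, manually checked lemma, and predicted lemma; the actual layout of the gist may differ, and every sample row except the "buorebut" one is a made-up placeholder.

```python
import csv
import io


def lemma_accuracy(tsv_text):
    """Return the share of rows where the predicted lemma equals the gold lemma."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    total = 0
    correct = 0
    for row in reader:
        # Ignore blank or incomplete rows.
        if len(row) < 3:
            continue
        _token, gold, predicted = row[0], row[1], row[2]
        total += 1
        if gold == predicted:
            correct += 1
    return correct / total if total else 0.0


# Assumed columns: token, gold lemma, predicted lemma.
sample_tsv = (
    "buorebut\tbures\tbūres\n"
    "tokenA\tlemmaA\tlemmaA\n"
    "tokenB\tlemmaB\ttokenB\n"
    "tokenC\tlemmaC\tlemmaC\n"
)

accuracy = lemma_accuracy(sample_tsv)
```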

@adbar
Owner

adbar commented Sep 7, 2022

Hi @nikopartanen, thanks for the evaluation!
My impression is that the lemmatizer mostly behaves as expected: it rarely introduces mistakes (i.e. wrong lemmata), and nearly all errors are tokens which do not get lemmatized and stay as they are. Bearing that in mind, and considering the small size of the training data, I would say the accuracy isn't bad at all.
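
The distinction between the two error types can be made explicit with a small tally. The sketch below assumes rows of (token, gold lemma, predicted lemma); the function name `error_breakdown` and the placeholder rows are hypothetical.

```python
from collections import Counter


def error_breakdown(rows):
    """Count correct lemmas, tokens left unchanged, and wrong lemmas."""
    counts = Counter()
    for token, gold, predicted in rows:
        if predicted == gold:
            counts["correct"] += 1
        elif predicted == token:
            # The lemmatizer returned the input token as is.
            counts["unchanged"] += 1
        else:
            # The lemmatizer produced a different, incorrect lemma.
            counts["wrong_lemma"] += 1
    return counts


# Placeholder rows, except "buorebut", which is taken from this thread.
rows = [
    ("buorebut", "bures", "būres"),
    ("tokenA", "lemmaA", "lemmaA"),
    ("tokenB", "lemmaB", "tokenB"),
]

breakdown = error_breakdown(rows)
```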

Thanks for the suggestion, I will correct the entries comprising the ū symbol in the training data.

You could try to chain Northern Sámi and Finnish lemmatization to see if it changes something on your sample: lang=("se", "fi").
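
To illustrate the general idea of chaining, here is a toy sketch that tries one dictionary per language in order and falls back to the unchanged token. The dictionaries and the `lemmatize_chain` helper are hypothetical and much simpler than Simplemma's real lookup; with the library itself the call would be the one shown above, i.e. passing lang=("se", "fi") to lemmatize.

```python
def lemmatize_chain(token, dictionaries):
    """Return the lemma from the first dictionary that knows the token."""
    for lang_dict in dictionaries:
        lemma = lang_dict.get(token)
        if lemma is not None:
            return lemma
    # Unknown in every language: leave the token as is.
    return token


# Toy data: one Northern Sámi pair from this thread, one Finnish pair.
se_toy = {"buorebut": "bures"}
fi_toy = {"taloissa": "talo"}

results = [
    lemmatize_chain(tok, [se_toy, fi_toy])
    for tok in ["buorebut", "taloissa", "Helsinki"]
]
```

In this toy setup the order of the dictionaries matters only when a form is listed in more than one language, in which case the first language wins.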

@adbar
Owner

adbar commented Oct 5, 2022

Hi @nikopartanen & @osma, have you tried the chain described above and did it improve the results?

Also: since support has been added, can I close this issue for now?

@nikopartanen

I think the current behaviour is about as good as we can get with the current materials. If there are more lemmatized materials somewhere, then training the system with extended data could be done, but the current result is also certainly useful as a part of larger pipelines etc. The issue can be closed now, thank you very much for your work on this topic!

@osma
Contributor Author

osma commented Oct 28, 2022

Thanks again from my part as well. I will close the issue.

osma closed this as completed Oct 28, 2022