Skip to content
This repository has been archived by the owner on Aug 15, 2023. It is now read-only.

navikt/ai-lab-spacy-bokmaal

Repository files navigation

Norwegian model for spaCy

This project's goal is to develop a Norwegian language model for Python library spaCy.

About the data

The model was based on an unpublished dataset developed by Nasjonalbiblioteket, Schibsted and Language Technology Group at UiO (to be published autumn 2018) in .conllu format and following Universal Dependencies annotation. It is similar to this treebank: https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal only it contains NER tags in the 10th column of each token line and interprets some parts of speech differently (mostly pronouns/determiners).

How to train the model

First of all, I converted the .conllu files (train, dev and test) to .json, because that's the format spaCy supports:

python -m spacy convert path/to/train-file.conllu path/to/output/directory -m
python -m spacy convert path/to/dev-file.conllu path/to/output/directory -m
python -m spacy convert path/to/test-file.conllu path/to/output/directory -m

I chose the '-m' option to be able to append morphological features to POS tags, so I end up with a tag looking like this: 'NOUN__Definite=Ind|Gender=Neut|Number=Sing' instead of just plain 'NOUN'.

To train the model I run the command:

python -m spacy train nb path/to/output/directory path/to/train-file.json path/to/dev-file.json -n 10

This gave me the following results:

Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 17237.645 1388.183 84.014 75.985 74.985 75.482 93.674 100.000 5044.2
1 423.721 13.932 86.053 78.795 78.276 78.535 94.655 100.000 5988.9
2 350.019 9.434 86.922 77.964 77.917 77.941 95.026 100.000 6096.9
3 306.036 7.817 87.333 79.476 78.097 78.781 95.146 100.000 5308.6
4 272.160 6.333 87.579 80.036 79.174 79.603 95.198 100.000 5649.8
5 247.511 5.729 87.845 78.030 78.217 78.123 95.292 100.000 5796.3
6 226.625 5.190 87.824 78.652 78.935 78.793 95.363 100.000 5884.3
7 208.418 4.364 87.852 78.481 77.917 78.198 95.308 100.000 5853.0
8 191.116 3.985 88.132 77.718 77.858 77.788 95.212 100.000 5856.2
9 178.484 3.454 88.129 78.797 77.618 78.203 47.975 100.000 5853.8

The best model seems to be model 6, so I turn it into a package for convenience:

python -m spacy package path/to/model path/to/output/directory

Before I do that I manually fill out information in file meta.json that lies in path/to/model (name, version, description etc). The script will throw an exception if those fields are empty.

In order to install the package I now go to the package's directory and run:

python setup.py sdist

and

pip install /path/to/dist/name-of-model.tar.gz

Loading and using the model

I can now load the model from Python shell with:

spacy.load('name-of-model')

or evaluate it:

python -m spacy evaluate name-of-model path/to/test-file.json

which gives me the folllowing result:

Time               5.00 s         
Words              30034          
Words/s            6010           
TOK                100.00         
POS                94.40          
UAS                87.28          
LAS                84.46          
NER P              70.85          
NER R              70.34          
NER F              70.59   

Custom EntityMatcher

I also added a custom pipeline component, EntityMatcher. Its code is based on this solution: https://support.prodi.gy/t/adding-custom-attribute-to-doc-having-ner-use-attribute/356/6

In order for it to work I had to add entity_matcher to the model's meta.json pipeline and to the language's factory (in __init__.py). The files that the component extracts data from are placed in a separate folder entity_matcher and their names contain one of the 4 labels used: PER, LOC, ORG or MISC.

EntityMatcher extracts data from various text files and labels them with an entity tag (PER, LOC, ORG or MISC). It enforces 'PROPN___' tag on all the entity tokens too (since all named entities are proper nouns). One can also check if the entity label comes from custom EntityMatcher or built-in EntityRecognizer by using the 'via_patterns' extension, which returns True in the first case and False in the other.

Extending the model with custom entity labels

To add entities with a custom label check the guide here (using Matcher/PhraseMatcher that doesn't require context/sentences): https://spacy.io/usage/linguistic-features#adding-phrase-patterns

, especially the section Adding on_match rules and examples. Or here if your goal is to train EntityRecognizer with custom labels (requires whole sentences tagged with named entity label): https://spacy.io/usage/examples#section-training

Examples

Examples of use can be found in spacy_examples.py

About

Norwegian language model for spaCy

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages