This project's goal is to develop a Norwegian language model for Python library spaCy.
The model was based on an unpublished dataset developed by Nasjonalbiblioteket, Schibsted and Language Technology Group at UiO (to be published autumn 2018) in .conllu format and following Universal Dependencies annotation. It is similar to this treebank: https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal only it contains NER tags in the 10th column of each token line and interprets some parts of speech differently (mostly pronouns/determiners).
First of all, I converted the .conllu files (train, dev and test) to .json, because that's the format spaCy supports:
python -m spacy convert path/to/train-file.conllu path/to/output/directory -m
python -m spacy convert path/to/dev-file.conllu path/to/output/directory -m
python -m spacy convert path/to/test-file.conllu path/to/output/directory -m
I chose the '-m' option to be able to append morphological features to POS tags, so I end up with a tag looking like this: 'NOUN__Definite=Ind|Gender=Neut|Number=Sing' instead of just plain 'NOUN'.
To train the model I run the command:
python -m spacy train nb path/to/output/directory path/to/train-file.json path/to/dev-file.json -n 10
This gave me the following results:
Itn. | P.Loss | N.Loss | UAS | NER P. | NER R. | NER F. | Tag % | Token % | |
---|---|---|---|---|---|---|---|---|---|
0 | 17237.645 | 1388.183 | 84.014 | 75.985 | 74.985 | 75.482 | 93.674 | 100.000 | 5044.2 |
1 | 423.721 | 13.932 | 86.053 | 78.795 | 78.276 | 78.535 | 94.655 | 100.000 | 5988.9 |
2 | 350.019 | 9.434 | 86.922 | 77.964 | 77.917 | 77.941 | 95.026 | 100.000 | 6096.9 |
3 | 306.036 | 7.817 | 87.333 | 79.476 | 78.097 | 78.781 | 95.146 | 100.000 | 5308.6 |
4 | 272.160 | 6.333 | 87.579 | 80.036 | 79.174 | 79.603 | 95.198 | 100.000 | 5649.8 |
5 | 247.511 | 5.729 | 87.845 | 78.030 | 78.217 | 78.123 | 95.292 | 100.000 | 5796.3 |
6 | 226.625 | 5.190 | 87.824 | 78.652 | 78.935 | 78.793 | 95.363 | 100.000 | 5884.3 |
7 | 208.418 | 4.364 | 87.852 | 78.481 | 77.917 | 78.198 | 95.308 | 100.000 | 5853.0 |
8 | 191.116 | 3.985 | 88.132 | 77.718 | 77.858 | 77.788 | 95.212 | 100.000 | 5856.2 |
9 | 178.484 | 3.454 | 88.129 | 78.797 | 77.618 | 78.203 | 47.975 | 100.000 | 5853.8 |
The best model seems to be model 6, so I turn it into a package for convenience:
python -m spacy package path/to/model path/to/output/directory
Before I do that I manually fill out information in file meta.json that lies in path/to/model (name, version, description etc). The script will throw an exception if those fields are empty.
In order to install the package I now go to the package's directory and run:
python setup.py sdist
and
pip install /path/to/dist/name-of-model.tar.gz
I can now load the model from Python shell with:
spacy.load('name-of-model')
or evaluate it:
python -m spacy evaluate name-of-model path/to/test-file.json
which gives me the folllowing result:
Time 5.00 s
Words 30034
Words/s 6010
TOK 100.00
POS 94.40
UAS 87.28
LAS 84.46
NER P 70.85
NER R 70.34
NER F 70.59
I also added a custom pipeline component, EntityMatcher. Its code is based on this solution: https://support.prodi.gy/t/adding-custom-attribute-to-doc-having-ner-use-attribute/356/6
In order for it to work I had to add entity_matcher to the model's meta.json pipeline and to the language's factory (in __init__.py). The files that the component extracts data from are placed in a separate folder entity_matcher and their names contain one of the 4 labels used: PER, LOC, ORG or MISC.
EntityMatcher extracts data from various text files and labels them with an entity tag (PER, LOC, ORG or MISC). It enforces 'PROPN___' tag on all the entity tokens too (since all named entities are proper nouns). One can also check if the entity label comes from custom EntityMatcher or built-in EntityRecognizer by using the 'via_patterns' extension, which returns True in the first case and False in the other.
To add entities with a custom label check the guide here (using Matcher/PhraseMatcher that doesn't require context/sentences): https://spacy.io/usage/linguistic-features#adding-phrase-patterns
, especially the section Adding on_match rules and examples. Or here if your goal is to train EntityRecognizer with custom labels (requires whole sentences tagged with named entity label): https://spacy.io/usage/examples#section-training
Examples of use can be found in spacy_examples.py