OpenNLP model generator computes models for Apache OpenNLP from Universal Dependencies annotated language files. OpenNLP supports natural language processing with tools such as a sentence detector, tokenizer, part-of-speech tagger, and lemmatizer. However, models for many languages are not readily available. This project makes it possible to train and evaluate models for any language covered by the Universal Dependencies treebank.
Pre-trained models for various languages are available on the model page.
For now, models for the following languages are computed automatically:
- cs - czech
- da - danish
- de - german
- el - greek
- en - english
- es - spanish
- fi - finnish
- fr - french
- he - hebrew
- it - italian
- ja - japanese
- ko - korean
- no - norwegian
- pl - polish
- pt - portuguese
- ru - russian
- sv - swedish
- uk - ukrainian
- zh - chinese
Pre-trained models can be used in Solr analyzers. An example analyzer chain is presented below:
```xml
<analyzer>
  <!-- tokenizer -->
  <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="xy-sentence-detector.onlpm" tokenizerModel="xy-tokenizer.onlpm"/>
  <!-- helper filters, optional -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <!-- part of speech tagging -->
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="xy-pos-tagger.onlpm"/>
  <!-- lemmatizer -->
  <filter class="solr.OpenNLPLemmatizerFilterFactory" lemmatizerModel="xy-lemmatizer.onlpm"/>
  <!-- other necessary filters TypeTokenFilterFactory, TypeAsPayloadFilterFactory, SynonymGraphFilterFactory etc -->
</analyzer>
```
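The `xy` prefix in the model file names stands for the language code of the chosen model. The model files are typically placed in the collection's configset (next to the schema), so that Solr's resource loader can resolve the paths given in the analyzer definition.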
Here is a simple program to interactively verify OpenNLP models. The program waits for input text and, after analysis, prints each token with its part-of-speech tag and lemma to the console:
```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.file.Path;

import opennlp.tools.lemmatizer.Lemmatizer;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTagger;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class InteractiveModelVerifier {

    public static void main(String[] args) throws Exception {
        var modelDirectory = Path.of("<directory where *.onlpm models are stored>");
        var language = "<language>";
        verifyModels(modelDirectory, language);
    }

    private static void verifyModels(Path modelDirectory, String language) throws Exception {
        // Load the four models for the given language from the model directory
        var sentenceDetector = new SentenceDetectorME(new SentenceModel(modelDirectory.resolve(String.format("%s-sentence-detector.onlpm", language))));
        var tokenizer = new TokenizerME(new TokenizerModel(modelDirectory.resolve(String.format("%s-tokenizer.onlpm", language))));
        var posTagger = new POSTaggerME(new POSModel(modelDirectory.resolve(String.format("%s-pos-tagger.onlpm", language))));
        var lemmatizer = new LemmatizerME(new LemmatizerModel(modelDirectory.resolve(String.format("%s-lemmatizer.onlpm", language))));
        var reader = new BufferedReader(new InputStreamReader(System.in));
        while (true) {
            System.out.println("Enter text or 'q' to quit");
            var line = reader.readLine();
            if (line == null || "q".equalsIgnoreCase(line)) {
                break;
            }
            verifyModels(line, sentenceDetector, tokenizer, posTagger, lemmatizer);
        }
    }

    private static void verifyModels(String text, SentenceDetector sentenceDetector, Tokenizer tokenizer, POSTagger posTagger, Lemmatizer lemmatizer) {
        // Split the text into sentences, then tokenize, tag and lemmatize each sentence
        var sentences = sentenceDetector.sentDetect(text);
        for (var sentence : sentences) {
            System.out.println(sentence);
            var tokens = tokenizer.tokenize(sentence);
            var posTags = posTagger.tag(tokens);
            var lemmas = lemmatizer.lemmatize(tokens, posTags);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(String.format("%s\t%s\t%s", tokens[i], posTags[i], lemmas[i]));
            }
            System.out.println();
        }
    }
}
```
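The program only needs the opennlp-tools library on the classpath (the same library Solr's OpenNLP filters build on); the model directory and language placeholders have to be filled in before running it.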
The Universal Dependencies treebank consists of CoNLL-U files for many languages. A CoNLL-U file contains annotated sentences in a particular language; the annotations describe the tokens and, for every token, its part of speech and lemma. Possible POS tags are listed here.
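For illustration, a CoNLL-U file stores one token per line in tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with blank lines between sentences. A simplified, hypothetical excerpt looks roughly like this:

```
# text = The dog barks.
1	The	the	DET	DT	_	2	det	_	_
2	dog	dog	NOUN	NN	_	3	nsubj	_	_
3	barks	bark	VERB	VBZ	_	0	root	_	SpaceAfter=No
4	.	.	PUNCT	.	_	3	punct	_	_
```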
Models are trained on both the original text and normalized text (lowercased and folded to ASCII), so they should work for both variants. Such models may be used in Apache Solr or Elasticsearch, both of which support OpenNLP analyzers.
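As a rough sketch of what "normalized" means here (the exact folding rules are an assumption, not the project's code), the text could be lowercased and stripped of diacritics like this:

```java
import java.text.Normalizer;
import java.util.Locale;

public class TextNormalizer {

    // Lowercase the text and strip combining diacritical marks, e.g. "Café" -> "cafe".
    // This roughly corresponds to LowerCaseFilterFactory + ASCIIFoldingFilterFactory
    // in the analyzer chain above.
    static String normalize(String text) {
        var lowerCased = text.toLowerCase(Locale.ROOT);
        var decomposed = Normalizer.normalize(lowerCased, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```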
The process of training and evaluating models roughly consists of the following steps:
- Download the Universal Dependencies treebank (only if it does not exist locally or a newer version is available).
- Unpack the CoNLL-U files for a particular language.
- For every supported trainer (sentence-detector, tokenizer, pos-tagger, lemmatizer) perform the steps below. Training is performed only if a model does not exist or a newer CoNLL-U file is available.
- Read the sentences from the CoNLL-U file and concatenate the original sentences with the normalized sentences.
- Optional: try to fix the data (for example, for the 'de' language).
- Convert the sentences to a sample stream for the particular trainer (token sample stream, lemma sample stream, etc.).
- Train and evaluate the model. Several available algorithms are tried and evaluated, and only the best one is chosen (see the sketch after this list).
- Save the model and the evaluation report.
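A minimal sketch of the "try several algorithms, keep the best" step, shown for the tokenizer trainer. `trainBestModel` and the sample streams passed into it are illustrative assumptions, not the project's actual code; the algorithm names are the ones appearing in the evaluation tables below.

```java
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenizerEvaluator;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class TokenizerTrainingSketch {

    // Train one tokenizer model per algorithm, evaluate each on the held-out samples
    // and return the model with the highest F-measure.
    static TokenizerModel trainBestModel(String language,
                                         ObjectStream<TokenSample> trainingSamples,
                                         ObjectStream<TokenSample> evaluationSamples) throws Exception {
        TokenizerModel bestModel = null;
        double bestScore = -1.0;
        for (String algorithm : new String[] {"MAXENT", "MAXENT_QN", "PERCEPTRON", "NAIVEBAYES"}) {
            TrainingParameters params = TrainingParameters.defaultParams();
            params.put(TrainingParameters.ALGORITHM_PARAM, algorithm);
            trainingSamples.reset();
            TokenizerModel model = TokenizerME.train(
                    trainingSamples, new TokenizerFactory(language, null, false, null), params);

            TokenizerEvaluator evaluator = new TokenizerEvaluator(new TokenizerME(model));
            evaluationSamples.reset();
            evaluator.evaluate(evaluationSamples);
            double score = evaluator.getFMeasure().getFMeasure();
            if (score > bestScore) {
                bestScore = score;
                bestModel = model;
            }
        }
        return bestModel;
    }
}
```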
All files (inputs and generated models) are processed in the directory $HOME/.cache/opennlp-model-generator (or its subdirectories).
Models were trained for several different types of languages; the results of their evaluation are presented below. The available sentences are divided into training and evaluation sets: every 10th sentence goes to the evaluation set, and the remaining 90% of the sentences are used for training.
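The split amounts to something like the following illustrative sketch (not the project's actual code):

```java
import java.util.List;

public class TrainEvalSplit {

    // Every 10th sentence goes to the evaluation set, the remaining 90% to the training set.
    static <T> void split(List<T> sentences, List<T> training, List<T> evaluation) {
        for (int i = 0; i < sentences.size(); i++) {
            if (i % 10 == 9) {
                evaluation.add(sentences.get(i));
            } else {
                training.add(sentences.get(i));
            }
        }
    }
}
```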
- language: language code + language name
- training sentences: approximate number of training sentences
- model columns (sentence-detector, tokenizer, pos-tagger, lemmatizer): the training algorithm with the best evaluation score + the score itself (ranging from 0.0 to 1.0)
These languages use an alphabetic Latin script with native diacritic characters. Words are separated by whitespace.
language | training sentences | sentence-detector | tokenizer | pos-tagger | lemmatizer |
---|---|---|---|---|---|
de german | 65k | MAXENT_QN 0.72 | MAXENT_QN 0.99 | MAXENT 0.94 | MAXENT 0.96 |
en english | 35k | MAXENT_QN 0.74 | MAXENT_QN 0.99 | MAXENT 0.94 | MAXENT 0.98 |
es spanish | 30k | MAXENT_QN 0.96 | MAXENT_QN 0.99 | MAXENT 0.94 | MAXENT 0.98 |
fr french | 25k | MAXENT_QN 0.92 | MAXENT_QN 0.99 | MAXENT 0.95 | MAXENT 0.98 |
pl polish | 36k | MAXENT 0.95 | MAXENT_QN 0.99 | MAXENT 0.96 | MAXENT 0.96 |
Models generated for these languages are of good quality; such languages are supported very well. The sentence-detector score is relatively low because many sentences in the sample were not properly terminated.
These languages use alphabetic non-Latin scripts (Greek, Cyrillic). Words are separated by whitespace.
language | training sentences | sentence-detector | tokenizer | pos-tagger | lemmatizer |
---|---|---|---|---|---|
el greek | 2k | MAXENT_QN 0.90 | MAXENT_QN 0.99 | PERCEPTRON 0.95 | MAXENT 0.95 |
ru russian | 99k | MAXENT_QN 0.93 | MAXENT_QN 0.99 | MAXENT 0.96 | MAXENT 0.97 |
uk ukrainian | 6k | MAXENT 0.91 | PERCEPTRON 0.99 | MAXENT 0.94 | MAXENT 0.94 |
These types of languages are also well supported.
These languages are commonly written from right to left, and vowels are often omitted. Words are separated by whitespace.
language | training sentences | sentence-detector | tokenizer | pos-tagger | lemmatizer |
---|---|---|---|---|---|
ar arabic | 7k | MAXENT_QN 0.71 | MAXENT_QN 0.97 | MAXENT 0.93 | Serialization exception |
he hebrew | 8k | PERCEPTRON 0.94 | MAXENT_QN 0.92 | MAXENT 0.94 | MAXENT 0.96 |
The evaluation scores are a bit lower for these languages. Lemmatizer model training for Arabic fails: the computed model cannot be serialized, and the reason is unknown.
These languages use logographic or syllabic scripts. Words are usually not separated by whitespace, which causes problems with tokenization.
language | training sentences | sentence-detector | tokenizer | pos-tagger | lemmatizer |
---|---|---|---|---|---|
ja japanese | 16k | MAXENT_QN 0.96 | NAIVEBAYES 0.79 | PERCEPTRON 0.96 | MAXENT 0.97 |
ko korean | 30k | MAXENT_QN 0.94 | MAXENT_QN 0.99 | MAXENT 0.89 | MAXENT 0.90 |
zh chinese | 9k | MAXENT 0.98 | MAXENT_QN 0.91 | PERCEPTRON 0.94 | MAXENT 0.99 |
The results here are less impressive. Tokenization quality for Japanese is quite low; the tokenizer does not seem to handle such languages well. If the tokenizer had a dictionary of known words, the trained model would probably be better. POS tagging and lemmatization for Korean are also not good. Chinese tokenization quality is higher than for Japanese: Chinese words are shorter than Japanese words, so the surrounding context is shorter, which may explain why the tokenizer segments Chinese words better than Japanese words.