GitHub - PeterisP/LVTagger

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 246 Commits
.hooks		.hooks
MorphoCRF		MorphoCRF
NERdicts		NERdicts
NerTrainingData		NerTrainingData
distsim		distsim
docs		docs
morphomodel		morphomodel
nermodel		nermodel
src		src
.gitignore		.gitignore
LETA_lemma_freqs.txt		LETA_lemma_freqs.txt
LICENSE.txt		LICENSE.txt
NER eksperimenti 2013.txt		NER eksperimenti 2013.txt
README.txt		README.txt
Stanford README.txt		Stanford README.txt
init.sh		init.sh
lv-ner-tag.prop		lv-ner-tag.prop
lv-ner-train.prop		lv-ner-train.prop
morpho_test.sh		morpho_test.sh
morpho_train.sh		morpho_train.sh
morphotagger.bat		morphotagger.bat
morphotagger.sh		morphotagger.sh
ner_eksperimenti		ner_eksperimenti
ner_pipe.bat		ner_pipe.bat
ner_pipe.sh		ner_pipe.sh
ner_stats.bat		ner_stats.bat
ner_test.sh		ner_test.sh
ner_test.tab		ner_test.tab
ner_train.sh		ner_train.sh
ner_train.tab		ner_train.tab
ner_train_all.bat		ner_train_all.bat
nertagger.sh		nertagger.sh
pom.xml		pom.xml
sample.conll		sample.conll
sample.txt		sample.txt
sample_ner.conll		sample_ner.conll
tagFolders.sh		tagFolders.sh
test_file.conll		test_file.conll
test_file.tagged.txt		test_file.tagged.txt
test_file.txt		test_file.txt

Repository files navigation

LV Tagger
----------------------------------------------

Latvian morphological tagger (analysis + disambiguation) and named entity recognition module.

(c) 2012 University of Latvia, Institute of Mathematics and Computer science
Machine learning code based on Stanford NLP kit at http://nlp.stanford.edu/software/

LICENSE

The software is licensed under the full GPL. Please see the file LICENCE.txt
The included text corpus data, books, newspaper extracts and dictionaries are copyrighted by their respective authors, and are available for research purposes only.

CONTACT

For information, bug reports, and any problems, contact:
Pēteris Paikens
PeterisP@gmail.com

USAGE

Build using maven or download from oss.sonatype.org (https://search.maven.org/remotecontent?filepath=lv/ailab/morphology/tagger/2.1.0/tagger-2.1.0-jar-with-dependencies.jar)
Run the included morphotagger.sh script for tokenization and morphological tagging of plaintext data.
Input one sentence or paragraph per line; end processing with an empty line.
Output format is JSON by default, or tab-delimited columns of token-tag-lemma when run as './pipetool.sh -vert'.
./morphotagger.sh --help
for information on other input/output formats.
All input and output should be in UTF-8.

File interaction with standard UNIX stdin/stdout, i.e.
./morphotagger.sh <inputfile.txt >outputfile.txt

JSON example: [{"Vārds":"es","Marķējums":"pp10snn","Pamatforma":"es"}, {"Vārds":"roku","Marķējums":"vmnipt11san","Pamatforma":"rakt"}, {"Vārds":"roku","Marķējums":"ncfsa4","Pamatforma":"roka"}]
'Vārds' - exact surface form of the token; 'Marķējums' - morphological tag; first letter is part of speech, and each remaining letter denotes a separate morphological feature; and 'Pamatforma' contains the identified lemma.

For named entity tagging with a pre-trained model, run the included nertagger.sh script for a basic three-category (person, organization, location) tagging of data that is previously annotated with morphological features (see test_file.txt for an example)

SYSTEM REQUIREMENTS

JRE 1.7+ should be installed and available on path.
~2gb memory should be available.

PERFORMANCE ESTIMATES

Performance will vary depending on text content, genre and quality.
For morphological tagging in current test data we are seeing accuracy of 92.8% for the full morphological tag/lemma, and 98.2% for the part of speech disambiguation.

COMPILING FROM SOURCE

Use maven to compile and download dependencies (morphological word analysis module from https://github.com/PeterisP/morphology and the pretrained postagging model).

DOWNLOADS

https://search.maven.org/artifact/lv.ailab.morphology/tagger/

REFERENCES

Properties of Latvian morphological tagger are published in http://www.ep.liu.se/ecp_article/index.en.aspx?issue=085;article=024
The named entity recognition system was initially described at http://www.booksonline.iospress.nl/Content/View.aspx?piid=32333