KFTT is a morphosyntactic tagger for Polish and the winner of PolEval 2020 Task 2 (Morphosyntactic tagging of Middle, New and Modern Polish). It is the successor of KRNNT.
Model | Accuracy | Acc on known | Acc on ign | Acc on manual known |
---|---|---|---|---|
KFTT train+devel wo_morf | 95.63% | 95.95% | 81.91% | 67.30% |
KFTT train+devel | 95.73% | 96.07% | 81.02% | 67.81% |
KRNNT train | 92.07% | 93.00% | 48.63% |
KFTT train+devel accuracy on different parts of the test corpus:
Corpus | Period | Accuracy | Acc on known | Acc on ign | Acc on manual |
---|---|---|---|---|---|
KorBa, a corpus of 17th- and 18th-century texts | Middle | 94.35% | 94.83% | 79.43% | 73.87% |
A corpus of 19th-century texts | New | 96.94% | 97.15% | 83.24% | 78.39% |
1M subcorpus of the National Corpus of Polish (NKJP) | Modern | 97.37% | 97.48% | 87.78% | 84.07% |
Install a PyTorch build whose CUDA version is compatible with your GPU drivers:
pip install torch
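Before continuing, it can help to confirm that the installed PyTorch build actually sees the GPU. The following is a minimal sketch that relies only on the standard `torch.cuda.is_available()` function and the `torch.version.cuda` attribute; the helper name `cuda_status` is hypothetical and not part of this repository.

```python
def cuda_status():
    """Return a one-line description of the local PyTorch/CUDA setup."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version PyTorch was built against.
        return f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, GPU available"
    return f"PyTorch {torch.__version__}, no usable GPU (CPU only)"


if __name__ == "__main__":
    print(cuda_status())
```

If the output reports no usable GPU, the driver and the CUDA build of PyTorch probably do not match, and reinstalling PyTorch for the right CUDA version is the usual remedy.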
Install requirements:
pip install -r requirements.txt
Download test data:
NAME="test"
mkdir input_data
cd input_data
wget http://poleval.pl/task2/${NAME}-plain.tar.gz
tar -xf ${NAME}-plain.*
cd ..
Convert DAGs to JSONL format:
mkdir data
python3 dag_to_jsonl.py "input_data/${NAME}-plain/*" data/${NAME}-plain.jsonl poleval2020-${NAME}
Prepare data for tokenization:
python3 jsonl_to_tsv_segmentation_every_char.py data/${NAME}-plain.jsonl data/${NAME}-plain.segmentation.tsv.char
Download tokenization models:
mkdir models
cd models
wget https://github.com/kwrobel-nlp/kftt/releases/download/v0.1/ModelB+allF+CRF.pt
wget https://github.com/kwrobel-nlp/kftt/releases/download/v0.1/ModelB+CRF.pt
cd ..
Choose tokenization model:
MODEL_DIR="models/ModelB+allF+CRF.pt"
or (wo_morf):
MODEL_DIR="models/ModelB+CRF.pt"
Tokenize:
INPUT="data/${NAME}-plain.segmentation.tsv.char"
OUTPUT="${INPUT}.$(basename "$MODEL_DIR")"
python3 predict_segmentation.py "$MODEL_DIR" "$INPUT" "$OUTPUT"
# Tokenization: 14.21 seconds
Prepare data for tagging:
python3 test_data_tagging.py data/${NAME}-plain.jsonl $OUTPUT data/${NAME}-plain.segmentation.tsv.char ${OUTPUT}.tagging
echo ${OUTPUT}.tagging
# data/test-plain.segmentation.tsv.char.ModelB+allF+CRF.pt.tagging
mkdir test_data_temp
cp ${OUTPUT}.tagging test_data_temp/test.txt
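The intermediate file names are built by appending suffixes to one another, which is easy to lose track of. The sketch below mirrors the shell variables from the steps above in Python so the naming scheme can be inspected in one place. The function `pipeline_paths` and its dictionary keys are hypothetical names introduced for illustration; only the path patterns come from the commands in this README.

```python
import os


def pipeline_paths(name="test", model_dir="models/ModelB+allF+CRF.pt"):
    """Mirror the shell variables NAME, MODEL_DIR, INPUT and OUTPUT."""
    jsonl = f"data/{name}-plain.jsonl"
    segmentation_input = f"data/{name}-plain.segmentation.tsv.char"  # $INPUT
    # $OUTPUT is $INPUT plus the file name of the tokenization model.
    segmentation_output = f"{segmentation_input}.{os.path.basename(model_dir)}"
    return {
        "jsonl": jsonl,
        "input": segmentation_input,
        "output": segmentation_output,
        "tagging": segmentation_output + ".tagging",  # input for the tagger
        "tagged": segmentation_output + ".tagged",    # copy of the tagger output
    }


paths = pipeline_paths()
print(paths["tagging"])
# data/test-plain.segmentation.tsv.char.ModelB+allF+CRF.pt.tagging
```

Passing `model_dir="models/ModelB+CRF.pt"` gives the corresponding names for the wo_morf variant.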
Download model:
cd models
wget https://github.com/kwrobel-nlp/kftt/releases/download/v0.1/train+dev.zip
unzip train+dev.zip
cd ..
Tag:
export BERT_MODEL=models/train+dev
time python3 run_ner_predict.py --data_dir ./test_data_temp/ \
--model_type xlmroberta \
--labels models/train+dev/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $BERT_MODEL \
--max_seq_length 512 \
--per_gpu_eval_batch_size 1 \
--do_predict
# Tagging: 16.61 seconds
Copy results:
cp ${BERT_MODEL}/test_predictions_long.txt ${OUTPUT}.tagged
Join results and split to separate files:
mkdir predictions
python3 split_to_separate_files.py data/${NAME}-plain.jsonl ${OUTPUT}.tagged predictions --sep $'\t'
Evaluate (the gold-standard files are expected in input_data/${NAME}-disamb):
python3 poleval-eval.py predictions input_data/${NAME}-disamb 2>/dev/null
Results:
Poleval 2020 Task 2
Evaluating files in: predictions
against 252 gold standard files in: input_data/test-disamb
Accuracy (Your score!): 0.9572980397053314
Tokens total: 40045
Correct tokens: 38335
Unknown tokens: 901 (2.25%)
Correct unknown: 730
Accuracy on unknown: 0.8102108768035516
Known tokens: 39144 (97.75%)
Accuracy on known: 0.960683629675046
Manual tokens: 1482 (known 581 + ign 901)
Correct manual: 1124
Accuracy on manual: 0.7584345479082322
Accuracy manual known: 0.6781411359724613
Accuracy manual ign: 0.8102108768035516
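The reported accuracies follow directly from the token counts in this output. The sketch below recomputes them as a sanity check. It assumes that known and unknown (ign) tokens partition the test set, which the counts support (39144 + 901 = 40045), so the numbers of correct known tokens and correct manual known tokens are derived by subtraction rather than read from the log.

```python
def accuracy(correct, total):
    """Fraction of correctly tagged tokens."""
    return correct / total


# Counts copied from the evaluation output.
total, correct = 40045, 38335
ign_total, ign_correct = 901, 730
known_total = 39144
manual_total, manual_correct = 1482, 1124
manual_known_total = 581

# Derived counts (assumption: every token is either known or ign).
known_correct = correct - ign_correct                # 37605
manual_known_correct = manual_correct - ign_correct  # 394

print(f"overall:      {accuracy(correct, total):.4f}")
print(f"unknown:      {accuracy(ign_correct, ign_total):.4f}")
print(f"known:        {accuracy(known_correct, known_total):.4f}")
print(f"manual:       {accuracy(manual_correct, manual_total):.4f}")
print(f"manual known: {accuracy(manual_known_correct, manual_known_total):.4f}")
```

The five values round to 95.73%, 81.02%, 96.07%, 75.84% and 67.81%, matching the KFTT train+devel row of the first table where a corresponding column exists.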