Dependency parser implementation used by the Koç University team in the CoNLL 2017 shared task. Our team ranked 7th in the official results.
This document guides you through setting up a working copy of the dependency parser on your machine. The system has two parts: language modeling and dependency parsing. The most up-to-date version of the source can be found on the official repo.
We use text files tokenized by UDPipe, so please make sure you have installed it from its official repository.
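If you need to build UDPipe from source, a sketch along the following lines usually works (this is based on the layout of the UDPipe repository, not part of our own setup scripts):

# clone UDPipe and compile it; the Makefile lives in src/
git clone https://github.com/ufal/udpipe.git && cd udpipe/src && make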
Our entire software runs on Julia, so it should be installed on your system as well; it is available from the official download page. After the requirements are met, follow the installation instructions below.
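Once Julia is installed, you can verify that the julia binary is on your PATH with a quick version query:

julia -e 'println(VERSION)'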
Clone the repository to install the parser and dependencies:
git clone https://github.com/kirnap/ku-dependency-parser.git && cd ku-dependency-parser
julia installer.jl
Dependency parser related code is under the parser folder, and language model related code is under lm.
To train the parser on a specific language, you first need a pre-trained language model for generating context and word embeddings for that language. Here are the steps to train one:
Switch to the language model directory:
cd lm
If you do not have the raw text version of your .conllu formatted files, run the following to obtain tokenized raw text:
udpipe --output=horizontal none --outfile texts/{}.txt *.conllu
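For example, with hypothetical treebank files en-ud-train.conllu and en-ud-dev.conllu, the {} in --outfile is replaced by each input file's base name:

# hypothetical input files; this writes texts/en-ud-train.txt and texts/en-ud-dev.txt
udpipe --output=horizontal none --outfile texts/{}.txt en-ud-train.conllu en-ud-dev.conllu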
Create a vocabulary file from the text file tokenized by UDPipe (provided by the CoNLL 2017 task organizers). Note that the output file contains word-frequency information for the supplied text file:
julia wordcount.jl --textfile 'input text file' --output 'vocabulary-file'
Language model training expects the word list to contain no frequency information, so remove it with standard Linux tools:
awk '{$1="";print $0}' path/to/vocabulary-file > path/to/words-file
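To illustrate with made-up counts (assuming wordcount.jl writes the frequency before the word on each line), a vocabulary-file beginning

1436 the
729 of
612 and

yields a words-file containing only the words, one per line. Note that this awk idiom leaves a leading space on each line; if that matters downstream, awk '{print $2}' path/to/vocabulary-file > path/to/words-file produces the same words without it.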
Create a file containing the top N (e.g. 10000) words to be used during training:
head -n10000 path/to/words-file > path/to/vocab-file
To train the language model, run:
julia lm_train.jl --trainfile 'udpipe-output.txt' --vocabfile 'path/to/vocab-file' --wordsfile 'path/to/words-file' --savefile model.jld
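For example, with the hypothetical English files produced in the previous steps:

# hypothetical filenames; substitute your own paths
julia lm_train.jl --trainfile texts/en-ud-train.txt --vocabfile path/to/vocab-file --wordsfile path/to/words-file --savefile en_lm.jld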
Warning: language model training takes approximately 24 hours on a Tesla K80 GPU.
To train the parser, go back to the parent directory and run the following command:
julia main.jl --load '/path/to/pre-trained language model' --datafiles 'path/to/train_file.conllu' 'path/to/dev_file.conllu' --otrain 'number of epochs'
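For example (hypothetical paths, and an epoch count chosen arbitrarily):

# hypothetical filenames and epoch count; adjust to your own data
julia main.jl --load en_lm.jld --datafiles en-ud-train.conllu en-ud-dev.conllu --otrain 10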
For more detailed options, run:
julia main.jl --help
For more help, you are welcome to open an issue or contact okirnap@ku.edu.tr directly.