Research code for the COLING 2018 paper "Neural Transition-based String Transduction for Limited-Resource Setting in Morphology" by Peter Makarov and Simon Clematide
This code base allows for the reproduction of the results of our experiments. Note that there is a separate git repository that contains the test set results of all our reported systems, consistently named and organized, together with the respective gold standard data.
The paper and slides of the talk are in the repository as well.
Our environment when running the experiments:
- Debian 9
- python 2.7.13
- dynet v2.0.2
- a few external Python modules: docopt, progressbar, editdistance
All our experiments were run on CPU; training can take several hours for larger datasets.
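A minimal environment setup sketch, assuming virtualenv and pip are available (module names as listed above; the dynet pip package may need to be compiled for your platform):

# create and activate a Python 2.7 virtual environment
virtualenv -p python2.7 venv
source venv/bin/activate
# external modules used by the code
pip install docopt progressbar editdistance
# dynet v2.0.2, the version used in our experiments
pip install dynet==2.0.2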
Clone this repository including its submodules:

git clone --recursive https://github.com/ZurichNLP/coling2018-neural-transition-based-morphology.git
To set up the Chinese Restaurant Character Aligner binaries:
make setup
There is a configuration file for each dataset. The easiest way to run the experiments is to generate a shell script that contains all the commands for training the models (note that the launcher only prints these commands by default; it does not execute them).
All results are stored outside of this repository, under ../paper2018/results.
All models are automatically evaluated.
For the SIGMORPHON 2017 low-resource setting:

cd lib && python ./launch_dataset.py ../dataset_configs/sgm2017low_config.py > sgm2017low.sh ; bash -x sgm2017low.sh
The test set results for a language LANG and seed SEED can then be found in:

../paper2018/results/sgm2017low/x-haem-acls-psgm2017low-n200_1-w100_20_100_T-e60_15-oADADELTA_0-mmle-x/{LANG}_/s_{SEED}/f.beam4.test.predictions
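For example (hypothetical values for illustration; the available language names and seeds depend on the dataset configuration):

LANG=albanian
SEED=1
cat ../paper2018/results/sgm2017low/x-haem-acls-psgm2017low-n200_1-w100_20_100_T-e60_15-oADADELTA_0-mmle-x/${LANG}_/s_${SEED}/f.beam4.test.predictions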
The relevant experimental variables are encoded in the directory name below the short name of the dataset (e.g. sgm2017low) as follows:
- -haem = CA
- -hard = HA*
- -acls = LCS (longest common substring alignment)
- -acrp = CRP (Chinese restaurant process)
- -n200_1 = one hidden layer of dimension 200
- -e60_15 = 60 epochs with an early stopping patience of 15
- -oADADELTA = ADADELTA optimizer
- -mmle = maximum likelihood estimation (MLE)
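For example, the result directory x-haem-acls-psgm2017low-n200_1-w100_20_100_T-e60_15-oADADELTA_0-mmle-x from the path above encodes the CA model with LCS alignment, one hidden layer of dimension 200, 60 epochs with an early stopping patience of 15, the ADADELTA optimizer, and MLE training; the remaining components follow the same key/value scheme.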
The other datasets are launched in the same way. For the SIGMORPHON 2017 medium-resource setting:

cd lib && python launch_dataset.py ../dataset_configs/sgm2017medium_config.py > sgm2017medium.sh ; bash -x sgm2017medium.sh
For the SIGMORPHON 2016 dataset:

cd lib && python launch_dataset.py ../dataset_configs/sigmorphon2016_config.py > sgm2016low.sh ; bash -x sgm2016low.sh
For all CELEX tasks:

cd lib && python launch_dataset.py ../dataset_configs/celex_config.py > celexall.sh ; bash -x celexall.sh
For the individual CELEX tasks:

cd lib && for task in 2PIE 2PKE 13SIA rP ; do
    python launch_dataset.py ../dataset_configs/celex${task}_config.py > celexbytask${task}.sh ; bash -x celexbytask${task}.sh ;
done
The ensembling of models and their evaluation work analogously to the model launching:
cd lib && python launch_dataset_ensembling.py ../dataset_configs/sgm2017low_config.py
The exact definition of which results from which result directory went into which table of our paper (together with the corresponding test data in https://github.com/ZurichNLP/coling2018-neural-transition-based-morphology-test-data) can be found in the makefiles in https://github.com/ZurichNLP/coling2018-neural-transition-based-morphology/coling2018-datasets and in https://github.com/ZurichNLP/coling2018-neural-transition-based-morphology-test-data.
Feel free to contact simon.clematide@uzh.ch with questions related to the data or makarov@cl.uzh.ch with questions regarding the code.
We want to thank Tatyana Ruzsics for all her valuable input, code and inspiration!