This is an implementation of the DRESS (Deep REinforcement Sentence Simplification) model described in Sentence Simplification with Deep Reinforcement Learning.
The wikismall and wikilarge datasets can be downloaded from GitHub or from Google Drive.
The 8-reference wikilarge development and test sets can be downloaded from https://github.com/cocoxu/simplification/tree/master/data/turkcorpus
Copyright of the newsela dataset belongs to https://newsela.com. Please contact newsela.com to obtain the dataset (https://newsela.com/data/).
If you only need system output and do not want to install dependencies and train a model (or run a pre-trained model), the all-system-output folder is for you.
Note that this model was tested with an old version of torch (available here).
CUDA_VISIBLE_DEVICES=$ID th train.lua --learnZ --useGPU \
--model EncDecAWE \
--attention dot \
--seqLen 85 \
--freqCut 4 \
--nhid 256 \
--nin 300 \
--nlayers 2 \
--dropout 0.2 \
--lr $lr \
--valid $valid \
--test $test \
--optimMethod Adam \
--save $model \
--train $train \
--validout $validout --testout $testout \
--batchSize 32 \
--validBatchSize 32 \
--maxEpoch 30 \
--wordEmbedding $wembed \
--embedOption fineTune \
--fineTuneFactor 0 \
| tee $log
For details, see experiments/wikilarge/encdeca/train.sh. Note that for the newsela and wikismall datasets, you should use --freqCut 3.
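As a rough illustration of what a frequency cutoff does to the vocabulary, the sketch below assumes that words seen fewer than `freq_cut` times in the training data are mapped to an unknown-word token. The exact comparison that `--freqCut` applies in train.lua is not stated here, and the names `build_vocab`, `apply_vocab`, and `UNK` are hypothetical.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical unknown-word token, not taken from the repo


def build_vocab(sentences, freq_cut):
    """Keep words that occur at least freq_cut times (assumed semantics)."""
    counts = Counter(word for sent in sentences for word in sent.split())
    return {word for word, n in counts.items() if n >= freq_cut}


def apply_vocab(sentence, vocab):
    """Replace every out-of-vocabulary word with UNK."""
    return " ".join(w if w in vocab else UNK for w in sentence.split())


corpus = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_vocab(corpus, freq_cut=2)
print(apply_vocab("the dog ran", vocab))
```

A lower cutoff keeps more rare words in the vocabulary, which is presumably why the smaller datasets use a lower value.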
If you want to generate simplifications from a pre-trained Encoder-Decoder Attention model, use the following command:
CUDA_VISIBLE_DEVICES=3 th generate_pipeline.lua \
--modelPath $model \
--dataPath $data.test \
--outPathRaw $output.test \
--oriDataPath $oridata.test \
--oriMapPath $orimap | tee $log.test
For details, see experiments/wikilarge/encdeca/generate/run_std.sh.
To train the language model, see experiments/wikilarge/dress/train_lm.sh.
To create the dataset for the auto-encoder, use scripts/get_auto_encoder_data/gen_data.sh.
To train the auto-encoder, see experiments/wikilarge/dress/train_auto_encoder.sh.
To train DRESS, see experiments/wikilarge/dress/train_dress.sh. Run a pre-trained DRESS model using the script experiments/wikilarge/dress/generate/dress/run_std.sh.
To train a lexical simplification model, you need to obtain soft word alignments in the training data, which are assigned by a pre-trained Encoder-Decoder Attention model. See details in experiments/wikilarge/dress/run_align.sh.
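To make "soft word alignments" concrete, the sketch below treats each row of an attention matrix as a set of weights over the source words for one target word. It only illustrates the idea and is not the code behind run_align.sh; the function names, the toy attention matrix, and the data layout are all assumptions.

```python
def soft_alignments(source, target, attention):
    """Pair each target word with its attention weights over source words.

    attention[t][s] is the weight placed on source word s when target
    word t was generated (assumed layout; each row sums to about 1).
    """
    aligned = []
    for t, tgt_word in enumerate(target):
        row = attention[t]
        if len(row) != len(source):
            raise ValueError("attention row length must match source length")
        aligned.append((tgt_word, list(zip(source, row))))
    return aligned


def best_source_word(weights):
    """Collapse a soft alignment to the single most-attended source word."""
    return max(weights, key=lambda pair: pair[1])[0]


source = ["he", "purchased", "a", "vehicle"]
target = ["he", "bought", "a", "car"]
attention = [  # toy values, invented for illustration
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.80, 0.05, 0.10],
    [0.10, 0.05, 0.80, 0.05],
    [0.05, 0.10, 0.05, 0.80],
]
for word, weights in soft_alignments(source, target, attention):
    print(word, "->", best_source_word(weights))
```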
After you obtain the alignments, you can train a lexical simplification model using experiments/wikilarge/dress/train_lexical_simp.sh.
Lastly, you can apply the lexical simplification model with DRESS using experiments/wikilarge/dress/generate/dress-ls/run_std.sh.
https://drive.google.com/open?id=0B6-YKFW-MnbOTVRMSURFbXYxNjg
Please be careful about the automatic evaluation.
You can use our released code and models to produce output for different models (i.e., EncDecA, Dress and Dress-Ls), but please make sure your evaluation settings follow those in our paper.
The evaluation pipeline included in our released code produces single-reference BLEU scores.
To be consistent with previous work, you should use the 8-reference wikilarge test set (available at https://github.com/cocoxu/simplification/tree/master/data/turkcorpus).
Therefore, to get the numbers on wikilarge, you should use scripts that support multi-reference BLEU evaluation (e.g., joshua or mtevalv13a.pl).
Check out the details of BLEU evaluation on wikilarge here.
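To show why single-reference and multi-reference BLEU are not comparable, here is a minimal sketch of corpus-level BLEU in which each n-gram count is clipped against the maximum count found in any reference. It assumes whitespace tokenization and no smoothing, so its numbers will not match joshua or mtevalv13a.pl; use those tools for reported results. The function names and the toy sentences are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU; references[i] is a list of reference strings."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_tok = hyp.split()
        refs_tok = [r.split() for r in refs]
        hyp_len += len(hyp_tok)
        # Use the reference length closest to the hypothesis (shorter on ties).
        ref_len += min((abs(len(r) - len(hyp_tok)), len(r)) for r in refs_tok)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            max_ref = Counter()
            for r in refs_tok:
                max_ref |= ngrams(r, n)  # element-wise maximum over references
            matches[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # unsmoothed: any zero precision gives zero BLEU
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_prec)


hyp = ["the cat sat on the mat"]
one_ref = [["the cat is on the mat"]]
two_refs = [["the cat is on the mat", "the cat sat on the rug"]]
print(corpus_bleu(hyp, one_ref), corpus_bleu(hyp, two_refs))
```

The same system output scores higher against two references than against one, so scores computed under different reference settings should not be compared directly.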
Make sure your FKGL is computed at the corpus level.
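The difference between corpus-level and averaged sentence-level FKGL can be sketched as follows. The formula is the standard Flesch-Kincaid grade level, 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The syllable counter is a crude vowel-group heuristic, and the sketch assumes one non-empty sentence per output line. Both are assumptions for illustration, not the evaluation code used in the paper.

```python
import re


def count_syllables(word):
    """Rough heuristic: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(n_words, n_sents, n_syll):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * n_words / n_sents + 11.8 * n_syll / n_words - 15.59


def corpus_fkgl(sentences):
    """One FKGL score from counts pooled over the whole corpus."""
    words = [w for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
    syllables = sum(count_syllables(w) for w in words)
    return fkgl(len(words), len(sentences), syllables)


def mean_sentence_fkgl(sentences):
    """Average of per-sentence scores; generally differs from corpus_fkgl."""
    scores = []
    for s in sentences:
        words = re.findall(r"[A-Za-z]+", s)
        scores.append(fkgl(len(words), 1, sum(count_syllables(w) for w in words)))
    return sum(scores) / len(scores)


outputs = [
    "The cat sat .",
    "Researchers investigated complicated phenomena yesterday .",
]
print(corpus_fkgl(outputs), mean_sentence_fkgl(outputs))
```

The two numbers differ because the syllables-per-word ratio is pooled over all words in one case and averaged per sentence in the other.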
The evaluation pipeline included in our released code produces sentence-level SARI scores. You can use this simplification system (available here) to produce corpus-level SARI scores.
Check out the details of SARI evaluation here.
@InProceedings{D17-1063,
author = "Zhang, Xingxing
and Lapata, Mirella",
title = "Sentence Simplification with Deep Reinforcement Learning",
booktitle = "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing",
year = "2017",
publisher = "Association for Computational Linguistics",
pages = "595--605",
location = "Copenhagen, Denmark",
url = "http://aclweb.org/anthology/D17-1063"
}