PyTorch implementation of the experiments described in "Conditioned Text Generation with Transfer for Closed-Domain Dialogue Systems" by S. d'Ascoli, A. Coucke, F. Caltagirone, A. Caulier, and M. Lelarge, accepted for publication at the 8th International Conference on Statistical Language and Speech Processing (2020). This is a work in progress; feel free to reach out with any questions.
Requirements: Python 3.6, pip

```bash
virtualenv venv
. venv/bin/activate
pip install -e .
```
You might need to download some NLTK resources:

```python
>>> import nltk
>>> nltk.download('punkt')
```
The abstract class in `automatic_data_generation/data/base_dataset.py` provides the interface for representing a training dataset. To implement a new dataset format, write a class inheriting from `Dataset` and implement its abstract methods. You then need to allow for a new `dataset_type` in the training script `automatic_data_generation/train_and_eval_cvae.py`, which should be the name of the sub-directory in your data folder. Finally, you should update the dataset factory `create_dataset` in `automatic_data_generation/utils/utils.py`.
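A minimal sketch of what such a subclass might look like is given below. The abstract method names (`get_sentences`, `get_intents`) and the CSV layout are assumptions for illustration only; check `automatic_data_generation/data/base_dataset.py` for the actual interface to implement.

```python
import csv

from automatic_data_generation.data.base_dataset import Dataset


class MyDataset(Dataset):
    """Hypothetical dataset format: a CSV with one utterance and one
    intent label per row. Method names below are placeholders; the real
    abstract methods are defined in base_dataset.py."""

    def get_sentences(self, csv_path):
        # First column assumed to hold the utterance text
        with open(csv_path) as f:
            return [row[0] for row in csv.reader(f)]

    def get_intents(self, csv_path):
        # Second column assumed to hold the intent label
        with open(csv_path) as f:
            return [row[1] for row in csv.reader(f)]
```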
The reservoir dataset of unannotated queries used for the transfer experiments in the paper is not publicly available. To explore the query transfer method, you need to add your own None sentences in CSV format in a sub-directory of your data folder.
You first need to download the InferSent model by running the following executable:

```bash
./automatic_data_generation/data/get_infersent.sh
```
To embed your None sentences, run the following command:

```bash
python automatic_data_generation/data/utils/embed_intents.py --dataset_path ./your/none/data/path
```
You then need to allow for a new `none_type` in the training script `automatic_data_generation/train_and_eval_cvae.py`, which should be the name of the sub-directory in your data folder. Add the index of the utterance column in the CSV file to the `NONE_COLUMN_MAPPING` dictionary in `automatic_data_generation/data/utils/utils.py`. If the utterance is the first (resp. n-th) field of the CSV, add 0 (resp. n-1).
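For example, a hypothetical entry might look like the sketch below. The key `mynonetype` and the existing entries are illustrative assumptions; only the dictionary name `NONE_COLUMN_MAPPING` comes from the codebase.

```python
# In automatic_data_generation/data/utils/utils.py (sketch):
# if the utterance is the third column of your None CSV, map your
# none_type to index 2.
NONE_COLUMN_MAPPING = {
    # ... existing entries ...
    'mynonetype': 2,
}
```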
You should be good to go.
Use the script `automatic_data_generation/train_and_eval_cvae.py` to train a model, generate sentences, and evaluate their quality. For a simple run without query transfer, you may run:

```bash
python automatic_data_generation/train_and_eval_cvae.py -ep 10 --n-generated 100 --dataset-size 125
```
Possible options are:

- `--dataset-size`: number of sentences in the training dataset
- `--none-size`: number of None sentences to be added to the training dataset
- `--none-type`: type of None sentences
- `--restrict-to-intent`: list of intents to filter on for training
- `--n-epochs`: number of epochs for training
- `--n-generated`: number of generated sentences
- `--infersent-selection`: possible query transfer schemes; `unsupervised` is the normal scheme, `supervised` is the pseudo-labelling baseline, and `NO_INFERSENT_SELECTION` deactivates the feature
- `--cosine-threshold`: the selection threshold for query transfer (defaults to 0.9)
- `--alpha`: the parameter regulating transfer
If you have added your own None type, a typical run may be:

```bash
python automatic_data_generation/train_and_eval_cvae.py -ep 50 --n-generated 2000 --dataset-size 125 --none-size 125 --none-type mynonetype --infersent-selection unsupervised --cosine-threshold 0.9 --alpha 0.1
```
A folder will be created with the following elements:

- `load`: a folder with a `model.pth` file, its associated `config.json`, and a `vocab.pth` file containing the vocabulary
- `tensorboard`: a folder with the checkpoints for TensorBoard
- `run.pkl`: a dictionary with every runtime parameter
- `train_*.csv`: the training dataset
- `train_*_augmented.csv`: the training dataset augmented with generated sentences
- `validate_*.csv`: the validation dataset
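The sketch below shows one way to inspect these outputs after a run. The output folder name and the CSV column layout are assumptions, not part of the repository; only the file names `run.pkl` and `train_*_augmented.csv` come from the list above.

```python
import csv
import pickle
from pathlib import Path

# Hypothetical output folder; replace with your actual run directory
output_dir = Path('output/my_run')

# Runtime parameters saved by the training script
with open(output_dir / 'run.pkl', 'rb') as f:
    run_params = pickle.load(f)
print(run_params)

# Peek at the training dataset augmented with generated sentences
augmented = next(output_dir.glob('train_*_augmented.csv'))
with open(augmented) as f:
    for row in csv.reader(f):
        print(row)
        break  # only print the first row
```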