Info added about the filter data mode
+ other small modifications
ricsinaruto authored Jun 26, 2018
1 parent d86856e commit 190cce5
Showing 1 changed file with 24 additions and 11 deletions.
README.md
# Seq2seqChatbots

This repository contains the code that was written for experiments described in [this](https://tdk.bme.hu/VIK/DownloadPaper/asdad) paper. Problem, hparams and model registrations were added to the [tensor2tensor](https://github.com/tensorflow/tensor2tensor) library in order to try out different datasets with the [Transformer](https://arxiv.org/abs/1706.03762) model for training dialog agents. The folders in the repository contain the following content:
* **docs**: Latex files and pictures required to generate the [paper](https://tdk.bme.hu/VIK/DownloadPaper/asdad).
* **t2t_csaky**: This folder contains all the code; a more detailed description can be found below.
* **decode_dir**: Here you can find inference outputs from the various trainings that were run.
* **wiki_images**: Contains images used for the [wiki](https://github.com/ricsinaruto/Seq2seqChatbots/wiki/Chatbot-and-Related-Research-Paper-Notes-with-Images), where I write about more than 100 publications related to chatbots.

## Quick Guide
In order to run something, you will have to call the [main](https://github.com/r
```
python t2t_csaky/main.py --mode=train
```
The mode flag can be one of the following four: *{[generate_data](https://github.com/ricsinaruto/Seq2seqChatbots/tree/master#generate-data), [filter data](https://github.com/ricsinaruto/Seq2seqChatbots/tree/master#filter-data), [train](https://github.com/ricsinaruto/Seq2seqChatbots/tree/master#train), [decode](https://github.com/ricsinaruto/Seq2seqChatbots/tree/master#decode)}*. Additionally, an *experiment* mode can be used, where you can specify what to do inside the *experiment* function of the *[run](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/utils/run.py)* file. A detailed explanation of what each mode does is given below. With version v1.1 the main and config files were introduced for a more streamlined experience, but if you want more freedom and want to use tensor2tensor commands directly, check the v1.0_README for the old way.
#### [Config](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/config.py)
You can control the flags and parameters of each mode directly in this file. Furthermore, for each run that you initiate, this file will be copied to the appropriate directory, so you can quickly access the parameters of any run. There are some flags that you have to set for every mode (the *FLAGS* dictionary in the config file):
* **t2t_usr_dir**: Path to the directory where my code resides. You don't have to change this, unless you rename the directory.
* **data_dir**: The path to the directory where you want to generate the source and target pairs, and other data. The dataset will be downloaded one level higher from this directory into a *raw_data* folder.
* **problem**: This is the name of a registered problem that tensor2tensor needs. Detailed in the *generate_data* section below.
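As a minimal sketch of what these mandatory entries might look like (key names come from the list above; the values and exact layout are illustrative assumptions, not the repository's defaults):

```python
# Hypothetical sketch of the FLAGS dictionary in t2t_csaky/config.py.
# Values are example placeholders, not the repository's defaults.
FLAGS = {
    "t2t_usr_dir": "t2t_csaky",          # directory containing the registered code
    "data_dir": "data_dir/DailyDialog",  # where source-target pairs are generated
    "problem": "daily_dialog_chatbot",   # name of a registered tensor2tensor problem
}
```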

### Generate Data
This mode will download and preprocess the data and generate source and target pairs. Currently there are 6 registered problems that you can use besides the ones given by tensor2tensor:
* *[persona_chat_chatbot](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/problems/persona_chat_chatbot.py)*: This problem implements the [Persona-Chat](https://arxiv.org/pdf/1801.07243.pdf) dataset (without the use of personas).
* *[daily_dialog_chatbot](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/problems/daily_dialog_chatbot.py)*: This problem implements the [DailyDialog](http://yanran.li/dailydialog.html) dataset (without the use of topics, dialog acts or emotions).
* *[opensubtitles_chatbot](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/problems/opensubtitles_chatbot.py)*: This problem can be used to work with the [OpenSubtitles](http://opus.nlpl.eu/OpenSubtitles2018.php) dataset.
The *PROBLEM_HPARAMS* dictionary in the config file contains problem-specific parameters:
* *dataset_split*: Specify a train-val-test split for the problem.
* *dataset_version*: This is only relevant to the opensubtitles dataset; since there are several versions of this dataset, you can specify the year of the dataset that you want to download.
* *name_vocab_size*: This is only relevant to the cornell problem with separate names. You can set the size of the vocabulary containing only the personas.
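A hypothetical sketch of these problem-specific entries (the value formats are assumptions for illustration; consult the config file for the real ones):

```python
# Hypothetical sketch of the PROBLEM_HPARAMS entries described above.
# Value formats are assumptions, not the repository's defaults.
PROBLEM_HPARAMS = {
    "dataset_split": {"train": 80, "val": 10, "test": 10},  # assumed percentage split
    "dataset_version": 2018,   # opensubtitles only: year of the dataset to download
    "name_vocab_size": 3000,   # cornell problem with separate names only
}
```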

### Filter Data
Run this mode if you want to filter a dataset based on entropy. Currently there are two working clustering methods:
* *[hash_jaccard](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/data_filtering/hash_jaccard.py)*: Cluster sentences based on the Jaccard similarity between them, using the [datasketch](https://github.com/ekzhu/datasketch) library.
* *[identity_clustering](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/data_filtering/identity_clustering.py)*: This is a very simple clustering method, where only sentences that are exactly the same (syntactically) fall into one cluster.
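To make the first method concrete, the Jaccard similarity underlying *hash_jaccard* compares the token sets of two sentences. The sketch below computes it exactly, without the MinHash approximation that the datasketch library provides:

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Jaccard similarity between the token sets of two sentences."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    # |intersection| / |union| of the two token sets.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

Sentences whose similarity exceeds some chosen level would fall into the same cluster; in these terms, *identity_clustering* corresponds to requiring a similarity of exactly 1.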

The *DATA_FILTERING* dictionary in the config file contains the parameters for this mode, which you will have to set. Short explanation:
* *data_dir*: Specify the directory where the new dataset will be saved.
* *filter_problem*: Specify the name of the clustering method; it can be one of the above.
* *filter_type*: Whether to filter source, target, or both sides.
* *treshold*: The entropy threshold above which source-target pairs will get filtered.
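A hypothetical sketch of this dictionary (key names from the list above; the values are illustrative placeholders):

```python
# Hypothetical sketch of the DATA_FILTERING dictionary (values illustrative;
# the "treshold" key is spelled as in the list above).
DATA_FILTERING = {
    "data_dir": "data_dir/DailyDialog/filtered",  # where the filtered dataset is saved
    "filter_problem": "hash_jaccard",             # one of the clustering methods above
    "filter_type": "both",                        # filter "source", "target" or "both"
    "treshold": 1.0,                              # entropy threshold for filtering pairs
}
```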

You can see some results of the clustering/filtering methods in the *[filtering_visualization](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/scripts/filtering_visualization.ipynb)* jupyter notebook.

### Train
This mode allows you to train a model with the specified problem and hyperparameters. Currently there are two subclassed models with small modifications:
* *[roulette_transformer](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/models/roulette_transformer.py)*: The original Transformer model, now with a modified beam search, where roulette-wheel selection can be used to select among the top beams instead of argmax.
* *[gradient_checkpointed_seq2seq](https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/t2t_csaky/models/gradient_checkpointed_seq2seq.py)*: A small modification of the LSTM-based seq2seq model, so that custom hparams can be used entirely. Moreover, before calculating the softmax, the LSTM hidden units are projected to 2048 linear units as [here](https://arxiv.org/pdf/1506.05869.pdf). Finally, I tried to add [gradient checkpointing](https://github.com/openai/gradient-checkpointing) to this model, but it is currently taken out since it didn't give good results.
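Roulette-wheel selection itself can be sketched in a few lines: beam log-probabilities are turned into a distribution with a softmax, and a beam index is sampled proportionally to its probability instead of taking the argmax. This is a generic illustration of the technique, not the code of *roulette_transformer*:

```python
import math
import random

def roulette_select(beam_log_probs, rng=None):
    """Pick a beam index with probability proportional to softmax(log_probs)."""
    rng = rng or random.Random()
    # Softmax weights, with the usual max-subtraction for numerical stability.
    top = max(beam_log_probs)
    weights = [math.exp(s - top) for s in beam_log_probs]
    total = sum(weights)
    # Spin the wheel: walk the cumulative weights until they pass r.
    r = rng.random() * total
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if r <= cumulative:
            return index
    return len(weights) - 1  # guard against floating-point rounding
```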

There are several additional flags that you can specify for a training run in the *FLAGS* dictionary in the config file, some of which are:
* *train_dir*: Name of the directory where the training checkpoint files will be saved.
With this mode you can decode from the trained models. The following parameters can be set:
* *beam_size*: Size of the beam, when using beam search.
* *return_beams*: If False return only the top beam, otherwise return *beam_size* number of beams.
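As a minimal illustration of these two settings (the container name `DECODE_PARAMS` and the values are assumptions for the sketch; the parameter names come from the list above):

```python
# Hypothetical sketch of the decode settings (names from the list above,
# values illustrative).
DECODE_PARAMS = {
    "beam_size": 10,        # width of the beam when using beam search
    "return_beams": False,  # False: only the top beam; True: beam_size beams
}
```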

Also, for all 4 training examples given below, checkpoint files are uploaded [here](https://mega.nz/#!bckTiS6Z!3CJxsl4AyR1W6eUnJ6Viq_cKMhhMh82cFlmA9xbotpo) so you can try them out without needing to train. However, these only work with tensor2tensor version 1.2.1 and v0.9 of this repository.

### Sample conversations from the various trainings
S2S is a baseline seq2seq model from [this](https://arxiv.org/pdf/1506.05869.pdf) paper, Cornell is the Transformer model trained on Cornell data, and Cornell S is similar but trained with speaker-addressee annotations. OpenSubtitles is the Transformer trained on OpenSubtitles data, and OpenSubtitles F is the previous training finetuned (further trained) on Cornell speaker-annotated data.
<a><img src="https://github.com/ricsinaruto/Seq2seqChatbots/blob/master/docs/deep_learning_based_chatbot_models/pics/general_questions.png" align="top" height="550" ></a>

##### If you require any help with running the code, or if you want the files of the trained models, write to this e-mail address: ricsinaruto@hotmail.com
