NeuralResponseRanking

This repository contains the implementation of the DMN/DMN-PRF/DMN-KD models proposed in the SIGIR'18 paper Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. The implementation is based on MatchZoo.

If you use this code for your paper, please cite it as:

Liu Yang, Minghui Qiu, Chen Qu, Jiafeng Guo, Yongfeng Zhang, W. Bruce Croft, Jun Huang, Haiqing Chen. Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2018).

BibTeX
 @inproceedings{InforSeek_Response_Ranking,
	author = {Yang, L. and Qiu, M. and Qu, C. and Guo, J. and Zhang, Y. and Croft, W. B. and Huang, J. and Chen, H.},
	title = {Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems},
	booktitle = {SIGIR '18},
	year = {2018},
}

Requirements

  • python 2.7
  • tensorflow 1.2+
  • keras 2.0.6+
  • nltk 3.2.2+
  • tqdm 4.19.4+
  • h5py 2.7.1+

We also recommend using a GPU (an NVIDIA TITAN X in our experiments) for training efficiency. In general, model training on the Ubuntu Dialog Corpus takes longer than on MSDialog due to its larger training data size.

Guide To Use

DMN


Data Preparation and Preprocessing

We take the Ubuntu Dialog Corpus as the example dataset to show how to prepare and preprocess the data to run the DMN model for experiments on response ranking in information-seeking conversations. You can easily adapt these instructions to other datasets such as MSDialog, or to other information-seeking conversation data in your lab or company.

  • Step 1: Download the data. You can download the Ubuntu Dialog Corpus (UDC) data from this Dropbox link; this version has been used in several previous papers as well as in our SIGIR'18 paper. The data contain 1M training instances, 500K validation instances, and 500K testing instances, which you need for the response ranking experiments. The data format is as follows:
label  \t   utterance_1   \t   utterance_2    \t     ......    \t    candidate_response

Each line corresponds to a conversation context/candidate response pair. Suppose there are n_i columns separated by tabs in the i-th line. The first column is a binary label indicating whether the candidate response is the positive candidate response returned by the agent or a sampled negative candidate response. The next (n_i - 2) columns are the utterances in the conversation context, including the current input utterance from the user. The last column is the candidate response.
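For concreteness, here is a minimal Python sketch (not part of the repository; the parse_line helper is hypothetical) showing how one line of this format could be read:

# Hypothetical helper, for illustration only: split one line of the
# tab-separated format into (label, context utterances, candidate response).
def parse_line(line):
    cols = line.rstrip('\n').split('\t')
    label = int(cols[0])   # 1 = positive response from the agent, 0 = sampled negative
    context = cols[1:-1]   # utterance_1 ... up to the current user utterance
    response = cols[-1]    # candidate response
    return label, context, response

with open('data/udc/ModelInput/train.txt') as f:
    for line in f:
        label, context, response = parse_line(line)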

After you have downloaded the files, put train.txt/valid.txt/test.txt under NeuralResponseRanking/data/udc/ModelInput/. The suggested directory structure under NeuralResponseRanking/data/ is as follows:

.
├── ms_v2
│   ├── ModelInput
│   │   ├── dmn_model_input
│   │   └── dmn_prf_model_input_body
│   └── ModelRes
└── udc
    ├── ModelInput
    │   ├── dmn_model_input
    │   └── dmn_prf_model_input_body
    └── ModelRes

where dmn_model_input stores the preprocessed input data for the DMN/DMN-KD models and dmn_prf_model_input_body stores the preprocessed input data for the DMN-PRF model. ms_v2 refers to the MSDialog data and udc to the Ubuntu Dialog Corpus data.
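If you are setting this up from scratch, a small sketch like the following (illustrative only, not a repository script) creates the suggested layout:

# Illustrative only: create the suggested layout under NeuralResponseRanking/data/.
import os

for dataset in ('ms_v2', 'udc'):
    for sub in ('ModelInput/dmn_model_input',
                'ModelInput/dmn_prf_model_input_body',
                'ModelRes'):
        path = os.path.join('data', dataset, sub)
        if not os.path.exists(path):   # Python 2.7's os.makedirs has no exist_ok flag
            os.makedirs(path)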

  • Step 2: Preprocess the data. You can run the following commands to preprocess the data. Preprocessing first transforms the data into the relation files and corpus files required by the MatchZoo toolkit. The script then performs word tokenization, word stemming, lowercasing, computing word statistics such as term frequency, filtering out words that appear fewer than 5 times in the whole corpus, mapping words to word indexes, and building a word dictionary.
cd matchzoo/conqa/
python preprocess_dmn.py udc

If you pass ms_v2 to preprocess_dmn.py instead, it will run the same preprocessing steps on the MSDialog data; you need to download that data in advance.
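The real logic lives in preprocess_dmn.py and MatchZoo's data preparation code; the sketch below only illustrates the vocabulary-building idea (tokenize, lowercase, drop words with corpus frequency below 5, map words to indexes), assuming NLTK for tokenization and omitting stemming:

# Rough, illustrative sketch of the vocabulary-building step.
from collections import Counter
from nltk.tokenize import word_tokenize

def build_word_dict(texts, min_count=5):
    counts = Counter(w.lower() for text in texts for w in word_tokenize(text))
    # Keep only words appearing at least min_count times in the whole corpus.
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return dict((w, i) for i, w in enumerate(kept))   # word -> index

def to_indexes(text, word_dict):
    # Map a text to word indexes, skipping words filtered out of the dictionary.
    return [word_dict[w.lower()] for w in word_tokenize(text) if w.lower() in word_dict]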

  • Step 3: Prepare the pre-trained word embedding files. As presented in our SIGIR'18 paper, we used Word2Vec to pre-train the word embeddings and then updated them during model training. We found the models achieved better performance with word embeddings pre-trained by Word2Vec than with GloVe pre-trained embeddings. We wrote a Python wrapper that calls the compiled Word2Vec toolkit to train word embeddings with different dimensions. You can run
cd matchzoo/conqa/
python gen_w2v_mikolov.py udc 0 dmn_model_input

It takes several minutes to train the word embedding file on the corpus file of the UDC data. After that, you need to filter the generated word embedding file by the words in the word dictionary file: we are only interested in word embeddings for terms left in the corpus after data preprocessing. To achieve this, you can simply run
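As a rough illustration of what that filtering amounts to (the filter_embeddings function and file handling below are assumptions for illustration, not the repository's actual script), it keeps only the embedding lines whose word appears in the word dictionary built during preprocessing:

# Illustrative sketch: keep only embeddings for words in the word dictionary.
def filter_embeddings(embed_path, vocab, out_path):
    with open(embed_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            # Note: a word2vec text file may begin with a "count dim" header line.
            word = line.split(' ', 1)[0]
            if word in vocab:       # vocab: the words kept after preprocessing
                fout.write(line)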