GitHub - hengluchang/Quora-Paraphrase-Question-Identification: Paraphrase question identification using Feature Fusion Network (FFN).

Paraphrase Question Identification using Feature Fusion Network

Identify question pairs that have the same meaning. Feature Fusion Network takes advantage of learning rich features not just from sentence representations but also from hand craft features.

For more detailed information, please see our project research paper: Paraphrase Question Identification using Feature Fusion Network.

Model architecture

Results

0.895 testing accuracy for FFN (train for 100 epoch)

Requirements

Python 3.5 for running FFN
Python 2.7 for running Random Forest (RF) baseline

Package dependencies

RF baseline

scikit-learn 0.18
nltk
pandas

FFN

numpy 1.11
matplotlib 1.5
Keras 1.2
scikit-learn 0.18
h5py 2.6
hdf5 1.8
TensorFlow 0.10

How to run

$ git clone https://github.com/hengluchang/Quora-Paraphrase-Question-Identification

Run Random Forest baseline

create a folder named "dataset".

$ cd Quora-Paraphrase-Question-Identification
$ mkdir -p dataset

Go to Kaggle Quora Question Pairs website and download train.csv.zip and test.csv.zip and unzip both. Place the train.csv and test.csv under /dataset directory.
Create 10 Hand crafted features (HCFs). This will create train_10features.csv and test_10features.csv under /dataset directory.

$ cd ..
$ python feature_gen.py ../dataset/train.csv ../dataset/test.csv

Run Random Forest baseline on these 10 HCFs, this will give you ~ 0.84 testing accuracy.

$ python run_baseline.py ../dataset/train_10features.csv

Run Feature Fusion Network (FFN)

Download the required data here(Google Drive link) to the directory you clone
Train FFN w/o HCF

$ pyhon3 train_noHCF.py -i <QUESTION_PAIRS_FILE> -t <TEST_QUESTION_PAIRS_FILE> -g <GLOVE_FILE> -w <MODEL_WEIGHTS_FILE> -e <WORD_EMBEDDING_MATRIX_FILE> -n <NB_WORDS_DATA_FILE>

For instance:

$ python3 train_noHCF.py -i train_rebalanced.csv -t test.csv -g glove.840B.300d.txt -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.h5  -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json

Train FFN

$ python3 train_HCF.py -i <QUESTION_PAIRS_FILE> -t <TEST_QUESTION_PAIRS_FILE> -f <HCF_FILE> -g <GLOVE_FILE> -w <MODEL_WEIGHTS_FILE> -e <WORD_EMBEDDING_MATRIX_FILE> -n <NB_WORDS_DATA_FILE>

For instance:

$ python3 train_HCF.py -i train_rebalanced.csv -t test.csv -f train_rebalanced_10features.csv -g glove.840B.300d.txt -w question_pairs_weights_100epoch_test10_val20_dropout20_sumOP_noAVG_HCF_rebalanced.h5  -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json

Test FFN w/o HCF

$ python3 test_noHCF.py -i <QUESTION_PAIRS_FILE> -o <RESULT_FILE> -e <WORD_EMBEDDING_MATRIX_FILE> -n <NB_WORDS_DATA_FILE> -w <MODEL_WEIGHTS_FILE>

For instance:

$ python3 test_noHCF.py -i test.csv  -o result_question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.csv -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_rebalanced.h5

Test FFN

$ python3 test_HCF.py -i <QUESTION_PAIRS_FILE> -o <RESULT_FILE> -e <WORD_EMBEDDING_MATRIX_FILE> -n <NB_WORDS_DATA_FILE> -w <MODEL_WEIGHTS_FILE>

For instance:

$ python3 test_sum_HCF.py -i test.csv -f -test_10features.csv -o result_question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_HCF_rebalanced.csv -e word_embedding_matrix_trainANDtest_rebalanced.npy -n nb_words_trainANDtest_rebalanced.json -w question_pairs_weights_100epoch_test10_val10_dropout20_sumOP_noAVG_HCF_rebalanced.h5

Reference

Keras model to identify Quora question pairs: borrowed most of the Deep Neural Network script

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
.gitignore		.gitignore
FFN_architecture.jpg		FFN_architecture.jpg
README.md		README.md
feature_gen.py		feature_gen.py
mymodel.py		mymodel.py
paraphrase-question-identification.pdf		paraphrase-question-identification.pdf
run_baseline.py		run_baseline.py
test_FFN.py		test_FFN.py
test_FFN_noHCF.py		test_FFN_noHCF.py
train_FFN.py		train_FFN.py
train_FFN_noHCF.py		train_FFN_noHCF.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Paraphrase Question Identification using Feature Fusion Network

Model architecture

Results

Requirements

Package dependencies

RF baseline

FFN

How to run

Run Random Forest baseline

Run Feature Fusion Network (FFN)

Reference

About

Releases

Packages

Contributors 2

Languages

hengluchang/Quora-Paraphrase-Question-Identification

Folders and files

Latest commit

History

Repository files navigation

Paraphrase Question Identification using Feature Fusion Network

Model architecture

Results

Requirements

Package dependencies

RF baseline

FFN

How to run

Run Random Forest baseline

Run Feature Fusion Network (FFN)

Reference

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages