Text Siamese Network provides a CNN-based implementation of a Siamese network to solve the Quora duplicate question identification problem. The Quora question pair dataset has ~400k question pairs, each with a binary label stating whether the two questions are similar or dissimilar. The Siamese network tries to capture the semantic similarity between questions.
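To make the idea concrete, here is a minimal sketch of the Siamese principle: both questions pass through the same encoder, and the two resulting vectors are compared. It is not the CNN used in this repository. The hashed `token_vector`, the averaging `encode`, and the `DIM` size are toy stand-ins invented for illustration; a trained model would replace them with learned FastText embeddings and convolutional layers.

```python
import hashlib
import math

DIM = 16  # toy embedding size (assumption, not the repository's setting)


def token_vector(token):
    """Deterministic pseudo-embedding for a token (toy stand-in for FastText)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def encode(question):
    """Shared encoder: both questions of a pair go through this same function."""
    tokens = question.lower().split()
    if not tokens:
        return [0.0] * DIM
    vectors = [token_vector(t) for t in tokens]
    # Average the token vectors dimension by dimension.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_similarity(q1, q2):
    """Score a question pair by comparing the two encodings."""
    return cosine_similarity(encode(q1), encode(q2))
```

Because the toy encoder is untrained, its scores only reflect word overlap. Training on the labeled pairs is what lets the real encoder place semantically similar questions close together.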
- Python 3
- Pip 3
- Tensorflow
- FastText
- faiss
Install the dependency packages listed in requirements.txt:
pip install -r requirements.txt
- Quora questions dataset is provided in ./data_repository directory.
- To train the model, run:
python train_siamese_network.py
Open Prediction.ipynb in Jupyter Notebook to explore the prediction module.
Given the question "Is it healthy to eat egg whites every day?", the most similar questions are:
- is it bad for health to eat eggs every day
- is it healthy to eat once a day
- is it unhealthy to eat bananas every day
- is it healthy to eat bread every day
- is it healthy to eat fish every day
- what high protein foods are good for breakfast
- how do you drink more water every day
- what will happen if i drink a gallon of milk every day
- is it healthy to eat one chicken every day
- is it healthy to eat a whole avocado every day
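A ranked list like the one above comes from a nearest-neighbor search over question embeddings. The sketch below shows that step as a brute-force cosine search in plain Python. It is an illustration only: faiss is listed as a dependency and presumably performs this search in the notebook, but the function names here (`normalize`, `top_k_similar`) are hypothetical and are not part of faiss or of this repository.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def top_k_similar(query_vec, corpus_vecs, k=3):
    """Return (index, score) for the k corpus vectors closest to the query by cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        # The dot product of unit vectors equals their cosine similarity.
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((idx, score))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

The returned indices point back into the list of candidate questions. Brute force is fine for a few thousand questions, while an approximate index is the usual choice at the ~400k scale of the Quora dataset.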
Because of the maximum file size limit in git, the trained model is not included in the repository. You can download the pre-trained model from here, unzip it, and place it in the "./model_siamese_network" directory.
To train on a different dataset, build a dataset of similar and dissimilar text pairs. Empirically, you need at least ~200k pairs to achieve excellent performance. Try to keep similar and dissimilar pairs balanced; a 50%-50% split is a good choice.
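One simple way to reach a 50%-50% split is to downsample the larger class. The sketch below assumes each pair is held in memory as a `(text_a, text_b, label)` tuple, with label 1 for similar and 0 for dissimilar. That layout and the `balance_pairs` helper are assumptions for illustration, not the format expected by train_siamese_network.py.

```python
import random


def balance_pairs(pairs, seed=0):
    """Downsample the majority class so similar and dissimilar pairs are 50/50.

    `pairs` is a list of (text_a, text_b, label) tuples, label 1 = similar, 0 = dissimilar.
    """
    similar = [p for p in pairs if p[2] == 1]
    dissimilar = [p for p in pairs if p[2] == 0]
    n = min(len(similar), len(dissimilar))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = rng.sample(similar, n) + rng.sample(dissimilar, n)
    rng.shuffle(balanced)
    return balanced
```

Downsampling discards data, so when the minority class is small it may leave the dataset below the ~200k pairs suggested above. In that case, collecting more pairs of the rarer class is the safer route.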