by Yuming Yang
This project is to use a CNN-DCNN model to do semi-supervised or supervised binary text classification on Chinese text.
It can be viewed as a Keras based implementation of the classification model in the paper ("Deconvolutional Paragraph Representation Learning" by Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao and Lawrence Carin, NIPS 2017). Note that there are some differences on the layer settings and loss function which make the model easier to train and fit the Chinese text.
I have implemented both a baseline purely CNN model and a semi-supervised CNN_DCNN model, you can separately train and test on one model by selecting model type.
After some modification on input and layer settings, this project could be used to work on many other tasks like text summarization and paragraph reconstruction.
- Train your selected model by providing pos.txt and neg.txt and save the model. See usage in train mode.
- Do predictions on your test text based on your trained model. See usage in test mode.
- You can also tune the parameters in the tuning mode. So far, it only supports hidden layer size tuning in CNN. You can add more parameters to tune by modifying the function model_evaluate in DCNN.py easily
(If you just want to see an example, try directly do predictions using my trained model and test data. See Sample Usage for testing for datails)
python3 DCNN.py \
-m train or test \
-k an integer for number of folds in cross validation for choosing hyperparameters \
-s hidden layer size \
-d /model_file/path/to/dict.p \
for train, need:
-i1 /raw_data/path/to/pos.txt \
-i2 /raw_data/path/to/neg.txt \
-o /model_file/path/to/output_model \
-e epochs
for test, need:
-a /model_file/path/to/trained_model \
-t /raw_data/path/to/test.txt \
-o /output_file/path/to/output_predictions \
for tuning, need:
-i1 /raw_data/path/to/pos.txt \
-i2 /raw_data/path/to/neg.txt \
-ps parameter sets (array of hidden_size you want to try)
for training:
python DCNN.py \
-m train \
-i1 C:\\Users\\Umean\\Desktop\\pos.txt \
-i2 C:\\Users\\Umean\\Desktop\\neg.txt \
-mt DCNN \
-o C:\\Users\\Umean\\Desktop\\DCNN_model.h5
for testing(predicting):
python DCNN.py \
-m test \
-t C:\\Users\\Umean\\Desktop\\test.txt \
-a DCNN_model.h5 \
-mt DCNN \
-o C:\\Users\\Umean\\Desktop\\predictions.txt
for tuning:
python DCNN.py \
-m tuning \
-mt DCNN \
-i1 C:\\Users\\Umean\\Desktop\\pos.txt \
-i2 C:\\Users\\Umean\\Desktop\\neg.txt \
-ps [100, 300, 500]
max_length = 20
max_words = 5000
embed_size = 300
filter_size = 300 (number of filters)
strides = 2
window_size = 4 (filter shape)