Convolutional Neural Network for Text Classification in Keras
This is a Keras implementation of Yoon Kim's paper Convolution Neural Networks for Sentence Classification with the addition that this code also works for the Glove vectors and Fasttext vectors.
- numpy
- keras
- cPickle
-
Download the pre-trained Google
word2vec
word embedding vectors as a binary file from here -
Pre-process the text data
from text_processing_util import TextProcessing
tp = TextProcessing(texts, labels, EMBEDDING_DIM, MAX_SEQUENCE_LENGTH, MAX_NB_WORDS, VALIDATION_SPLIT)
where
- texts: a list of sentences.
- labels: a list of labels corresponding to the sentences in the list texts.
- MAX_SEQUENCE_LENGTH: maximum length of the sentence to be considered, longer sentences will be terminated at this length.(default is 50)
- MAX_NB_WORDS: maximum number of words to be used in the model (default is 10000).
- EMBEDDING_DIM: dimension of the word vectors (default is 300 for google word2vec).
- VALIDATION_SPLIT: fraction of data to be used for validation. (default is 0.2).
- Split into train and test data.
x_train, y_train, x_val, y_val, word_index = tp.preprocess()
- Build the embeddings index.
embeddings_index = tp.build_embedding_index_from_word2vec(path_to_wordvec_file, word_index)
- Serialize the data after the processing.
import cPickle
cPickle.dump([word_index, embeddings_index], open('tokenization_and_embedding.p', 'wb'))
- Get labels index.
labels_index = tp.labels_index
- Build the CNN model
from text_cnn import kimCNN
model = kimCNN(EMBEDDING_DIM, MAX_SEQUENCE_LENGTH, MAX_NB_WORDS, embeddings_index, word_index, labels_index=labels_index)
- Fit the model
model.fit(x=x_train, y=y_train, batch_size=50, epochs=25 , validation_data=(x_val, y_val))
For a detailed example see example.py
. This is the same example used in Kim's paper and the original theano code.