- Text classification models
- Sogou corpus
- fastText
- RNNs
- CNNs
- HAN
- Discussion
- Sentiment analysis
- Unbalanced dataset
- Averaged word (with n-grams) vectors + softmax [1].
- Just like the Continuous BOW model, except that the center word is now replaced by the label.
- Viewed another way, fastText is like a CNN configured with window size = `1` (unigram) or `n` (n-grams) and average pooling [2].
- When training word vectors, fastText uses subword n-gram information.
- When training a text classifier, fastText has both subword n-gram parameters (i.e., `minn` and `maxn`) and a word n-gram parameter (i.e., `wordNgrams`).
- Fast: hierarchical softmax
  - Built on a Huffman tree.
  - Reduces computational complexity from `O(k*h)` to `O(log(k)*h)`, where `k` is the number of categories and `h` is the hidden dimension.
  - At test time, each node has a probability. A depth-first search that tracks the maximum probability gives the top-1 prediction. With a binary heap, the top-T predictions are computed at a cost of `O(log(T))`.
- Accuracy: n-grams with the hashing trick
  - Incorporates word order information. Higher-order n-grams (i.e., bigrams, trigrams, up to n-grams = 5) perform better.
  - 10M hash bins for bigrams, and 100M for higher-order n-grams [1].
  - N-grams hashed to the same bin share an embedding vector [3].
  - Incorporate POS information [from Alibaba meeting].
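A minimal sketch of how these pieces map onto the `fasttext` Python API; the training-file path, label format, and hyper-parameter values below are illustrative, not the settings used in [1]:

```python
import fasttext

# train.txt: one example per line, prefixed with "__label__<class>" (fastText's expected format)
model = fasttext.train_supervised(
    input="train.txt",   # illustrative path
    wordNgrams=2,        # word n-grams
    minn=3, maxn=6,      # subword (character) n-gram range
    bucket=2000000,      # number of hash bins shared by the n-grams (hashing trick)
    loss="hs",           # hierarchical softmax for speed
)
print(model.predict("some input text", k=3))  # top-3 labels with probabilities
```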
[1] Bag of Tricks for Efficient Text Classification
[2] https://www.zhihu.com/question/48345431
[3] http://albertxiebnu.github.io/fasttext/
- `LSTM_text_classification_version_1.ipynb`. See Chinese notes (中文解读).
- `LSTM_text_classification_version_2.ipynb`. See Chinese notes (中文解读).
  - Concatenates character features and word features to feed into the FC layer (see the sketch below).
- To be done: LSTM + Attention, Bidirectional LSTM + Attention
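A hedged Keras sketch of the version-2 idea above (character-level and word-level features concatenated before the FC layer); all sizes are illustrative and the notebooks' exact architecture may differ:

```python
import tensorflow as tf

word_vocab, char_vocab, num_classes = 20000, 100, 2

word_in = tf.keras.Input(shape=(100,), dtype="int32")   # word-id sequence
char_in = tf.keras.Input(shape=(300,), dtype="int32")   # char-id sequence

# Separate embedding + LSTM encoders for words and characters
word_feat = tf.keras.layers.LSTM(128)(tf.keras.layers.Embedding(word_vocab, 128)(word_in))
char_feat = tf.keras.layers.LSTM(64)(tf.keras.layers.Embedding(char_vocab, 32)(char_in))

x = tf.keras.layers.concatenate([word_feat, char_feat])  # fuse char and word features
out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

model = tf.keras.Model([word_in, char_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```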
OpenAI's work that finds a sentiment neuron: the model consists of an unsupervised language model plus logistic regression with L1 regularization.
[1] https://richliao.github.io/supervised/classification/2016/12/26/textclassifier-HATN/
[2] http://tobiaslee.top/2017/08/29/Attention-based-LSTM-for-Text-Classification/
[3] https://www.cloudsek.com/announcements/blog/hierarchical-attention-text-classification/
- Convert the sentence to a tensor of shape `[height=seq_len, width=embedding_size, channels=1]` with word embeddings.
- Apply convolution and max-pooling to the tensor.
- Fully-connected layer with softmax.
- Input: `(batch_size, height=seq_length, width=embedding_size, channels=1)`. `tf.nn.embedding_lookup`, `tf.expand_dims`
- for f in filter_sizes:
  - Convolution `tf.nn.conv2d`
    - Conv - add bias - ReLU
    - Filter: `(filter_height=f, filter_width=embedding_size, in_channels=1, out_channels=num_filters)`
    - Output tensor: `(batch_size, seq_length-filter_size+1 (stride=1), 1, num_filters)`
  - Max-pool `tf.nn.max_pool`
    - ksize: `[1, seq_length-filter_size+1, 1, 1]`
    - Output tensor: `(batch_size, 1, 1, num_filters)`
- Concatenate the output tensors for each filter_size to `(batch_size, 1, 1, len(filter_sizes)*num_filters)` and `tf.reshape` to `(batch_size, len(filter_sizes)*num_filters)`
- FC1 with drop-out: `(batch_size, len(filter_sizes)*num_filters)`
- FC2: `(batch_size, num_classes)`
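A minimal Keras sketch of the pipeline above; hyper-parameter values are illustrative, and the notebook itself follows the `tf.nn` implementation from [1]:

```python
import tensorflow as tf

vocab_size, embedding_size, seq_length = 10000, 128, 100
filter_sizes, num_filters, num_classes = [3, 4, 5], 100, 2

inputs = tf.keras.Input(shape=(seq_length,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, embedding_size)(inputs)
x = tf.keras.layers.Reshape((seq_length, embedding_size, 1))(x)          # add channels=1

pooled = []
for f in filter_sizes:
    # Convolve over f consecutive words across the full embedding width
    conv = tf.keras.layers.Conv2D(num_filters, (f, embedding_size),
                                  activation="relu")(x)                  # (batch, seq_length-f+1, 1, num_filters)
    pool = tf.keras.layers.MaxPooling2D((seq_length - f + 1, 1))(conv)   # (batch, 1, 1, num_filters)
    pooled.append(pool)

x = tf.keras.layers.Concatenate(axis=-1)(pooled)                         # (batch, 1, 1, len(filter_sizes)*num_filters)
x = tf.keras.layers.Reshape((len(filter_sizes) * num_filters,))(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(len(filter_sizes) * num_filters, activation="relu")(x)  # FC1
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)              # FC2

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```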
`CNN_text_classification.ipynb`. See Chinese notes (中文解读).
[1] http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
[2] https://github.com/gaussic/text-classification-cnn-rnn
- To do
- To do
4. HAN [1]
5. BERT
- Pre-training
- Within-task and in-domain further pre-training can significantly boost its performance
- Fine-tuning
- Among features from different layers, the top layer of BERT is the most useful for text classification
- With an appropriate layer-wise decreasing learning rate, BERT can overcome the catastrophic forgetting problem
- Long text
- Truncation methods
- Head-only: keep the first 510 tokens
- Tail-only: keep the last 510 tokens
- Head+tail: empirically select the first 128 and the last 382 tokens
- Hierarchical methods
- The input text is first divided into `k = L/510` fractions, which are fed into BERT to obtain the representations of the `k` text fractions. The representation of each fraction is the hidden state of the `[CLS]` token of the last layer. Then mean pooling, max pooling, and self-attention are used to combine the representations of all the fractions
- Official implementation
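A minimal sketch of the head+tail truncation described above; the function name and the token-id list representation are illustrative:

```python
def head_tail_truncate(token_ids, max_len=510, head_len=128, tail_len=382):
    """Keep the first head_len and last tail_len tokens when the sequence exceeds max_len."""
    if len(token_ids) <= max_len:
        return token_ids
    return token_ids[:head_len] + token_ids[-tail_len:]
```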
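And a hedged sketch of the layer-wise decreasing learning rate mentioned under fine-tuning above, assuming a HuggingFace `BertModel`; the base learning rate and decay factor are illustrative:

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
base_lr, decay = 2e-5, 0.95                      # illustrative values
num_layers = model.config.num_hidden_layers

param_groups = []
for name, param in model.named_parameters():
    if name.startswith("encoder.layer."):
        layer_id = int(name.split(".")[2])
        lr = base_lr * decay ** (num_layers - 1 - layer_id)  # lower layers get smaller LRs
    elif name.startswith("embeddings."):
        lr = base_lr * decay ** num_layers                   # embeddings get the smallest LR
    else:
        lr = base_lr                                         # pooler / task head uses the base LR
    param_groups.append({"params": [param], "lr": lr})

optimizer = torch.optim.AdamW(param_groups, lr=base_lr)
```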
- For long sentences, CNNs are better than RNNs [1].
- Long-term dependency is not significant for the text classification problem [2].
- Recursive NNs incorporate syntax information.
- Tricks
- Normalization
- Dynamic max-pooling
- We can call a free API (e.g., ai.baidu.com) to build training data; see an example.
[1] https://www.zhihu.com/question/41625896
[2] https://hanxiao.github.io/2018/06/25/4-Encoding-Blocks-You-Need-to-Know-Besides-LSTM-RNN-in-Tensorflow/?from=timeline&isappinstalled=0
- Data
  - Up-sampling of the minority class
    - Replicating the same samples
    - Synthesizing new samples with SMOTE
    - Data augmentation by replacing verbs or adjectives
  - Down-sampling of the majority class
  - Weighted sampling: first assign each sample a weight of `1 / class_count`, then sample from a multinomial distribution (see the sketch after this list)
- Model
  - Ensemble
    - Bagging and boosting
  - One-class SVM
- Loss function
  - `class_weight` is used as a parameter to weight the loss: weight less frequent classes higher than very frequent classes. To add class weights to `nn.CrossEntropyLoss`, compute them with `class_weight.compute_class_weight('balanced', np.unique(y_train), y_train)` from `sklearn.utils` (see the sketch after this list).
  - Focal loss (multi-class): `FL = -(1 - pt)^gamma * log(pt)`
- Evaluation metric
- The default decision threshold is not valid for imbalanced data
- AUC gives performance across the whole range of decision thresholds; ROC curves are typically used in binary classification
  - TPR (True Positive Rate) = # true positives / # positives = Recall = TP / (TP + FN)
  - FPR (False Positive Rate) = # false positives / # negatives = FP / (FP + TN)
- Precision = # true positives / # predicted positives = TP / (TP + FP)
- Recall = # true positives / # positives = TP / (TP + FN)
  - Recall = True Positive Rate
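A minimal sketch of the weighted multinomial sampling mentioned under Data above, using PyTorch's `WeightedRandomSampler`; the choice of library and the toy data are assumptions:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

y_train = np.array([0] * 90 + [1] * 10)              # toy, imbalanced labels
class_count = np.bincount(y_train)
sample_weights = 1.0 / class_count[y_train]          # each sample weighted by 1 / class_count

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(y_train),
    replacement=True,                                # draws follow a multinomial distribution
)
dataset = TensorDataset(torch.randn(len(y_train), 8), torch.as_tensor(y_train))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
```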
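And a hedged sketch of the class-weighted `nn.CrossEntropyLoss` and the multi-class focal loss `FL = -(1 - pt)^gamma * log(pt)` mentioned under Loss function above; the toy labels and gamma value are illustrative:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.utils import class_weight

y_train = np.array([0, 0, 0, 0, 1, 2])                    # toy labels, skewed toward class 0
weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: FL = -(1 - pt)^gamma * log(pt), down-weighting easy examples."""
    logpt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = logpt.exp()
    return (-(1 - pt) ** gamma * logpt).mean()

logits = torch.randn(6, 3)                                 # (batch, num_classes)
targets = torch.as_tensor(y_train)
print(criterion(logits, targets), focal_loss(logits, targets))
```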
[1] https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
[2] https://www.reddit.com/r/MachineLearning/comments/12evgi/classification_when_80_of_my_training_set_is_of/