Implementation of some basic Image Annotation methods (using various loss functions & threshold optimization) on Corel-5k dataset with PyTorch library
Usually, Corel-5k is divided into 3 parts: a training set of 4000 images, a validation set of 500 images, and a test set of 499 images. In other words, the total number of images used for training is 4500 (18,000 with the fake images) and the 499 test images are used for evaluation. (After downloading Corel-5k, replace its 'images' folder with the corresponding 'images' folder in the 'Corel-5k' folder.)
You can see the distribution of some labels below (out of 5000 images in total):
class name | count |
---|---|
sails | 2 |
orchid | 2 |
butterfly | 4 |
cave | 6 |
... | ... |
cars | 151 |
flowers | 296 |
grass | 497 |
tree | 947 |
sky | 988 |
water | 1120 |
The authors of the ML-WGAN paper proposed a multi-label data augmentation method based on Wasserstein-GAN (the ML-WGAN process is shown in the picture below). Due to the nature of multi-label images, two images in a common dataset usually have different numbers and types of labels; therefore, WGAN cannot be used directly for multi-label data augmentation. The paper suggests using only one multi-label image at a time, since the noise (z) fed to the generator can only be fitted iteratively to the distribution of that single image. Because the generated images use only one original image as the real data distribution, they all share the same number and type of labels, and each has its own local differences while the overall distributions remain similar.
There is a 'DataAugmentation' folder that contains the code of ML-WGAN, which is similar to the paper "Improved Training of Wasserstein GANs". Because one original image has to be used as the real data distribution, I trained the network for each image individually and generated 3 more images for every original image, which increased the number of training images to 18,000.
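For context, a minimal sketch of the gradient-penalty term that "Improved Training of Wasserstein GANs" adds to the critic loss (illustrative PyTorch, not necessarily identical to the code in the 'DataAugmentation' folder):

```python
import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    """WGAN-GP term: (||grad_xhat D(xhat)||_2 - 1)^2 on random interpolates xhat."""
    batch_size = real.size(0)
    eps = torch.rand(batch_size, 1, 1, 1, device=device)   # one mixing coefficient per sample
    interpolates = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    scores = critic(interpolates)
    grads = torch.autograd.grad(outputs=scores, inputs=interpolates,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True, retain_graph=True)[0]
    grads = grads.view(batch_size, -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# critic objective: critic(fake).mean() - critic(real).mean() + lambda_gp * gradient_penalty(...)
```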
An example of the generated images:
The images below show the structure of these models:
Xception number of trainable parameters: 21,339,692
ResNeXt50 number of trainable parameters: 23,512,644
TResNet-m number of trainable parameters: 29,872,772
ResNet101 number of trainable parameters: 43,032,900
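These counts can be reproduced by summing the sizes of all trainable parameters. A small sketch using a torchvision ResNet101 with a 260-way classification head (260 outputs is the label count implied by the ResNet101 number above; the other backbones would come from their respective libraries):

```python
import torch.nn as nn
from torchvision import models

def count_trainable(model: nn.Module) -> int:
    """Total number of elements in parameters that require gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = models.resnet101()
model.fc = nn.Linear(model.fc.in_features, 260)   # one output per Corel-5k label
print(count_trainable(model))                      # prints 43,032,900 (matches the count above)
```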
The aforementioned evaluation metrics formulas can be seen below:
Another evaluation metric used for datasets with a large number of tags is N+, the number of labels that are correctly recalled at least once:
Note that the per-class measures treat all classes equally regardless of their sample size, so a model can obtain a high per-class score by focusing on getting rare classes right. To compensate for this, I also measure overall (per-image) precision/recall, which treats all samples equally regardless of their classes.
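As an illustration (not necessarily the repo's exact evaluation code), the per-class metrics, per-image metrics, and N+ can be computed from binary prediction and ground-truth matrices as follows:

```python
import numpy as np

def annotation_metrics(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-9):
    """y_true, y_pred: binary {0,1} matrices of shape (num_images, num_classes)."""
    tp_class = (y_true * y_pred).sum(axis=0)                  # true positives per class
    # per-class: average over classes, so rare labels count as much as frequent ones
    p_class = np.mean(tp_class / (y_pred.sum(axis=0) + eps))
    r_class = np.mean(tp_class / (y_true.sum(axis=0) + eps))
    # per-image: average over images, regardless of which classes they contain
    tp_image = (y_true * y_pred).sum(axis=1)                  # true positives per image
    p_image = np.mean(tp_image / (y_pred.sum(axis=1) + eps))
    r_image = np.mean(tp_image / (y_true.sum(axis=1) + eps))
    f1 = lambda p, r: 2 * p * r / (p + r + eps)
    n_plus = int((tp_class > 0).sum())                        # classes with non-zero recall
    return {"per-class": (p_class, r_class, f1(p_class, r_class)),
            "per-image": (p_image, r_image, f1(p_image, r_image)),
            "N+": n_plus}
```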
To train models in Spyder IDE use the code below:
run main.py --model {select model} --loss-function {select loss function}
Please note that:
- You should put ResNet101, ResNeXt50, Xception or TResNet in {select model}.
- You should put BCELoss, FocalLoss, AsymmetricLoss or LSEPLoss in {select loss function}.
Using augmented data, you can train models as follows:
run main.py --model {select model} --loss-function {select loss function} --augmentation
To evaluate the model in Spyder IDE use the code below:
run main.py --model {select model} --loss-function {select loss function} --evaluate
The general form of a binary classification loss is shown in the image below:
The binary cross entropy (BCE) loss function is one of the most popular loss functions in multi-label classification and image annotation; it is defined as follows for the i-th label:
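In PyTorch, this is typically implemented with `nn.BCEWithLogitsLoss`, which applies the sigmoid and the BCE term in one numerically stable operation (the tensor shapes below are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()                 # sigmoid + BCE in one step

logits = torch.randn(8, 260)                       # raw model outputs for a batch of 8 images
targets = torch.randint(0, 2, (8, 260)).float()    # multi-hot ground-truth labels
loss = criterion(logits, targets)                  # mean over all (image, label) pairs
```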
best model | global-pooling | batch-size | num of training images | image-size | epoch time |
---|---|---|---|---|---|
TResNet-m | avg | 32 | 4500 | 448 * 448 | 135s |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.726 | 0.589 | 0.650 |
testset per-class metrics | 0.453 | 0.385 | 0.416 |
data | N+ |
---|---|
testset | 147 |
The following picture illustrates the MCC formula:
The Matthews correlation coefficient (MCC) measures the correlation between the actual and predicted labels, producing a number between -1 and 1. Hence, it only gives a good score if the model is accurate in all components of the confusion matrix, which makes it one of the most robust metrics against imbalanced datasets.
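The threshold optimization used in the results can be read as a per-class search for the decision threshold that maximizes MCC on held-out predictions. A minimal grid-search sketch with scikit-learn (an assumption about the procedure, not a copy of the repo's code):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_thresholds(y_true: np.ndarray, y_prob: np.ndarray,
                    grid=np.arange(0.05, 0.95, 0.05)) -> np.ndarray:
    """Per-class thresholds maximizing MCC. y_true: {0,1}, y_prob: sigmoid outputs, shape (N, C)."""
    thresholds = np.empty(y_true.shape[1])
    for c in range(y_true.shape[1]):
        scores = [matthews_corrcoef(y_true[:, c], (y_prob[:, c] >= t).astype(int)) for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds
```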
best model | global-pooling | batch-size | num of training images | image-size | epoch time |
---|---|---|---|---|---|
TResNet-m | avg | 32 | 4500 | 448 * 448 | 135s |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.726 | 0.589 | 0.650 |
testset per-class metrics | 0.453 | 0.385 | 0.416 |
testset per-class metrics + MCC | 0.445 | 0.451 | 0.448 |
data | N+ |
---|---|
testset | 147 |
testset + MCC | 164 |
BCE loss leads to overconfidence in the convolutional model, which makes it harder for the model to generalize. In fact, BCE loss is only low when the model is absolutely sure (more than 80% or 90%) about the presence or absence of a label. However, as seen in the following picture, when the model predicts a probability of 60% or 70%, the focal loss is already lower than BCE.
The focal loss formula for the i-th label is shown in the image below:
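A minimal sketch of that formula in PyTorch, with a single focusing parameter γ shared by positive and negative labels (γ = 3 in the experiments below); the repo's implementation may differ in details:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 3.0) -> torch.Tensor:
    """Multi-label focal loss: (1 - p_t)^gamma * BCE, averaged over all (image, label) pairs."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)   # probability assigned to the true class
    return ((1 - p_t) ** gamma * bce).mean()
```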
best model | global-pooling | batch-size | num of training images | image-size | epoch time | γ |
---|---|---|---|---|---|---|
TResNet-m | avg | 32 | 4500 | 448 * 448 | 135s | 3 |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.758 | 0.581 | 0.658 |
testset per-class metrics | 0.452 | 0.366 | 0.405 |
testset per-class metrics + MCC | 0.483 | 0.451 | 0.466 |
data | N+ |
---|---|
testset | 139 |
testset + MCC | 162 |
As mentioned above, the distribution of labels in Corel-5k and other annotation datasets is extremely unbalanced. The training set contains labels that appear only once, as well as labels that appear more than 1,000 times. Unfortunately, due to the nature of annotation datasets, little can be done to overcome this problem.
But there is another imbalance, between the number of positive and negative labels in a picture. In simple words, most multi-label pictures contain far fewer positive labels than negative ones (for example, each image in the Corel-5k dataset contains on average only 3.4 positive labels).
1. Asymmetric Focusing
Unlike the focal loss, which uses the same γ for positive and negative labels, the two can be decoupled by taking γ+ as the focusing level for positive labels and γ- as the focusing level for negative labels. Since we want to emphasize the contribution of positive labels, we usually set γ- > γ+.
2. Asymmetric Probability Shifting
Asymmetric focusing reduces the contribution of negative labels to the loss when their probability is low (soft thresholding). However, this attenuation is not always sufficient due to the high level of imbalance in multi-label classification. Therefore, we can use another asymmetric mechanism, probability shifting, which performs hard thresholding on negative labels with very low probability and discards them completely. The shifted probability is defined as p_m = max(p - m, 0), where the probability margin m ≥ 0 is a tunable hyperparameter.
In the image below, the asymmetric loss formula for the i-th label can be seen:
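A minimal sketch of the asymmetric loss combining both mechanisms, asymmetric focusing and probability shifting (γ+ = 0, γ- = 4, m = 0.05 as in the table below); the official implementation by Ridnik et al. differs in engineering details such as in-place operations:

```python
import torch

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric loss (Ridnik et al., ICCV 2021) for multi-hot targets."""
    p = torch.sigmoid(logits)
    p_neg = (p - margin).clamp(min=0)             # probability shifting: p_m = max(p - m, 0)
    loss_pos = targets * (1 - p) ** gamma_pos * torch.log(p.clamp(min=eps))
    loss_neg = (1 - targets) * p_neg ** gamma_neg * torch.log((1 - p_neg).clamp(min=eps))
    return -(loss_pos + loss_neg).mean()
```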
best model | global-pooling | batch-size | num of training images | image-size | epoch time | γ+ | γ- | m |
---|---|---|---|---|---|---|---|---|
TResNet-m | avg | 32 | 4500 | 448 * 448 | 141s | 0 | 4 | 0.05 |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.624 | 0.688 | 0.654 |
testset per-class metrics | 0.480 | 0.522 | 0.500 |
testset per-class metrics + MCC | 0.473 | 0.535 | 0.502 |
data | N+ |
---|---|
testset | 179 |
testset + MCC | 184 |
best model | global-pooling | batch-size | num of training images | image-size | epoch time |
---|---|---|---|---|---|
ResNeXt50 | avg | 32 | 4500 | 224 * 224 | 45s |
data | precision | recall | f1-score |
---|---|---|---|
testset per-image metrics | 0.490 | 0.720 | 0.583 |
testset per-class metrics | 0.403 | 0.548 | 0.464 |
data | N+ |
---|---|
testset | 188 |
The result of the trained model with LSEP loss on one batch of test data:
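For completeness, a minimal sketch of the LSEP (log-sum-exp pairwise) ranking loss from Li et al. (CVPR 2017), which pushes the score of every positive label above the score of every negative label in each image (an illustrative implementation, not the repo's exact code):

```python
import torch

def lsep_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """scores: raw label scores (B, C); targets: multi-hot labels (B, C)."""
    diff = scores.unsqueeze(1) - scores.unsqueeze(2)          # diff[b, i, j] = s_j - s_i
    mask = targets.unsqueeze(2) * (1 - targets).unsqueeze(1)  # 1 where i is positive and j is negative
    # log(1 + sum over (positive, negative) pairs of exp(s_neg - s_pos)), averaged over the batch
    return torch.log(1 + (mask * diff.exp()).sum(dim=(1, 2))).mean()
```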
To resolve this issue, we can either reduce the contribution of negative labels to the loss or increase the contribution of positive labels.
The loss gradient for positive labels indicates that it only pushes a very small proportion of hard positives to a high probability and ignores a large ratio of semi-hard ones with a medium probability.
The contribution of easy negative labels decreases further as γ is increased, but on the other hand, the gradients of more positive labels vanish.
It is found that the loss gradients of negative labels with a large probability (p > 0.9) are very low, indicating that they can be accepted as missing labels.
In order to compare the results, I ran many experiments, including changing the resolution of the images (from 224 * 224 to 448 * 448) and changing the global pooling of the convolutional models (from global average pooling to global maximum pooling). Among these experiments, the aforementioned results are the best.
Unfortunately, the data augmentation method (ML-WGAN) did not produce the expected results and could not alleviate the overfitting problem.
In this project, I used a variety of CNNs and loss functions without taking label correlations into account. By using methods such as graph convolutional networks (GCNs) or recurrent neural networks (RNNs) that consider the semantic relationships between labels, better results may be obtained.
Y. Li, Y. Song, and J. Luo.
"Improving Pairwise Ranking for Multi-label Image Classification" (CVPR - 2017)
T. Ridnik, E. Ben-Baruch, N. Zamir, A. Noy, I. Friedman, M. Protter, and L. Zelnik-Manor.
"Asymmetric Loss For Multi-Label Classification" (ICCV - 2021)
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville.
"Improved Training of Wasserstein GANs" (arXiv - 2017)