This is a personal project to develop a computer vision classification model for food images. Conveniently, the Food101 dataset is available as a TensorFlow Datasets object (for an overview of all datasets available in TensorFlow Datasets, see https://www.tensorflow.org/datasets/overview). There are 101 classes altogether, some of which look similar, e.g. baklava and apple pie. We aim to beat the state-of-the-art classification accuracy of 77% reported in the DeepFood paper (https://www.researchgate.net/publication/304163308_DeepFood_Deep_Learning-Based_Food_Image_Recognition_for_Computer-Aided_Dietary_Assessment).
Our final model achieved a validation accuracy of 84.4%.
- Preprocess the data by scaling all images to the same dimensions and normalizing pixel values.
- Develop a data loader to feed batches of images for model training (a sketch follows this list).
- Experiment with different pre-trained image classification models from TensorFlow as feature extractors.
- Unfreeze 5-10 convolutional layers in the best-performing feature extractor and finetune the model for 5 more epochs.
- Model evaluation: ascertain which classes the model struggles to classify.
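The preprocessing and loading steps can be expressed as a single `tf.data` pipeline. Below is a minimal sketch, not the notebook's exact code: the TFDS `food101` dataset name is real, but the 224x224 input size and batch size of 32 are illustrative choices.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

IMG_SIZE = 224   # assumed input resolution; adjust to the chosen backbone
BATCH_SIZE = 32  # illustrative batch size

# Load Food101 as (image, label) pairs straight from TensorFlow Datasets.
(train_data, test_data), ds_info = tfds.load(
    "food101",
    split=["train", "validation"],
    as_supervised=True,
    with_info=True,
)

def preprocess(image, label):
    """Resize to a fixed size and cast to float32. No 0-1 rescaling:
    EfficientNet models normalize inputs internally (noted below)."""
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    return tf.cast(image, tf.float32), label

# Map, shuffle, batch, and prefetch so the GPU never waits on input.
train_data = (
    train_data.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1000)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)
test_data = (
    test_data.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)
```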
git clone https://github.com/tituslhy/FoodVision-Tensorflow
pip install -r requirements.txt
The flags for the python3 run are:
- --img: The path to the image to classify. Required.
- --plot: Boolean. Defaults to False. If True, the script plots the image and its predicted classification.
python3 main.py --img "img.jpeg" --plot "True"
- There is a bug when using mixed precision training with EfficientNet. The current fix is to use TensorFlow version 2.4 instead (see the sketch at the end of these notes).
- EfficientNet models have input normalization built in, so there is no need to rescale pixel values to the 0-1 range during preprocessing; raw pixel values in the 0-255 range can be fed directly.
- Model training and annotations can be found under
Notebooks > FoodVision101 Model Development.ipynb
- Though we compare our performance to the DeepFood paper, this is not a like-for-like comparison: the DeepFood researchers used object detection algorithms to identify multiple objects within the same image, whereas this project was trained only on single-object images and should only be used for single-object classification.
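For reference, here is a minimal sketch of how mixed precision is typically enabled in Keras (the `tf.keras.mixed_precision` API shown is the non-experimental one introduced in TensorFlow 2.4). The backbone, input size, and head are illustrative, not the notebook's exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compute in float16 while keeping variables in float32.
mixed_precision.set_global_policy("mixed_float16")

# Illustrative model: EfficientNetB0 base with a 101-class head.
inputs = tf.keras.layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg")(inputs)
x = tf.keras.layers.Dense(101)(x)
# Keep the final softmax in float32 for numerical stability.
outputs = tf.keras.layers.Activation("softmax", dtype=tf.float32)(x)
model = tf.keras.Model(inputs, outputs)
```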
We experimented with the following feature extractors before finetuning:
| Experiment | Feature extractor | Validation accuracy (on 15% of test data) | Findings |
| --- | --- | --- | --- |
| InceptionResNetV2 with data augmentation | InceptionResNetV2 applies residual connections to Google's InceptionV3 model (https://arxiv.org/pdf/1602.07261.pdf). | 55% | Data augmentation caused the model to underfit severely rather than helping it generalize. |
| InceptionResNetV2 | Same as above, without data augmentation. | 61% | Performance improved significantly but remained mediocre, possibly due to the choice of feature extractor. |
| EfficientNetB0 | EfficientNet boasts very high accuracy despite very few parameters. Its backbone resembles MobileNetV2 in its use of mobile inverted bottleneck convolutions; where it differs is how efficiently it scales width, depth, resolution, or a combination of the three (https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html). | 75% | Performance improved significantly, with much faster training than prior experiments. |
| EfficientNetB4 | A larger version of EfficientNetB0. The thinking behind this experiment: if some is good, more must be better. | 72% | Performance was slightly lower, with much slower training. |
| EfficientNetV2B0 | EfficientNetV2 uses progressive learning: training starts on small images whose size increases progressively, with scaling and regularization adjusted dynamically throughout training (https://towardsdatascience.com/google-releases-efficientnetv2-a-smaller-faster-and-better-efficientnet-673a77bdd43c). | 75.5% | The best performance of all experiments, so we chose this model for finetuning. |
| EfficientNetV2B0 finetuned | EfficientNetV2B0 with all layers set to trainable. | 84.6% | We successfully exceeded the target we set ourselves! |
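For completeness, here is a sketch of the two-stage recipe behind the final row, plus the per-class error analysis from the evaluation step. It builds on the `train_data`/`test_data` pipeline sketched earlier; the hyperparameters are illustrative, and note that `tf.keras.applications.EfficientNetV2B0` only ships with newer TensorFlow releases (roughly 2.8+), so older versions would need the equivalent model from TensorFlow Hub:

```python
import numpy as np
import tensorflow as tf

# Stage 1: feature extraction - train only the classification head.
base = tf.keras.applications.EfficientNetV2B0(include_top=False, pooling="avg")
base.trainable = False

inputs = tf.keras.layers.Input(shape=(224, 224, 3))
x = base(inputs, training=False)  # keep BatchNorm layers in inference mode
outputs = tf.keras.layers.Dense(101, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_data, validation_data=test_data, epochs=5)

# Stage 2: finetuning - unfreeze the base and recompile with a much
# lower learning rate so the pre-trained weights shift only gently.
base.trainable = True
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_data, validation_data=test_data, epochs=5, initial_epoch=5)

# Evaluation: per-class accuracy reveals the classes the model struggles
# with (test_data must not be shuffled for true/predicted labels to align).
y_true = np.concatenate([y.numpy() for _, y in test_data])
y_pred = np.argmax(model.predict(test_data), axis=-1)
cm = tf.math.confusion_matrix(y_true, y_pred).numpy()
per_class_acc = cm.diagonal() / cm.sum(axis=1)
worst = np.argsort(per_class_acc)[:10]  # indices of the 10 hardest classes
```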