This project aims to predict the display names of fashion products from images using two distinct approaches. The dataset comprises fashion product images and their attributes, such as category, color, and season. The goal is to convert these images into descriptive display names.
The dataset used for this project is the Fashion Product Images Dataset from Kaggle. It includes:
- Images of fashion products.
- Attributes for each image, including category, color, brand, and season (Classification of These Attributes).
- Display names, which are the target labels to predict.
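Because the display names are the prediction target, each name has to be turned into a sequence of token ids before a sequence model can be trained on it. The following is a minimal sketch of one common way to do that, assuming whitespace tokenization and teacher forcing; the special tokens, the helper names `build_vocab` and `encode`, and the two sample names are illustrative assumptions, not taken from the notebooks.

```python
from collections import Counter

# Hypothetical special tokens; the notebooks may use different ones.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def build_vocab(names, min_count=1):
    """Map each word seen at least min_count times to an integer id."""
    counts = Counter(word for name in names for word in name.lower().split())
    vocab = {PAD: 0, START: 1, END: 2, UNK: 3}
    for word in sorted(counts):
        if counts[word] >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(name, vocab, max_len):
    """Return (decoder_input, decoder_target) id lists, both padded to max_len."""
    ids = [vocab.get(w, vocab[UNK]) for w in name.lower().split()]
    ids = ids[: max_len - 1]  # leave one slot for the start or end token
    decoder_input = [vocab[START]] + ids   # what the decoder sees at each step
    decoder_target = ids + [vocab[END]]    # what it should predict at each step
    pad = [vocab[PAD]] * (max_len - len(decoder_input))
    return decoder_input + pad, decoder_target + pad


# Made-up display names, used only to exercise the functions.
names = ["Puma Men Black Sneakers", "Nike Women Blue Tshirt"]
vocab = build_vocab(names)
dec_in, dec_out = encode("Puma Men Blue Sneakers", vocab, max_len=6)
print(dec_in)
print(dec_out)
```

The target sequence is the input sequence shifted left by one position, so at every step the model learns to predict the next word of the display name.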
For the approaches below, I utilize a pre-trained model developed for multi-label classification of fashion products.
You can find the details and code for this model in the Fashion Product Multilabel Classification repository.
I implemented two approaches to the image-to-text prediction problem. The first predicts the display name directly from the image features:
- Model: Utilizes the pre-trained classification model from the Fashion Product Multilabel Classification repository as a feature extractor. An additional RNN (LSTM) head is added to directly predict the display name from the image features.
- Implementation: Kaggle Notebook - Approach One
- Performance:
  - Average BLEU Score: 0.8994
  - Average ROUGE-1 F1 Score: 0.9532
  - Average ROUGE-2 F1 Score: 0.9394
  - Average ROUGE-L F1 Score: 0.9532
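To make the ROUGE-1 and ROUGE-2 figures easier to interpret, here is a minimal sketch of ROUGE-N F1 as clipped n-gram overlap between a reference and a predicted display name. It assumes lowercased whitespace tokenization and is not the scoring code used in the notebook, which may rely on a library with different tokenization or stemming; the two sample names are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, prediction, n=1):
    """ROUGE-N F1 between two whitespace-tokenized, lowercased strings."""
    ref = ngrams(reference.lower().split(), n)
    pred = ngrams(prediction.lower().split(), n)
    overlap = sum((ref & pred).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up pair: one wrong word out of four.
reference = "Puma Men Black Sneakers"
prediction = "Puma Men Blue Sneakers"
print(rouge_n_f1(reference, prediction, n=1))
print(rouge_n_f1(reference, prediction, n=2))
```

A single wrong word in the middle of a short name removes one unigram match but two bigram matches, which is why ROUGE-2 is typically lower than ROUGE-1 on the same predictions.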
The second approach works in two segments:
- Segment One: Attribute Classification
  - Model: A fine-tuned ResNet-50 that classifies attributes of fashion images such as category, base color, brand, and season.
  - Output: The predicted class for each attribute.
- Segment Two: Display Name Prediction
  - Inputs:
    - The class predictions from Segment One.
    - The encoded image features from the ResNet-50 model.
  - Model: An RNN (LSTM) model that takes these inputs and predicts the display name of the product.
- Implementation: Kaggle Notebook - Approach Two
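Segment Two receives two kinds of input: the predicted attribute classes and the encoded image features. The README does not say how they are combined, so the sketch below assumes the simplest option, which is to one-hot encode each predicted class and concatenate the result with the feature vector. The label sets in `ATTRIBUTE_CLASSES`, the helper names, and the four-value feature vector are hypothetical stand-ins (a pooled ResNet-50 embedding has 2048 values).

```python
# Hypothetical label sets; the real attribute vocabularies come from the dataset.
ATTRIBUTE_CLASSES = {
    "category": ["Apparel", "Footwear", "Accessories"],
    "base_color": ["Black", "Blue", "White"],
    "season": ["Summer", "Winter"],
}


def one_hot(label, classes):
    """Return a one-hot list for label within classes."""
    vec = [0.0] * len(classes)
    vec[classes.index(label)] = 1.0
    return vec


def build_decoder_context(predictions, image_features):
    """Concatenate one-hot attribute predictions with the image feature vector."""
    context = []
    for attribute, classes in ATTRIBUTE_CLASSES.items():
        context.extend(one_hot(predictions[attribute], classes))
    context.extend(image_features)
    return context


# Stand-in outputs from Segment One and the image encoder.
predictions = {"category": "Footwear", "base_color": "Black", "season": "Summer"}
image_features = [0.12, -0.53, 0.91, 0.04]
context = build_decoder_context(predictions, image_features)
print(len(context))  # 3 + 3 + 2 one-hot slots plus 4 feature values
```

A vector built this way could, for example, initialize the LSTM state or be fed alongside each input token; which of these the notebook does is not stated here.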
Note: The first approach achieved better accuracy in fewer epochs than the second approach.