In this project, I built a neural network model that generates a caption describing an image (more details below).
- I trained the model on 50k images and 250k captions (5 captions per image).
- Data source: http://cocodataset.org/#download
- The trained model is too large to upload to this site :( However, you can download my code.
- Model structure:
  - InceptionV3 pre-trained model
  - Feature map encoder
  - Attention
  - Decoder
- The title on the image below is the caption generated by the model.
- Attention maps: the dark and bright areas show how much each image region contributes when a specific word is generated.
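
To make the attention step and the attention maps more concrete, here is a minimal NumPy sketch. It is not the project's actual code. It assumes additive (Bahdanau-style) attention, which the description above does not specify, and an 8x8 grid of image locations, as InceptionV3's last convolutional layer produces for a 299x299 input. The function name `additive_attention`, the layer sizes, and the random weights (standing in for trained parameters) are all hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def additive_attention(features, hidden, W1, W2, v):
    """Score every image location against the current decoder state.

    features: (L, D) encoded feature map, one row per image location
    hidden:   (H,)   decoder state before generating the next word
    Returns the context vector (D,) and the attention weights (L,).
    """
    scores = np.tanh(features @ W1 + hidden @ W2) @ v  # (L,)
    weights = softmax(scores)                          # non-negative, sums to 1
    context = weights @ features                       # weighted sum of locations
    return context, weights


# Hypothetical sizes: 8x8 = 64 locations, D encoder units, H decoder units, A attention units.
L, D, H, A = 64, 256, 512, 128
rng = np.random.default_rng(0)
features = rng.normal(size=(L, D))
hidden = rng.normal(size=(H,))
W1 = rng.normal(scale=0.1, size=(D, A))  # random stand-ins for trained weights
W2 = rng.normal(scale=0.1, size=(H, A))
v = rng.normal(scale=0.1, size=(A,))

context, weights = additive_attention(features, hidden, W1, W2, v)

# One attention map per generated word: reshape the weights back onto the image grid.
attention_map = weights.reshape(8, 8)
```

Under these assumptions, upsampling `attention_map` to the image size and overlaying it on the image gives the kind of bright and dark regions described above, with one map per generated word.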