This repository contains a neural network that automatically generates captions for images.
The solution architecture consists of:
- A CNN encoder, which encodes the images into embedded feature vectors.
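
To make the encoder's role concrete, here is a minimal PyTorch sketch of a CNN that maps a batch of images to fixed-size feature vectors. It is an illustration only, not the repository's actual model: the class name `EncoderCNN`, the layer sizes, the `embed_size` of 256, and the 224x224 input resolution are all assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn


class EncoderCNN(nn.Module):
    """Toy CNN encoder (hypothetical): image batch -> embedded feature vectors."""

    def __init__(self, embed_size: int = 256):
        super().__init__()
        # Two strided convolutions followed by global average pooling.
        # A real captioning encoder would typically be much deeper.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # (B, 64, H, W) -> (B, 64, 1, 1)
        )
        # Linear projection into the embedding space.
        self.embed = nn.Linear(64, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.features(images)   # (B, 64, 1, 1)
        x = torch.flatten(x, 1)     # (B, 64)
        return self.embed(x)        # (B, embed_size)


encoder = EncoderCNN(embed_size=256)
images = torch.randn(4, 3, 224, 224)  # random stand-in for 4 RGB images
with torch.no_grad():
    features = encoder(images)
print(features.shape)  # torch.Size([4, 256])
```

Each image is reduced to one vector of length `embed_size`, so the output shape depends only on the batch size and the embedding dimension, not on the input resolution.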
These are some of the outputs given by the network on the COCO dataset: