In this project, we worked with both the flickr_8k and flickr_30k datasets, but we ran into storage and runtime complications with flickr_30k.
We built our image caption generator with an encoder-decoder model, using a CNN as the encoder and an LSTM as the decoder.
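As a rough illustration of how an encoder-decoder captioner is wired together at inference time, here is a minimal sketch in plain Python. It is not the project's code: `toy_encoder` and `toy_decoder_step` are hypothetical stand-ins for the CNN and the LSTM, the `<start>`/`<end>` tokens are assumed, and greedy word-by-word decoding is one common choice rather than necessarily the one used here.

```python
from typing import Callable, List, Sequence

# Hypothetical special tokens marking the caption boundaries.
START, END = "<start>", "<end>"

Image = Sequence[Sequence[float]]


def toy_encoder(image: Image) -> List[float]:
    """Stand-in for the CNN encoder: one mean value per image row."""
    return [sum(row) / len(row) for row in image]


def toy_decoder_step(features: List[float], words: List[str]) -> str:
    """Stand-in for one LSTM decoder step.

    A real decoder would score the vocabulary given the image features
    and the words generated so far; this one replays a canned caption.
    """
    canned = ["a", "dog", "runs", END]
    return canned[min(len(words) - 1, len(canned) - 1)]


def generate_caption(
    image: Image,
    encoder: Callable[[Image], List[float]] = toy_encoder,
    decoder_step: Callable[[List[float], List[str]], str] = toy_decoder_step,
    max_len: int = 20,
) -> str:
    """Encode the image once, then emit one word per decoder step."""
    features = encoder(image)
    words = [START]
    for _ in range(max_len):
        next_word = decoder_step(features, words)
        if next_word == END:
            break  # the decoder signalled the end of the caption
        words.append(next_word)
    return " ".join(words[1:])  # drop the start token


if __name__ == "__main__":
    print(generate_caption([[0.0, 1.0], [1.0, 1.0]]))
```

The point of the sketch is the data flow: the image is encoded a single time, and the decoder is called repeatedly with those features plus the words produced so far, until it emits the end token or reaches `max_len`.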
The datasets can be found here:
flickr_8k: https://www.kaggle.com/datasets/waelboussbat/flickr8ksau
flickr_30k: https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset
More details can be found in the report and the presentation.