December 2016 - This project and its materials were built with three other excellent teammates: Safa Messaoud, Ankit Rai, and Tarek Elgamal.
In this project, we built a pipeline that consists of two models: the first, Show and Tell (Im2Txt) [1], generates a caption from a given image; the second, Txt2Im [2], produces an image from a given caption. In addition to implementing and training these two models independently, we combined them, which let us compare a given image with the pipeline-generated one and comment on how much information a caption of just a few words actually conveys.
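As a rough sketch of how the two stages fit together: the model callables, function names, and the pixel-wise comparison below are illustrative assumptions, not the project's actual code, which only compared the two images qualitatively.

```python
import numpy as np

def reconstruct_and_compare(image, im2txt_model, txt2im_model):
    """Run an image through the two-stage pipeline and compare the result.

    `im2txt_model` maps an image array to a caption string, and
    `txt2im_model` maps a caption string back to an image array.
    Both are assumed to be already-trained callables, and both images
    are assumed to be numpy arrays of the same shape.
    """
    caption = im2txt_model(image)           # stage 1: image -> caption
    reconstruction = txt2im_model(caption)  # stage 2: caption -> image

    # A crude pixel-wise metric, just to make the comparison concrete.
    mse = float(np.mean((image.astype(np.float32)
                         - reconstruction.astype(np.float32)) ** 2))
    return caption, reconstruction, mse
```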
With the resources available to us, a perfect match between these two images is not to be expected; if not outright impossible, such a reconstruction would require far more computational power, data, and modeling effort than we had. Still, the futuristic idea of perfect reconstruction is very exciting. Imagine storing or transporting images as text, taking up a tiny fraction of the memory or bandwidth, and then generating them from those "captions" as needed.
There have been vast improvements in machine translation with the use of encoder-decoder RNNs: an encoder RNN converts a sentence (or phrase) in the source language into a rich, fixed-length vector representation, and a decoder RNN then generates the corresponding sentence (or phrase) in the target language from that vector. In the Show-and-Tell caption-generating model, the encoder RNN is replaced with a CNN, and an image is used as the input to be encoded instead of a sentence. The CNN is pre-trained on an image classification task, and its last hidden layer, a rich representation of the input image, is fed into the decoder RNN to generate captions.
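A minimal PyTorch-style sketch of this encoder-decoder structure, with illustrative layer sizes; the actual im2txt implementation is in TensorFlow and uses an Inception v3 encoder, while a ResNet-18 stands in here for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ShowAndTellSketch(nn.Module):
    """CNN encoder + LSTM decoder, in the spirit of Show and Tell [1]."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Pre-trained CNN; its classification head is replaced so the
        # last hidden layer acts as the image embedding.
        cnn = models.resnet18(weights="IMAGENET1K_V1")
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn

        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.vocab_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image embedding is fed as the first "word" of the sequence,
        # followed by the embedded caption tokens (teacher forcing).
        img_emb = self.encoder(images).unsqueeze(1)     # (B, 1, E)
        word_emb = self.embed(captions)                 # (B, T, E)
        inputs = torch.cat([img_emb, word_emb], dim=1)  # (B, T+1, E)
        hidden, _ = self.decoder(inputs)
        return self.vocab_proj(hidden)                  # (B, T+1, V)
```

The key design choice is that the image embedding enters the decoder only once, as its first input, and the LSTM then predicts the caption one word at a time.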
RNNs (LSTMs, to be more specific) can generate new sequences that are representative of the sequences they were trained on. In many applications these training and generated sequences are text; in others they can be images. If an image is represented as a sequence of patches drawn one after another, a generative LSTM can draw that sequence of patches conditioned on the sequence of words in the caption, as proposed in the alignDRAW model of [2].
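alignDRAW itself is a variational model that attends over the caption words at every drawing step; the sketch below strips it down to the bare idea of a generative LSTM emitting one patch per step, conditioned on a caption encoding. All names and sizes are illustrative, not the model of [2].

```python
import torch
import torch.nn as nn

class PatchDrawerSketch(nn.Module):
    """Drastically simplified caption-to-image drawer."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256,
                 patch_size=8, patches_per_side=4):
        super().__init__()
        self.patch_size = patch_size
        self.per_side = patches_per_side
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.caption_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.drawer = nn.LSTMCell(hidden_dim, hidden_dim)
        self.to_patch = nn.Linear(hidden_dim, 3 * patch_size * patch_size)

    def forward(self, captions):
        # Summarize the caption with the encoder's final hidden state.
        _, (h_cap, _) = self.caption_rnn(self.embed(captions))
        h_cap = h_cap.squeeze(0)                       # (B, hidden_dim)
        h, c = torch.zeros_like(h_cap), torch.zeros_like(h_cap)

        # Emit one patch per step of the generative LSTM.
        patches = []
        for _ in range(self.per_side ** 2):
            h, c = self.drawer(h_cap, (h, c))
            patches.append(self.to_patch(h).view(-1, 3, self.patch_size,
                                                 self.patch_size))

        # Tile the patch sequence into a square image.
        rows = [torch.cat(patches[i * self.per_side:(i + 1) * self.per_side],
                          dim=3) for i in range(self.per_side)]
        return torch.sigmoid(torch.cat(rows, dim=2))
```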
In this project, we combined these two models end-to-end and trained them on the Flickr8k dataset. Since both models in the original papers were trained on the MSCOCO dataset, we had to change the structure of the networks and train them from scratch (on top of the pre-trained CNN) on Flickr8k. We trained the two models separately and independently, but on the same dataset, and then fed the caption generated by the first (caption-generating) model to the second (image-drawing) model.
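A sketch of what that re-training could look like, assuming the caption model sketched above, a Flickr8k data loader yielding (image, token-id) batches, and padding id 0; all of these are assumptions, not the project's actual training code.

```python
import torch
import torch.nn as nn

def train_on_flickr8k(model, loader, epochs=10, lr=1e-3):
    # Keep the pre-trained CNN weights fixed and train the new layers
    # (embedding, LSTM decoder, vocabulary projection) from scratch.
    for p in model.encoder.parameters():
        p.requires_grad = False

    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=0)  # 0 = padding id (assumed)

    for _ in range(epochs):
        for images, captions in loader:
            # Teacher forcing: predict each caption token from the step
            # before it (the first step sees only the image embedding).
            logits = model(images, captions[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

Freezing the CNN keeps the rich ImageNet features intact while the much smaller Flickr8k dataset only has to fit the language side of the model.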
Some captions generated by the im2txt model trained on the Flickr8k dataset. Top: captions with high confidence scores (good captions); bottom: captions with low confidence scores (bad captions).
Some example results of the pipeline. Left: images fed into the trained im2txt network; middle: captions generated by im2txt; right: images generated by the (not-so-well-trained) txt2im network.
Although the networks were trained on a relatively small dataset (Flickr8k, with 8k images) for a relatively short time (especially the generative network, which was trained for just 6 epochs), the results are still somewhat related to the inputs. The generative network seems to pick up the color cues. In the first example, the generated images have a mostly blue background, which probably corresponds to "water" in the caption, with dark spots in the center, which probably correspond to the "brown dog". In the second, similarly, most of the generated images have a greenish background with dark blobs around the center.
There are a few things that could have been done differently if time (and computational power) had allowed:
- We trained the networks independently. This prevented the forward and backward passes from flowing from one end of the pipeline to the other, and hence prevented joint optimization. An interesting approach would be to train the networks jointly by backpropagating a loss between the input and the output images (see the sketch after this list).
- We fed the caption generated by one network as the input to the other. Instead of feeding the caption itself, we could add a layer of abstraction, e.g. feed the hidden states of the first network into the generative RNN of the second network. It would be interesting to see the trade-off between the level of abstraction and performance in terms of image relevance.
- And of course, training for a longer time, preferably on a larger dataset.
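As referenced in the first bullet, here is a sketch of what a joint training step could look like. Sampled captions are discrete and block gradients, so the sketch passes soft word distributions from the caption model into the image generator's embedding instead; the `caption_logits` and `draw_from_embeddings` interfaces are hypothetical, and this is not something the project implemented.

```python
import torch
import torch.nn as nn

def joint_step(im2txt_model, txt2im_model, images, optimizer):
    """One hypothetical joint-training step over the whole pipeline."""
    logits = im2txt_model.caption_logits(images)       # (B, T, V), assumed API
    soft_words = torch.softmax(logits, dim=-1)         # relaxed "caption"

    # Expected word embeddings instead of embeddings of sampled tokens,
    # so gradients can flow from the reconstruction back to the captioner.
    soft_emb = soft_words @ txt2im_model.embed.weight  # (B, T, E)
    recon = txt2im_model.draw_from_embeddings(soft_emb)  # assumed API

    # Assumes the generator outputs images at the input resolution.
    loss = nn.functional.mse_loss(recon, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```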
More details of the project can be found in this presentation and in this report.
- [1] Vinyals, O., Toshev, A., Bengio, S. and Erhan, D. "Show and Tell: A Neural Image Caption Generator." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156-3164, 2015.
- [2] Mansimov, E., Parisotto, E., Ba, J.L. and Salakhutdinov, R. "Generating Images from Captions with Attention." arXiv:1511.02793, 2015.
- [3] Chen, X. and Zitnick, C.L. "Learning a Recurrent Visual Representation for Image Caption Generation." arXiv:1411.5654, 2014.
- [4] Bahdanau, D., Cho, K. and Bengio, Y. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473, 2014.
- [5] Xu, K., et al. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv:1502.03044, 2015.
- [6] Karpathy, A. and Fei-Fei, L. "Deep Visual-Semantic Alignments for Generating Image Descriptions." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.