Generating Tweet-Like Text from Images. Where we are…and where we need to be.
Advancements in tasks like Image Captioning have largely leveraged the visual-linguistic grounding of the image-text pair. This includes either generating a constrained textual description of the objects and entities present within the image, or generating a detailed, paragraph-length description of these entities. But there is still a long way to go towards generating text that is not only semantically richer, but also incorporates real-world knowledge. This post briefly explores image2tweet generation through the lens of existing image-captioning approaches.
- Image to caption is well explored.
- Next significant step: generating semantically rich text from images.
- An attempt to explore the generation of tweets from images.
Objective: To briefly present the utility of different image-captioning approaches for the task of Image to Tweet generation.
How exactly is this different from conventional captioning?
The goal is to generate a specialized piece of text, a tweet, that is not a direct result of the visual-linguistic grounding usually leveraged in similar tasks. Instead, it conveys a message that factors in not only the visual content of the image (Eg. 1 and 2 in the above image), but also additional real-world contextual information associated with the event depicted within the image, as closely as possible, as in Eg. 3 above.
It is crucial to assess the complexity of the problem given the inclusion of two different modalities, image and text. One way to do this is to visualize the category-wise projection of the data-set into a suitable sub-space. Towards this, PCA and UMAP representations are obtained for both the images and the text (tweets) from our data-set, and compared visually with similar projections from two reference data-sets: MNIST, an image data-set of handwritten digits 0–9 with a training set of 60,000 examples and a test set of 10,000 examples, and 20 Newsgroups, a collection of ~18,000 newsgroup documents from 20 different newsgroups.
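The projection step itself is straightforward; below is a minimal sketch of a 3D PCA projection via SVD in plain NumPy. The data here is random and only illustrates the shapes involved (flattened images or bag-of-words vectors for tweets) — it is not the project's actual pipeline:

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project rows of X onto the top principal components via SVD.

    X: (n_samples, n_features) array, e.g. flattened images or
    bag-of-words vectors for the tweets.
    """
    X_centered = X - X.mean(axis=0)          # PCA requires centered data
    # Economy SVD: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # (n_samples, n_components)

# Toy usage: 100 "samples" of 64 features projected to 3D for plotting
rng = np.random.default_rng(0)
points_3d = pca_project(rng.normal(size=(100, 64)), n_components=3)
```

The resulting 3D points are what get scatter-plotted per category; UMAP follows the same pattern but through the `umap-learn` library's non-linear embedding instead of a linear projection.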
Images: As can be observed from figure (a) above, the 3D PCA-based projection of the MNIST data-set is clearly segregated, with the different classes occupying distinct regions of the 3D projection space. In contrast, the images from our data-set, which share significant commonality in terms of the visual objects and entities present, can be observed (in figure (b) above) to overlap significantly in the 3D projection space of the principal components. This clearly indicates the complexity involved in modeling associations over an image data-set this diverse. It also suggests that CNN-based features could help model the common yet diverse set of concepts in these images better than a dimensionality-reduction technique such as PCA.
Text: When the text from the two data-sets is compared using UMAP, the 2D projection of the 20 Newsgroups data-set, as observed in figure (a) above, can be distinguished to a reasonable extent. Similarly, a distinction can also be made amongst the tweet categories of our data-set, as seen in figure (b) above.
Approaches Explored for Image2Tweet Generation
- Encoder-Decoder based approach
InceptionV3 for image feature extraction.
200-dimensional GloVe word embeddings.
LSTM-based decoder.
- Visual Attention based approach
InceptionV3 for image feature extraction.
GRU-based decoder.
Based on Bahdanau Attention.
- Image captioning using Transformers
Based on the standard Transformer architecture (6–6 Enc-Dec layers).
Image features fed to the decoder via the Enc-Dec-Attention module.
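To make the visual-attention approach concrete, here is a minimal NumPy sketch of Bahdanau (additive) attention over image feature locations, as used between the InceptionV3 feature map and the GRU decoder state. The dimensions and randomly initialized weight matrices are illustrative placeholders, not the trained model:

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive attention over spatial image features.

    features: (L, D) -- L image locations (e.g. the 8x8=64 positions of
                        an InceptionV3 feature map), D feature dims
    hidden:   (H,)   -- current decoder (GRU) hidden state
    Returns the context vector (D,) and attention weights (L,).
    """
    # score_i = v^T tanh(W1 @ features_i + W2 @ hidden)
    scores = np.tanh(features @ W1.T + hidden @ W2.T) @ v   # (L,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                # softmax over locations
    context = weights @ features                            # weighted sum -> (D,)
    return context, weights

# Toy usage with random weights (shapes are illustrative only)
rng = np.random.default_rng(1)
L_, D_, H_, A_ = 64, 256, 128, 32    # locations, feature dim, hidden dim, attention dim
feats = rng.normal(size=(L_, D_))
h = rng.normal(size=H_)
W1 = rng.normal(size=(A_, D_)) * 0.1
W2 = rng.normal(size=(A_, H_)) * 0.1
v = rng.normal(size=A_)
context, weights = bahdanau_attention(feats, h, W1, W2, v)
```

At each decoding step the context vector is concatenated with the word embedding before being fed to the GRU, which is what lets the decoder attend to different image regions per generated word.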
Standard training procedures are adopted as prescribed by the respective works. In most cases, a static learning rate of 0.001 with the Adam optimizer is employed. The number of epochs that works well ranges between 20 and 35, depending upon the architecture (approach) being used.
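As a rough summary, the shared training setup can be sketched as the following configuration (the key names are illustrative, and the epoch count varies per approach as noted above):

```python
# Illustrative training configuration shared across the approaches
train_config = {
    "optimizer": "adam",
    "learning_rate": 1e-3,   # static, no decay schedule
    "epochs": (20, 35),      # observed working range, architecture-dependent
}
```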
The scores obtained for the Encoder-Decoder with Attention based approach are observed to be the best in the current setting. The qualitative analysis for the EDwA and Transformer based approaches is depicted below.
The BLEU−1/4 scores obtained for this approach, 0.37 and 0.15 respectively, are the highest amongst the set of approaches involving typical image-text encoding-decoding frameworks evaluated. Although the domain (category) specific word prediction is observed to be much better compared to the other approaches, there are significant syntactic inconsistencies in the sentence structures being generated. A few examples of the generated sentences using this approach can be observed in the figure above.
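For reference, BLEU−1 is a modified (clipped) unigram precision multiplied by a brevity penalty. A minimal single-reference sketch in plain Python, just to illustrate what the metric measures (evaluation toolkits compute the corpus-level, multi-reference variant):

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """BLEU-1 for one candidate against one reference sentence."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate unigram count at its count in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

The count clipping is what keeps degenerate repetitive outputs (a failure mode noted below for the Transformer) from scoring well on unigram overlap alone.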
The performance of the Transformer-based approach is not as competitive as the Encoder-Decoder with visual attention, with average BLEU−1/4 scores of 0.29 and 0.10 respectively. The system produced a lot of repetition in the sentences generated at test time, and the predictions overlapped significantly across categories. Although the grammatical structure of the generated sentences was relatively more accurate compared to the other approaches examined, the overall BLEU scores turned out to be lower. These can be observed from the 3 examples shown in the figure above.
There is a wide array of complexities observed in the task of generating twitter-style sentences (tweets) from a given image. Firstly, significant information relevant to a corresponding likely tweet is absent from the image itself, and would need to be separately encoded or annotated. Secondly, selective dynamic attention over image locations is observed to improve generation performance, but augmenting real-world information is imperative for such a task.
- Both syntactic and semantic aspects of the generated tweets still need improvement.
- The Encoder-Decoder with Attention based mechanism is observed to perform best in the current setting.
- The Transformer-based approach would require more data, as well as encoder-based training for the image features.
- Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
- Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “Glove: Global vectors for word representation.” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
- Xu, Kelvin, et al. “Show, attend and tell: Neural image caption generation with visual attention.” International conference on machine learning. 2015.
- Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
- Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017): 5998–6008.
This project is done towards fulfillment of the course work requirements for Machine Learning @ IIIT Delhi (Monsoon 2020).
Author: Shivam Sharma
I would like to thank our faculty, Dr. Tanmoy Chakraborty, and TAs: Vivek Reddy, Shiv Kumar Gehlot, Shikha Singh, Pragya Srivastava, Nirav Diwan, Ishita Bajaj, Chhavi Jain and Aanchal Mongia for their efforts towards making this an amazing course. #MachineLearning2020 @IIITDelhi. Also, thanks to Dr. Amitava Das for his guidance on the project.