Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence
Choosing Linguistics over Vision to Describe Images
Ankush Gupta, Yashaswi Verma, C. V. Jawahar
International Institute of Information Technology, Hyderabad, India - 500032
{ankush.gupta@research., yashaswi.verma@research., jawahar@}iiit.ac.in
Abstract
In this paper, we address the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions. Previous attempts at this task mostly rely on visual cues and corpus statistics, but do not take much advantage of the semantic information inherent in the available image descriptions.
Here, we present a generic method which benefits from all three of these sources (i.e., visual cues, corpus statistics, and available descriptions) simultaneously, and is capable of constructing novel descriptions. Our approach works on syntactically and linguistically motivated phrases extracted from the human descriptions.
Experimental evaluations demonstrate that our formulation mostly generates lucid and semantically correct descriptions, and significantly outperforms previous methods on automatic evaluation metrics. A significant advantage of our approach is that it can generate multiple interesting descriptions for an image.
Unlike any previous work, we also test the applicability of our method on a large dataset containing complex images with rich descriptions.
Figure 1: Descriptions generated by four different approaches for an example image from the UIUC Pascal sentence dataset.
§ “This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the road.” (Kulkarni et al. 2011)
§ “The person is showing the bird on the street.” (Yang et al. 2011)
§ “Black women hanging from a black tree. Colored man in the tree.” (Li et al. 2011)
§ “An American eagle is perching on a thick rope.” (Ours)
Introduction
An image can be described either by a set of keywords (Guillaumin et al. 2009; Feng, Manmatha, and Lavrenko 2004; Makadia, Pavlovic, and Kumar 2010), or by a higher-level structure such as a phrase (Sadeghi and Farhadi 2011) or sentence (Aker and Gaizauskas 2010). Previous approaches mostly rely on object detectors/classifiers and corpus statistics, but do not utilize the semantic information encoded in available descriptions of images. Either they use these descriptions to restrict the set of prepositions/verbs (Kulkarni et al. 2011; Yang et al. 2011; Li et al. 2011; Yao et al. 2008), or they pick one or more complete sentences and transfer them unaltered to a test image (Farhadi et al. 2010; Ordonez, Kulkarni, and Berg 2011). While the former may result in quite verbose and non-humanlike descriptions, in the latter it is very unlikely that a retrieved sentence would be as descriptive of a particular image as a generated one (Kulkarni et al. 2011). This is because a retrieved sentence is constrained in terms of objects, attributes and spatial relationships between objects, whereas a generated sentence can more closely capture the semantics relevant to a given image (Figure 1).
Image descriptions not only contain information about the different objects present in an image, but also tell about their states and spatial relationships. Even for complex images, this information can be extracted easily, thus helping to bridge the gap between visual perception and semantic grounding.
With this motivation, we present a generative approach that emphasizes textual information rather than relying solely on computer vision techniques. Instead of using object detectors, we estimate the content of a new image based on its similarity with available images. To minimize the impact of noisy and uncertain visual inputs, we extract syntactically motivated patterns from the known descriptions and use only those for composing new descriptions. Extracting dependency patterns from descriptions rather than us-
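As a concrete illustration of the phrase-extraction idea described above, the following sketch pulls simple syntactically motivated patterns out of a single description with an off-the-shelf dependency parser. It is not the authors' implementation: the use of spaCy, the "en_core_web_sm" model, and the particular pattern types shown (attribute-object, subject-verb, verb-preposition-object) are illustrative assumptions only.

# Illustrative sketch only (not the authors' implementation): extract simple
# syntactically motivated phrases from a description via a dependency parse.
# Assumes spaCy with its small English model ("en_core_web_sm") installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_phrases(description):
    """Return (attribute, object), (object, verb) and (verb, prep, object)
    patterns found in one human-written description."""
    doc = nlp(description)
    phrases = []
    for tok in doc:
        if tok.dep_ == "amod" and tok.head.pos_ == "NOUN":
            # adjectival modifier of a noun, e.g. ("thick", "rope")
            phrases.append((tok.lemma_, tok.head.lemma_))
        elif tok.dep_ == "nsubj" and tok.head.pos_ == "VERB":
            # nominal subject of a verb, e.g. ("eagle", "perch")
            phrases.append((tok.lemma_, tok.head.lemma_))
        elif tok.dep_ == "pobj" and tok.head.dep_ == "prep":
            # object of a preposition, e.g. ("perch", "on", "rope")
            phrases.append((tok.head.head.lemma_, tok.head.lemma_, tok.lemma_))
    return phrases

print(extract_phrases("An American eagle is perching on a thick rope."))
# prints phrases such as ('eagle', 'perch'), ('thick', 'rope') and ('perch', 'on', 'rope')

In the approach outlined above, phrases of this kind, harvested from the training descriptions, serve as the building blocks from which new descriptions are composed.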