Tell-the-difference: Fine-grained Visual Descriptor via a Discriminating Referee

by Shuangjie Xu, et al.

In this paper, we investigate a novel problem: describing the difference between a pair of images in natural language. Compared with single-image captioning, the task is challenging because a linguistic representation must be extracted from two independent sources of visual information. To this end, we propose an effective encoder-decoder captioning framework based on a Hyper Convolution Net. In addition, we introduce a series of novel feature-fusing techniques for combining pairwise visual information, and we propose a discriminating referee to evaluate the pipeline. Because no appropriate dataset exists to support this task, we have collected and annotated a large new dataset with Amazon Mechanical Turk (AMT) for generating captions in a pairwise manner (14764 images and 26710 image pairs in total). It is the first dataset for the relative difference caption task that provides descriptions in free language. We evaluate our model on two datasets in the field, and it outperforms the state-of-the-art approach by a large margin.




Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model

Image Captioning is an arduous task of producing syntactically and seman...

Expressing Visual Relationships via Language

Describing images with text is a fundamental problem in vision-language ...

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

With the maturity of visual detection techniques, we are more ambitious ...

Image to Language Understanding: Captioning approach

Extracting context from visual representations is of utmost importance i...

Bangla language textual image description by hybrid neural network model

Automatic image captioning task in different language is a challenging t...

Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning

Change Captioning is a task that aims to describe the difference between...

Neural Naturalist: Generating Fine-Grained Image Comparisons

We introduce the new Birds-to-Words dataset of 41k sentences describing ...
