Image Captioning with Clause-Focused Metrics in a Multi-Modal Setting for Marketing

05/06/2019
by Philipp Harzig, et al.

Automatically generating descriptive captions for images is a well-researched area in computer vision. However, existing evaluation approaches focus on measuring the similarity between two sentences and disregard the fine-grained semantics of the captions. In our setting of images depicting persons interacting with branded products, the subject, predicate, object, and the name of the branded product are important evaluation criteria for the generated captions. Generating image captions under these constraints is a new challenge, which we tackle in this work. By simultaneously predicting integer-valued ratings that describe attributes of the human-product interaction, we optimize a deep neural network architecture in a multi-task learning setting, which considerably improves the caption quality. Furthermore, we introduce a novel metric that allows us to assess whether the generated captions meet our requirements (i.e., subject, predicate, object, and product name). We describe a series of experiments on caption quality and on how to address annotator disagreements for the image ratings with an approach called soft targets. Finally, we show that our novel clause-focused metrics are applicable to other image captioning datasets, such as the popular MSCOCO dataset.
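
The abstract does not spell out how the soft targets are built. As a minimal sketch of the general idea, assuming each image attribute is rated by several annotators on a small integer scale, the ratings can be converted into an empirical distribution that serves as the training target in place of a single hard label. The function names `soft_target` and `cross_entropy`, the 0-4 scale, and the example ratings are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def soft_target(ratings, num_classes):
    """Convert one image's annotator ratings into a probability distribution.

    Assumption: ratings are integers in [0, num_classes).
    """
    counts = Counter(ratings)
    total = len(ratings)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Hypothetical example: three annotators rate one attribute on a 0-4 scale.
target = soft_target([3, 3, 4], num_classes=5)

# A prediction that matches the annotators' disagreement scores a lower loss
# than a uniform guess over all five classes.
loss_matching = cross_entropy(target, target)
loss_uniform = cross_entropy(target, [0.2] * 5)
```

With a hard label, the minority rating of 4 would be discarded; the soft target keeps it as one third of the probability mass, so the loss rewards predictions that reflect the disagreement.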

research
03/26/2020

Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models

Image captioning models have been able to generate grammatically correct...
research
02/06/2018

Multimodal Image Captioning for Marketing Analysis

Automatically captioning images with natural language sentences is an im...
research
07/31/2020

Evaluating Automatically Generated Phoneme Captions for Images

Image2Speech is the relatively new task of generating a spoken descripti...
research
12/22/2016

Understanding Image and Text Simultaneously: a Dual Vision-Language Machine Comprehension Task

We introduce a new multi-modal task for computer systems, posed as a com...
research
01/14/2019

Image Based Review Text Generation with Emotional Guidance

In the current field of computer vision, automatically generating texts ...
research
11/21/2019

Reinforcing an Image Caption Generator Using Off-Line Human Feedback

Human ratings are currently the most accurate way to assess the quality ...
research
12/24/2020

WEmbSim: A Simple yet Effective Metric for Image Captioning

The area of automatic image caption evaluation is still undergoing inten...
