Beyond Visual Semantics: Exploring the Role of Scene Text in Image Understanding

by Arka Ujjal Dey, et al.

Images with visual and scene text content are ubiquitous in everyday life. However, current image interpretation systems are mostly limited to visual features and neglect the scene text content. In this paper, we propose to jointly use the scene text and visual channels for robust semantic interpretation of images. We undertake the task of matching advertisement images against their human-generated statements, which describe the action that the ad prompts and the rationale it provides for taking this action. We extract the scene text and generate semantic and lexical text representations, which are used in the interpretation of the ad image. To deal with irrelevant or erroneous detection of scene text, we use a text attention scheme. We also learn an embedding of the visual channel, i.e., visual features based on detected symbolism and objects, into a semantic embedding space, leveraging text semantics obtained from scene text. We show how the multi-channel approach, involving visual semantics and scene text, improves upon the current state of the art.
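The abstract does not spell out the attention formulation or the fusion step, so the following is only a minimal sketch of the general idea under stated assumptions: OCR word vectors are weighted by a softmax over their dot product with the visual vector (so that words unrelated to the image content get low weight), the attended text vector is added to the visual vector, and candidate statements are ranked by cosine similarity. The function names `attend_scene_text` and `rank_statements`, the dot-product scoring, and the additive fusion are all hypothetical choices made for illustration, not the authors' method.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_scene_text(visual_vec, word_vecs):
    """Assumed attention: weight each OCR word vector by the softmax of
    its dot product with the visual vector, then take the weighted sum."""
    if not word_vecs:
        # No scene text detected: contribute nothing to the fused vector.
        return [0.0] * len(visual_vec), []
    weights = softmax([dot(visual_vec, w) for w in word_vecs])
    attended = [
        sum(wt * w[i] for wt, w in zip(weights, word_vecs))
        for i in range(len(visual_vec))
    ]
    return attended, weights


def rank_statements(visual_vec, word_vecs, statement_vecs):
    """Return statement indices sorted from best to worst match.
    Fusion by element-wise addition is an assumption of this sketch."""
    text_vec, _ = attend_scene_text(visual_vec, word_vecs)
    fused = [v + t for v, t in zip(visual_vec, text_vec)]
    scores = [cosine(fused, s) for s in statement_vecs]
    return sorted(range(len(statement_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 2-D example: the second OCR word is unrelated to the visual content.
visual = [1.0, 0.0]
words = [[1.0, 0.0], [0.0, 1.0]]
statements = [[0.0, 1.0], [1.0, 0.0]]
ranking = rank_statements(visual, words, statements)
```

In this toy setup the unrelated OCR word receives the smaller attention weight, so the fused vector stays close to the visual vector. In a real system the vectors would come from learned encoders and the attention parameters would be trained rather than fixed.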


VITAL: A Visual Interpretation on Text with Adversarial Learning for Image Labeling

In this paper, we propose a novel way to interpret text information by e...

Don't only Feel Read: Using Scene text to understand advertisements

We propose a framework for automated classification of Advertisement Ima...

Understanding Visual Ads by Aligning Symbols and Objects using Co-Attention

We tackle the problem of understanding visual ads where given an ad imag...

Scene Text Recognition with Image-Text Matching-guided Dictionary

Employing a dictionary can efficiently rectify the deviation between the...

Looking Beyond a Clever Narrative: Visual Context and Attention are Primary Drivers of Affect in Video Advertisements

Emotion evoked by an advertisement plays a key role in influencing brand...

Visual Noise from Natural Scene Statistics Reveals Human Scene Category Representations

Our perceptions are guided both by the bottom-up information entering ou...

Predicting Visual Overlap of Images Through Interpretable Non-Metric Box Embeddings

To what extent are two images picturing the same 3D surfaces? Even when ...
