I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

09/21/2022
by Muhammad Ferjad Naeem, et al.

Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia articles, contain rich visual descriptions of object classes and can therefore serve as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. To distill discriminative visual words from noisy documents, we introduce a new cross-modal attention module that learns fine-grained interactions between image patches and document words. Consequently, our I2DFormer not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to localize visually relevant words in image regions. Quantitatively, we demonstrate that I2DFormer significantly outperforms previous unsupervised semantic embeddings under both the zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method yields highly interpretable results, with document words grounded in image regions.
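To make the abstract's central mechanism concrete, below is a minimal PyTorch sketch of an image-to-document attention module of the kind described: image patch tokens query document word tokens in a shared embedding space, and the resulting patch-word attention map is what allows words to be grounded in image regions. All names here (CrossModalAttention, img_dim, txt_dim, the pooled scoring head, the classify helper) are illustrative assumptions for exposition, not the authors' released code, and the pooling and scoring are deliberately simplified.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Illustrative image-to-document attention (not the authors' code):
    image patch tokens attend over document word tokens in a shared space."""

    def __init__(self, img_dim: int, txt_dim: int, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(img_dim, dim)   # image patches -> queries
        self.k_proj = nn.Linear(txt_dim, dim)   # document words -> keys
        self.v_proj = nn.Linear(txt_dim, dim)   # document words -> values
        self.score = nn.Linear(dim, 1)          # pooled compatibility score
        self.scale = dim ** -0.5

    def forward(self, patches: torch.Tensor, words: torch.Tensor):
        # patches: (B, P, img_dim) features from an image encoder
        # words:   (B, W, txt_dim) features from a document encoder
        q = self.q_proj(patches)                        # (B, P, dim)
        k = self.k_proj(words)                          # (B, W, dim)
        v = self.v_proj(words)                          # (B, W, dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, P, W) patch-word logits
        attn = attn.softmax(dim=-1)                     # each patch distributes over words
        attended = attn @ v                             # (B, P, dim) word-informed patch features
        score = self.score(attended.mean(dim=1))        # (B, 1) image-document compatibility
        return score.squeeze(-1), attn                  # attn localizes words in image regions


def classify(module: CrossModalAttention,
             patches: torch.Tensor,
             class_word_feats: list[torch.Tensor]) -> torch.Tensor:
    """Zero-shot inference sketch: score an image against every unseen-class
    document and predict the class whose document matches best."""
    scores = []
    for words in class_word_feats:                      # one (1, W_c, txt_dim) tensor per class
        s, _ = module(patches, words.expand(patches.size(0), -1, -1))
        scores.append(s)
    return torch.stack(scores, dim=-1).argmax(dim=-1)   # (B,) predicted class indices
```

The key idea visible even in this simplified sketch is that the same attention map that drives the compatibility score doubles as an interpretability tool, indicating which document words each image region attends to.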


Related research

03/20/2022 · VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
Human-annotated attributes serve as powerful semantic embeddings in zero...

05/31/2021 · Pho(SC)Net: An Approach Towards Zero-shot Word Image Recognition in Historical Documents
Annotating words in a historical document image archive for word image r...

11/28/2016 · Gaze Embeddings for Zero-Shot Image Classification
Zero-shot image classification using auxiliary information, such as attr...

12/20/2022 · Precise Zero-Shot Dense Retrieval without Relevance Labels
While dense retrieval has been shown effective and efficient across task...

04/05/2016 · Less is more: zero-shot learning from online textual documents with noise suppression
Classifying a visual concept merely from its associated online textual s...

04/21/2021 · Revisiting Document Representations for Large-Scale Zero-Shot Learning
Zero-shot learning aims to recognize unseen objects using their semantic...

01/11/2023 · EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata
We learn a visual representation that captures information about the cam...
