Better Text Understanding Through Image-To-Text Transfer

by Karol Kurach, et al.

Generic text embeddings are used successfully in a variety of tasks. However, they are often learnt by capturing the co-occurrence structure of pure text corpora, which limits their ability to generalize. In this paper, we explore models that incorporate visual information into the text representation. Based on comprehensive ablation studies, we propose a conceptually simple yet well-performing architecture. It outperforms previous multimodal approaches on a set of well-established benchmarks. We also improve the state-of-the-art results on image-related text datasets, using orders of magnitude less data.
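The abstract describes incorporating visual information into text representations. One common way to do this (a generic sketch, not necessarily the paper's exact architecture) is to align text and image embeddings in a shared space with a hinge ranking loss, so that each text vector sits closer to its paired image than to mismatched images. The embeddings and dimensions below are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Toy batch: 4 text embeddings and their paired image embeddings (8-dim).
text_emb = l2_normalize(rng.normal(size=(4, 8)))
image_emb = l2_normalize(rng.normal(size=(4, 8)))

def triplet_ranking_loss(text_emb, image_emb, margin=0.2):
    """Hinge ranking loss: each text embedding should be more similar to
    its paired image than to any other image in the batch, by `margin`."""
    sims = text_emb @ image_emb.T            # cosine similarity matrix
    pos = np.diag(sims)                      # similarity of matching pairs
    # Margin violations for every mismatched (text, image) pair.
    violations = np.maximum(0.0, margin + sims - pos[:, None])
    np.fill_diagonal(violations, 0.0)        # the positive pair is not a violation
    return float(violations.mean())

loss = triplet_ranking_loss(text_emb, image_emb)
print(loss)
```

Minimizing this loss over paired image-text data pulls the two modalities into a common embedding space; the resulting text encoder can then be evaluated on text-only benchmarks, which is the transfer setting the abstract describes.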


