Multimodal Neurons in Pretrained Text-Only Transformers

08/03/2023
by Sarah Schwettmann et al.

Language models demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
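
The setup described above has two pieces that a short sketch can make concrete: the single linear projection that maps frozen visual-encoder features into the language model's input embedding space, and the decoding step that reads what a candidate neuron writes into the residual stream through the unembedding matrix. The sketch below is a minimal PyTorch illustration of both under assumed dimensions; `VisualProjection`, `decode_concept`, and all sizes are hypothetical names and values for illustration, not the authors' released code.

```python
# Minimal sketch of (1) the trained linear projection bridging a frozen
# vision encoder into a frozen LM, and (2) logit-lens-style decoding of a
# single MLP neuron's output weights. All dimensions and weights here are
# illustrative stand-ins, not the paper's actual models.
import torch
import torch.nn as nn

d_vision, d_model, vocab_size = 1024, 4096, 50400  # hypothetical sizes

class VisualProjection(nn.Module):
    """The only trained component: a linear map from frozen visual patch
    features to a sequence of soft prompts in the LM's embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, n_patches, d_vision) from the frozen encoder
        return self.proj(patch_features)  # (batch, n_patches, d_model)

def decode_concept(w_out: torch.Tensor, W_U: torch.Tensor, k: int = 10):
    """w_out: (d_model,) output weights of one MLP neuron, i.e. the direction
    it adds to the residual stream when it fires; W_U: (d_model, vocab_size)
    unembedding matrix. Returns the k token ids whose logits the neuron most
    strongly promotes, a textual readout of the concept it injects."""
    logits = w_out @ W_U            # (vocab_size,)
    return logits.topk(k).indices

# Toy usage with random weights standing in for real frozen models.
proj = VisualProjection()
soft_prompts = proj(torch.randn(2, 196, d_vision))  # e.g. 196 ViT patches
W_out = torch.randn(16384, d_model)  # second MLP matrix of one LM layer
W_U = torch.randn(d_model, vocab_size)
print(decode_concept(W_out[123], W_U))  # top tokens promoted by neuron 123
```

In the paper's experiments, a neuron qualifies as multimodal when this kind of token-space readout matches the visual content that activates it, and ablating such neurons measurably changes the model's image captions.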

Related research

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning (11/19/2021)
In this paper, we propose a single UniFied transfOrmer (UFO), which is c...

Semi-supervised Multimodal Representation Learning through a Global Workspace (06/27/2023)
Recent deep learning models can efficiently combine inputs from differen...

Rosetta Neurons: Mining the Common Units in a Model Zoo (06/15/2023)
Do different neural networks, trained for various vision tasks, share so...

MoMo: A shared encoder Model for text, image and multi-Modal representations (04/11/2023)
We propose a self-supervised shared encoder model that achieves strong r...

VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations (08/18/2022)
We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT ...

UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers (01/31/2023)
Real-world data contains a vast amount of multimodal information, among ...

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA (11/14/2019)
Many visual scenes contain text that carries crucial information, and it...
