ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

by Xinyu Wang, et al.

Multi-modal Named Entity Recognition (MNER) has recently attracted a lot of attention. Most existing work uses image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, such interactions are difficult to model because image and text representations are trained separately, each on data of its own modality, and are not aligned in the same space. Since text representations play the most important role in MNER, in this paper we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image with the text locally and globally, using regional object tags and image-level captions as visual contexts. It then concatenates these visual contexts with the input text to form a new cross-modal input and feeds that input into a pretrained textual embedding model. Because both modalities are now represented in the textual space, the attention module of the pretrained textual embedding model can model the interaction between them more easily. ITA further aligns the output distributions predicted from the cross-modal input view and the text-only input view, so that the MNER model is more practical and more robust to noise from images. In our experiments, we show that ITA models achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.
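The two steps described in the abstract, building a cross-modal input from text-form visual contexts and aligning the output distributions of the two input views, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `[SEP]` separator, whitespace tokenization of captions, and the choice of KL divergence (and its direction) as the alignment measure are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from typing import List, Sequence


def build_cross_modal_input(
    text_tokens: Sequence[str],
    object_tags: Sequence[str],
    captions: Sequence[str],
    sep: str = "[SEP]",  # assumed separator; the abstract does not name one
) -> List[str]:
    """Concatenate the input text with visual contexts expressed as text.

    object_tags stand in for the local (region-level) alignment and
    captions for the global (image-level) alignment.
    """
    visual_context = list(object_tags)
    for caption in captions:
        visual_context.extend(caption.split())  # naive whitespace tokenization
    return list(text_tokens) + [sep] + visual_context


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's label logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def alignment_loss(
    cross_modal_logits: Sequence[Sequence[float]],
    text_logits: Sequence[Sequence[float]],
) -> float:
    """Average per-token divergence between the two views' label distributions.

    Assumption: KL(cross-modal || text-only) is used here; the abstract only
    says the output distributions of the two views are aligned.
    """
    per_token = [
        kl_divergence(softmax(c), softmax(t))
        for c, t in zip(cross_modal_logits, text_logits)
    ]
    return sum(per_token) / len(per_token)


# Hypothetical example data, not taken from the paper.
tokens = build_cross_modal_input(
    ["Messi", "wins"], ["person", "ball"], ["a man kicking a ball"]
)
same = alignment_loss([[2.0, 0.5, 0.1]], [[2.0, 0.5, 0.1]])
diff = alignment_loss([[2.0, 0.5, 0.1]], [[0.1, 0.5, 2.0]])
```

In this sketch the loss is zero when both views predict the same label distribution for every token and grows as they diverge, which is the behavior one would want from a term that makes the text-only view usable when the image is missing or noisy.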




