PV2TEA: Patching Visual Modality to Textual-Established Information Extraction

by Hejie Cui, et al.
Emory University
University of California, San Diego

Information extraction, e.g., attribute value extraction, has been extensively studied and formulated based only on text. However, many attributes can benefit from image-based extraction, such as color, shape, and pattern. The visual modality has long been underutilized, mainly due to the difficulty of multimodal annotation. In this paper, we aim to patch the visual modality to the textual-established attribute information extractor. The cross-modality integration faces several unique challenges: (C1) images and textual descriptions are loosely paired intra-sample and inter-sample; (C2) images often contain rich backgrounds that can mislead the prediction; (C3) weakly supervised labels from the textual-established extractor are biased for multimodal training. We present PV2TEA, an encoder-decoder architecture equipped with three bias reduction schemes: (S1) augmented label-smoothed contrast to improve cross-modality alignment for loosely paired images and text; (S2) attention pruning that adaptively distinguishes the visual foreground; (S3) two-level neighborhood regularization that mitigates the textual label bias via reliability estimation. Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% F1 increase over unimodal baselines.
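To illustrate the intuition behind scheme (S1), the sketch below shows a symmetric image-text contrastive loss whose one-hot alignment targets are softened with label smoothing, so loosely paired images and text are not forced into hard one-to-one alignment. This is a minimal illustration, not the paper's implementation: the function name, the smoothing scheme (uniform mass over non-matching pairs), and the temperature value are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def label_smoothed_contrastive_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    """Symmetric InfoNCE-style loss with smoothed targets (illustrative sketch).

    Instead of hard one-hot targets on the matching diagonal, each row's
    target puts (1 - smoothing) on the paired item and spreads `smoothing`
    uniformly over the other items, softening the alignment objective for
    loosely paired image-text data.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix

    n = logits.shape[0]
    targets = np.full((n, n), smoothing / (n - 1))
    np.fill_diagonal(targets, 1.0 - smoothing)  # soft labels, not hard one-hot

    # Cross-entropy in both directions: image-to-text and text-to-image.
    log_p_i2t = np.log(softmax(logits, axis=1))
    log_p_t2i = np.log(softmax(logits.T, axis=1))
    loss = -0.5 * ((targets * log_p_i2t).sum(axis=1)
                   + (targets * log_p_t2i).sum(axis=1))
    return loss.mean()
```

With perfectly aligned embeddings the loss is small but nonzero (the smoothing term keeps it from collapsing to zero), and it grows as pairs become mismatched; the soft targets are what prevent over-penalizing imperfect pairings.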




