PV2TEA: Patching Visual Modality to Textual-Established Information Extraction

by Hejie Cui, et al.
Emory University
University of California, San Diego

Information extraction, e.g., attribute value extraction, has been extensively studied and formulated based only on text. However, many attributes can benefit from image-based extraction, such as color, shape, and pattern. The visual modality has long been underutilized, mainly due to the difficulty of multimodal annotation. In this paper, we aim to patch the visual modality to the textual-established attribute information extractor. The cross-modality integration faces several unique challenges: (C1) images and textual descriptions are loosely paired intra-sample and inter-sample; (C2) images often contain rich backgrounds that can mislead the prediction; (C3) weakly supervised labels from the textual-established extractor are biased for multimodal training. We present PV2TEA, an encoder-decoder architecture equipped with three bias reduction schemes: (S1) augmented label-smoothed contrast to improve cross-modality alignment for loosely paired images and text; (S2) attention pruning that adaptively distinguishes the visual foreground; (S3) two-level neighborhood regularization that mitigates the textual label bias via reliability estimation. Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% F1 increase over unimodal baselines.
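To illustrate the intuition behind scheme (S1), the sketch below shows a symmetric image-text contrastive loss whose one-hot alignment targets are softened with label smoothing, so loosely paired images and text are not forced into hard one-to-one alignment. This is a minimal illustration, not the paper's implementation: the function name, the smoothing scheme (uniform mass over non-matching pairs), and the temperature value are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def label_smoothed_contrastive_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    """Symmetric InfoNCE-style loss with smoothed targets (illustrative sketch).

    Instead of hard one-hot targets on the matching diagonal, each row's
    target puts (1 - smoothing) on the paired item and spreads `smoothing`
    uniformly over the other items, softening the alignment objective for
    loosely paired image-text data.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix

    n = logits.shape[0]
    targets = np.full((n, n), smoothing / (n - 1))
    np.fill_diagonal(targets, 1.0 - smoothing)  # soft labels, not hard one-hot

    # Cross-entropy in both directions: image-to-text and text-to-image.
    log_p_i2t = np.log(softmax(logits, axis=1))
    log_p_t2i = np.log(softmax(logits.T, axis=1))
    loss = -0.5 * ((targets * log_p_i2t).sum(axis=1)
                   + (targets * log_p_t2i).sum(axis=1))
    return loss.mean()
```

With perfectly aligned embeddings the loss is small but nonzero (the smoothing term keeps it from collapsing to zero), and it grows as pairs become mismatched; the soft targets are what prevent over-penalizing imperfect pairings.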




