Fusion Models for Improved Visual Captioning

by Marimuthu Kalimuthu et al.

Visual captioning aims to generate textual descriptions for images. Traditionally, captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models and makes them prone to errors. Language models, however, can be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have proven effective for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of this paper is two-fold: First, we propose a generic multimodal model fusion framework for caption generation and emendation, in which we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within traditional encoder-decoder visual captioning frameworks. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MS-COCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis of the emended captions and identify error categories based on the type of corrections.
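As a concrete illustration of one family of fusion strategies mentioned above (a sketch, not the paper's exact formulation), a simple shallow/late fusion combines the captioning decoder's next-token distribution with the auxiliary language model's distribution at each decoding step. The interpolation weight `lam` and the toy score vectors below are hypothetical:

```python
import math

def log_softmax(scores):
    # Numerically stable log-softmax over a list of raw scores.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def late_fuse(caption_logits, auxlm_logits, lam=0.3):
    """Shallow/late fusion sketch: interpolate the log-probabilities of the
    visual captioning decoder and a pretrained auxiliary LM (AuxLM)."""
    dec = log_softmax(caption_logits)
    lm = log_softmax(auxlm_logits)
    return [(1.0 - lam) * d + lam * l for d, l in zip(dec, lm)]

# Toy 5-token vocabulary: the decoder favors token 0, the AuxLM token 1.
dec_scores = [2.0, 0.5, 0.1, -1.0, 0.0]
lm_scores = [0.0, 3.0, 0.2, -0.5, 0.1]
fused = late_fuse(dec_scores, lm_scores, lam=0.3)
next_token = max(range(len(fused)), key=lambda i: fused[i])
```

Deeper fusion variants would instead combine the hidden states of the two models before the output projection, but the same interpolation idea applies.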

