Relational Reasoning using Prior Knowledge for Visual Captioning

by Jingyi Hou, et al.

Exploiting relationships among objects has led to remarkable progress in describing images and videos in natural language. Most existing methods first detect objects and their relationships and then generate textual descriptions. This pipeline depends heavily on pre-trained detectors, and its performance drops under heavy occlusion, with tiny objects, and with the long-tail problem in object detection. In addition, separating detection from captioning causes semantic inconsistency between the pre-defined object/relation categories and the target lexical words. We exploit prior human commonsense knowledge to reason about relationships between objects without any pre-trained detectors and to achieve semantic coherence within one image or video in captioning. The prior knowledge (e.g., in the form of a knowledge graph) provides commonsense semantic correlations and constraints between objects that are not explicit in the image or video, serving as useful guidance for building a semantic graph for sentence generation. In particular, we present a joint reasoning method that incorporates 1) commonsense reasoning, which embeds image or video regions into a semantic space to build a semantic graph, and 2) relational reasoning, which encodes the semantic graph to generate sentences. Extensive experiments on the MS-COCO image captioning benchmark and the MSVD video captioning benchmark validate the superiority of our method in leveraging prior commonsense knowledge to enhance relational reasoning for visual captioning.


Exploring Explicit and Implicit Visual Relationships for Image Captioning

Image captioning is one of the most challenging tasks in AI, which aims ...

Hybrid Knowledge Routed Modules for Large-scale Object Detection

The dominant object detection approaches treat the recognition of each r...

Classification by Attention: Scene Graph Classification with Prior Knowledge

A main challenge in scene graph classification is that the appearance of...

A-CAP: Anticipation Captioning with Commonsense Knowledge

Humans possess the capacity to reason about the future based on a sparse...

Hybrid Reasoning Network for Video-based Commonsense Captioning

The task of video-based commonsense captioning aims to generate event-wi...

Explainable Video Action Reasoning via Prior Knowledge and State Transitions

Human action analysis and understanding in videos is an important and ch...

Video Captioning Using Weak Annotation

Video captioning has shown impressive progress in recent years. One key ...
