VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks

10/07/2020
by Soyeon Caren Han, et al.

Text-to-image multimodal tasks, generating or retrieving an image from a given text description, are extremely challenging because raw text descriptions carry only limited information for fully describing visually realistic images. We propose VICTR, a new visual contextual text representation for text-to-image multimodal tasks that captures rich visual semantic information about the objects in the text input. First, we take the text description as input and apply dependency parsing to extract the syntactic structure and analyse semantic aspects, including object quantities, yielding a scene graph. Then, we train on the extracted objects, attributes, and relations in the scene graph, together with the corresponding geometric relation information, using Graph Convolutional Networks, which produces a text representation that integrates textual and visual semantic information. This representation is aggregated with word-level and sentence-level embeddings to generate both visual contextual word and sentence representations. For evaluation, we attached VICTR to state-of-the-art text-to-image generation models. VICTR is easily added to existing models and improves results in both quantitative and qualitative aspects.
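As an illustration of the graph-convolution step described above, the sketch below runs one GCN layer over a toy scene graph. The graph, node features, and weights are made-up placeholders for illustration only, not the paper's actual model or data; the propagation rule is the standard symmetrically normalised form.

```python
import numpy as np

# Toy scene graph for "a red bird on a tree":
# nodes: 0=bird (object), 1=red (attribute), 2=on (relation), 3=tree (object)
edges = [(1, 0), (0, 2), (2, 3)]  # attribute->object, object->relation->object

n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0  # treat edges as undirected for simplicity
A += np.eye(n)               # add self-loops

# Symmetric normalisation: D^{-1/2} (A + I) D^{-1/2}
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

rng = np.random.default_rng(0)
H = rng.normal(size=(n, 8))   # initial node embeddings (e.g. word vectors)
W = rng.normal(size=(8, 8))   # learnable layer weights (random here)

# One GCN layer: each node aggregates features from itself and its neighbours
H_out = np.maximum(A_hat @ H @ W, 0.0)  # ReLU activation
print(H_out.shape)  # (4, 8)
```

In VICTR these per-node representations would then be pooled and combined with word- and sentence-level embeddings; here we stop at a single untrained layer.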


