Multi-Modal Prototypes for Open-Set Semantic Segmentation

07/05/2023
by Yuhuan Yang, et al.

In semantic segmentation, adapting a visual system to novel object categories at inference time has always been both valuable and challenging. To enable such generalization, existing methods rely either on several support examples as visual cues or on class names as textual cues. Although progress along both lines has been encouraging, the two have been studied in isolation, neglecting the complementary nature of low-level visual and high-level language information. In this paper, we define a unified setting termed open-set semantic segmentation (O3S), which aims to learn seen and unseen semantics from both visual examples and textual names. Our pipeline extracts multi-modal prototypes for the segmentation task, first through single-modal self-enhancement and aggregation, then through multi-modal complementary fusion. Specifically, we aggregate visual features into several tokens as visual prototypes, and enhance the class name with detailed descriptions to generate textual prototypes. The two modalities are then fused into multi-modal prototypes for the final segmentation. We conduct extensive experiments on both and datasets to evaluate the effectiveness of the framework. State-of-the-art results are achieved even on the finer-grained part-segmentation benchmark Pascal-Animals, while training only on coarse-grained datasets. Thorough ablation studies, both quantitative and qualitative, dissect each component.
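To make the data flow described in the abstract concrete, here is a minimal NumPy sketch of the four stages: visual prototypes, textual prototype, fusion, and prototype-based segmentation. It is not the authors' implementation. Every choice below is an assumption made for illustration: cosine k-means stands in for visual token aggregation, plain averaging of (synthetic) name and description embeddings stands in for textual enhancement, linear interpolation stands in for multi-modal fusion, and max cosine similarity per class stands in for the segmentation head. All function names (`visual_prototypes`, `textual_prototype`, `fuse`, `segment`) are hypothetical.

```python
import numpy as np


def l2norm(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def visual_prototypes(feats, mask, num_tokens=2, iters=5):
    """Hypothetical aggregation: cluster masked support features (N, C)
    into `num_tokens` unit-length tokens with a few rounds of cosine k-means."""
    fg = l2norm(feats[mask.astype(bool)])          # keep foreground features only
    idx = np.linspace(0, len(fg) - 1, num_tokens).astype(int)
    tokens = fg[idx].copy()                        # deterministic initialization
    for _ in range(iters):
        assign = np.argmax(fg @ tokens.T, axis=1)  # nearest token by cosine similarity
        for k in range(num_tokens):
            members = fg[assign == k]
            if len(members):
                tokens[k] = members.mean(axis=0)
        tokens = l2norm(tokens)
    return tokens


def textual_prototype(name_emb, desc_embs):
    """Hypothetical enhancement: average the class-name embedding (C,)
    with embeddings of its descriptions (D, C)."""
    return l2norm(np.vstack([name_emb[None], desc_embs]).mean(axis=0))


def fuse(vis_tokens, text_proto, alpha=0.5):
    """Hypothetical fusion: pull each visual token toward the textual prototype."""
    return l2norm((1 - alpha) * vis_tokens + alpha * text_proto[None])


def segment(query_feats, class_protos):
    """Label each query pixel (N, C) with the class whose best-matching
    prototype token has the highest cosine similarity."""
    q = l2norm(query_feats)
    scores = np.stack([(q @ p.T).max(axis=1) for p in class_protos], axis=1)
    return scores.argmax(axis=1)


# Toy demo on synthetic 4-d features: classes 0 and 1 point along e0 and e1.
rng = np.random.default_rng(0)
C = 4
e = np.eye(C)
protos = []
for cls in (0, 1):
    feats = np.vstack([e[cls] + 0.05 * rng.standard_normal((6, C)),  # foreground pixels
                       e[3] + 0.05 * rng.standard_normal((4, C))])   # background pixels
    mask = np.array([1] * 6 + [0] * 4)
    vis = visual_prototypes(feats, mask)
    txt = textual_prototype(e[cls], e[cls] + 0.05 * rng.standard_normal((2, C)))
    protos.append(fuse(vis, txt))

query = np.vstack([e[0] + 0.05 * rng.standard_normal((3, C)),
                   e[1] + 0.05 * rng.standard_normal((3, C))])
labels = segment(query, protos)
print(labels)
```

Keeping several tokens per class, rather than a single averaged prototype, lets one class be represented by multiple visual modes; in this sketch a query pixel only needs to match one of them.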
