SLiMe: Segment Like Me

by Aliasghar Khani et al.

Significant strides have been made using large vision-language models, such as Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these models to segment images at any desired granularity from as few as one annotated sample, and propose SLiMe. SLiMe frames segmentation as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map," from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized so that each of them learns about a single segmented region of the training image. These learned embeddings then highlight the segmented regions in the attention maps, which can in turn be used to derive the segmentation map. As a result, SLiMe can segment any real-world image at inference with the granularity of the segmented regions in the training image, using just one example. Moreover, leveraging additional training data when available, i.e., the few-shot setting, further improves SLiMe's performance. We carried out an extensive set of experiments examining various design factors and showed that SLiMe outperforms existing one-shot and few-shot segmentation methods.
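The core idea — optimizing a text embedding so that its attention map reproduces a training mask — can be illustrated with a toy sketch. Everything here is an illustrative assumption, not SLiMe's actual implementation: the random `feats` tensor stands in for frozen Stable Diffusion UNet features, and a sigmoid-normalized dot-product attention stands in for the paper's cross- and weighted accumulated self-attention maps.

```python
import torch

torch.manual_seed(0)
H = W = 8          # toy image resolution
d = 128            # feature / embedding dimension

# Stand-in for frozen SD feature maps (one d-dim feature per pixel).
feats = torch.randn(H * W, d)

# Toy ground-truth mask: the left half of the image is the target region.
mask = torch.zeros(H, W)
mask[:, : W // 2] = 1.0
mask_flat = mask.flatten()

# Learnable text embedding for a single segment class.
emb = torch.randn(d, requires_grad=True)
opt = torch.optim.Adam([emb], lr=0.1)

def attention_map(e):
    # Scaled dot-product attention between pixel features and the text
    # embedding, squashed to [0, 1] per pixel (a stand-in for the real
    # model's attention normalization).
    return torch.sigmoid(feats @ e / d ** 0.5)

# Optimize the embedding so its attention map matches the mask.
for _ in range(300):
    opt.zero_grad()
    loss = torch.nn.functional.binary_cross_entropy(
        attention_map(emb), mask_flat
    )
    loss.backward()
    opt.step()

# At "inference", the learned embedding highlights the target region.
with torch.no_grad():
    attn = attention_map(emb).reshape(H, W)
inside = attn[:, : W // 2].mean()
outside = attn[:, W // 2 :].mean()
print(f"mean attention inside mask: {inside:.2f}, outside: {outside:.2f}")
```

After optimization, attention concentrates on the masked region, mirroring how SLiMe's learned embeddings highlight their segment in the attention maps; thresholding or arg-maxing such per-class maps on a new image would then yield a segmentation.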



