SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation

by Jiale Cao, et al.

Single-stage instance segmentation approaches have recently gained popularity due to their speed and simplicity, but they still lag behind two-stage methods in accuracy. We propose a fast single-stage instance segmentation method, called SipMask, that preserves instance-specific spatial information by separating the mask prediction of an instance into different sub-regions of a detected bounding box. Our main contribution is a novel lightweight spatial preservation (SP) module that generates a separate set of spatial coefficients for each sub-region within a bounding box, leading to improved mask predictions. It also enables accurate delineation of spatially adjacent instances. Further, we introduce a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection. On COCO test-dev, our SipMask outperforms the existing single-stage methods. Compared to the state-of-the-art single-stage TensorMask, SipMask obtains an absolute gain of 1.0% (mask AP), while providing a four-fold speedup. In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp. We also evaluate our SipMask for real-time video instance segmentation, achieving promising results on the YouTube-VIS dataset. The source code is available at
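To make the idea of per-sub-region spatial coefficients concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a set of shared basis masks, a detected box given as integer pixel coordinates (x0, y0, x1, y1) with exclusive upper bounds, and a k x k grid of coefficient vectors, one per sub-region. The function name `subregion_mask`, the equal-sized grid split, and the sigmoid activation are all illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def subregion_mask(basis, coeffs, box):
    """Assemble one instance mask from per-sub-region coefficients.

    basis:  (m, H, W) array of basis masks shared by all instances (assumed).
    coeffs: (k, k, m) array, one coefficient vector per sub-region (assumed).
    box:    (x0, y0, x1, y1) integer pixel box, upper bounds exclusive (assumed).
    Returns an (H, W) array of mask probabilities, zero outside the box.
    """
    _, height, width = basis.shape
    k = coeffs.shape[0]
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float64)

    # Split the box into a k x k grid of roughly equal sub-regions.
    xs = np.linspace(x0, x1, k + 1).round().astype(int)
    ys = np.linspace(y0, y1, k + 1).round().astype(int)

    for i in range(k):          # grid row (top to bottom)
        for j in range(k):      # grid column (left to right)
            ya, yb = ys[i], ys[i + 1]
            xa, xb = xs[j], xs[j + 1]
            region = basis[:, ya:yb, xa:xb]                      # (m, h, w)
            # Linear combination of basis masks using this
            # sub-region's own coefficient vector.
            logits = np.tensordot(coeffs[i, j], region, axes=1)  # (h, w)
            mask[ya:yb, xa:xb] = sigmoid(logits)
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    basis = rng.normal(size=(8, 32, 32))    # 8 hypothetical basis masks
    coeffs = rng.normal(size=(2, 2, 8))     # 2 x 2 sub-regions
    mask = subregion_mask(basis, coeffs, (4, 6, 20, 26))
    print(mask.shape)
```

With k = 1 this reduces to a single coefficient vector per box; using k > 1 lets each part of the box weight the basis masks differently, which is the intuition behind preserving spatial information within an instance.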




