Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer

03/24/2022
by   Omkar Thawakar, et al.

State-of-the-art transformer-based video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during the attention computations. We argue that such an attention computation ignores the multi-scale spatio-temporal feature relationships that are crucial to tackle target appearance deformations in videos. To address this issue, we propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder. The proposed MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video. We further introduce an attention block in the decoder to enhance the temporal consistency of the detected instances in different frames of a video. Moreover, an auxiliary discriminator is introduced during training to ensure better foreground-background separability within the multi-scale spatio-temporal feature space. We conduct extensive experiments on two benchmarks: Youtube-VIS (2019 and 2021). Our MS-STS VIS achieves state-of-the-art performance on both benchmarks. When using the ResNet50 backbone, our MS-STS VIS achieves a mask AP of 50.1%, outperforming the best reported results in the literature by 2.7%, with gains also at the higher overlap threshold of AP_75, while being comparable in model size and speed on the Youtube-VIS 2019 val. set. When using the Swin Transformer backbone, MS-STS VIS achieves a mask AP of 61.0%. Our code and models are available at https://github.com/OmkarThawakar/MSSTS-VIS.
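
To make the "higher overlap threshold of AP_75" concrete: AP_75 counts a predicted instance as correct only if its mask IoU with the ground truth is at least 0.75, and on video benchmarks such as Youtube-VIS the IoU is accumulated over the whole mask sequence rather than per frame. The following is a minimal sketch of that matching criterion only, not of the full AP computation and not of the authors' evaluation code; the helper names `video_mask_iou` and `matches_at` and the toy masks are assumptions made for illustration.

```python
# Illustrative sketch: `video_mask_iou` and `matches_at` are hypothetical helpers,
# not functions from the MS-STS VIS repository or any evaluation toolkit.

def video_mask_iou(pred, gt):
    """Spatio-temporal IoU between two binary mask sequences.

    `pred` and `gt` are lists of frames; each frame is a list of rows of 0/1 pixels.
    Intersection and union are summed over all frames before dividing.
    """
    inter = 0
    union = 0
    for pred_frame, gt_frame in zip(pred, gt):
        for pred_row, gt_row in zip(pred_frame, gt_frame):
            for p, g in zip(pred_row, gt_row):
                inter += 1 if (p and g) else 0
                union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def matches_at(pred, gt, threshold):
    """True if the prediction counts as a match at the given IoU threshold."""
    return video_mask_iou(pred, gt) >= threshold


# Toy two-frame example (made-up masks, 2x3 pixels per frame).
gt = [
    [[1, 1, 0], [1, 1, 0]],
    [[0, 1, 1], [0, 1, 1]],
]
pred = [
    [[1, 0, 0], [1, 0, 0]],  # covers half of the object in frame 1
    [[0, 1, 0], [0, 1, 1]],  # misses one object pixel in frame 2
]

iou = video_mask_iou(pred, gt)
print(iou)                         # sequence-level IoU
print(matches_at(pred, gt, 0.50))  # match at the looser threshold
print(matches_at(pred, gt, 0.75))  # match at the stricter AP_75 threshold
```

A prediction like this one is accepted at an IoU threshold of 0.5 but rejected at 0.75, which is why improvements at AP_75 indicate tighter mask quality rather than just more detections.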

research
03/12/2022

Deformable VisTR: Spatio temporal deformable attention for video instance segmentation

Video instance segmentation (VIS) task requires classifying, segmenting,...
research
01/20/2023

Towards Robust Video Instance Segmentation with Temporal-Aware Transformer

Most existing transformer based video instance segmentation methods extr...
research
10/07/2022

Time-Space Transformers for Video Panoptic Segmentation

We propose a novel solution for the task of video panoptic segmentation,...
research
03/21/2023

3D Mitochondria Instance Segmentation with Spatio-Temporal Transformers

Accurate 3D mitochondria instance segmentation in electron microscopy (E...
research
07/22/2022

DeVIS: Making Deformable Transformers Work for Video Instance Segmentation

Video Instance Segmentation (VIS) jointly tackles multi-object detection...
research
02/22/2023

Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation

This paper presents a deep learning framework for medical video segmenta...
research
04/03/2023

Video Instance Segmentation in an Open-World

Existing video instance segmentation (VIS) approaches generally follow a...
