Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification

04/27/2023
by Xuehu Liu, et al.

Advanced deep Convolutional Neural Networks (CNNs) have shown great success in video-based person Re-Identification (Re-ID). However, they usually focus on the most obvious regions of persons and have limited global representation ability. Recently, Transformers have been shown to improve performance by exploring inter-patch relations with global observations. In this work, we take advantage of both and propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID. First, we couple CNNs and Transformers to extract two kinds of visual features and experimentally verify their complementarity. Further, in the spatial domain, we propose a Complementary Content Attention (CCA) to take advantage of the coupled structure and guide independent features for spatial complementary learning. In the temporal domain, a Hierarchical Temporal Aggregation (HTA) is proposed to progressively capture the inter-frame dependencies and encode temporal information. In addition, a gated attention is utilized to deliver aggregated temporal information into the CNN and Transformer branches for temporal complementary learning. Finally, we introduce a self-distillation training strategy to transfer the superior spatial-temporal knowledge to the backbone networks for higher accuracy and greater efficiency. In this way, two kinds of typical features from the same videos are integrated mechanically for more informative representations. Extensive experiments on four public Re-ID benchmarks demonstrate that our framework can attain better performance than most state-of-the-art methods.
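The temporal part of the framework can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so every detail here is an assumption made for illustration: the function names, averaging adjacent frames as the hierarchical merge, summing the two branches before aggregation, and the single gate matrix `w_gate` are all hypothetical and are not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def hierarchical_temporal_aggregation(frames):
    """Merge neighbouring frame features level by level until one vector remains.

    frames: array of shape (T, D). Averaging adjacent pairs is an assumption;
    the abstract only states that the aggregation is hierarchical and progressive.
    """
    level = np.asarray(frames, dtype=np.float64)
    while level.shape[0] > 1:
        if level.shape[0] % 2 == 1:
            # Odd length: repeat the last frame so every frame has a partner.
            level = np.concatenate([level, level[-1:]], axis=0)
        level = 0.5 * (level[0::2] + level[1::2])
    return level[0]


def gated_delivery(branch_frames, temporal_feature, w_gate):
    """Add the aggregated temporal feature to one branch through a sigmoid gate.

    branch_frames: (T, D), temporal_feature: (D,), w_gate: (D, D).
    The gate form is hypothetical; the abstract only mentions a gated attention.
    """
    gate = sigmoid(branch_frames @ w_gate)           # (T, D), values in (0, 1)
    return branch_frames + gate * temporal_feature   # broadcast over the T frames


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 8, 16
    cnn_feats = rng.normal(size=(T, D))      # stand-in for per-frame CNN features
    trans_feats = rng.normal(size=(T, D))    # stand-in for per-frame Transformer features
    w_gate = rng.normal(scale=0.1, size=(D, D))

    # Assumption: both branches are summed before temporal aggregation.
    temporal = hierarchical_temporal_aggregation(cnn_feats + trans_feats)
    cnn_out = gated_delivery(cnn_feats, temporal, w_gate)
    trans_out = gated_delivery(trans_feats, temporal, w_gate)

    # One video-level descriptor: concatenate the time-averaged branch features.
    descriptor = np.concatenate([cnn_out.mean(axis=0), trans_out.mean(axis=0)])
    print(descriptor.shape)
```

In this sketch the same aggregated vector is delivered to both branches, which mirrors the abstract's description of sharing temporal information between the CNN and Transformer branches; a trained model would learn the gate parameters rather than sample them at random.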


research
04/05/2021

A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification

Video-based person re-identification (Re-ID) aims to retrieve video sequ...
research
07/13/2021

HAT: Hierarchical Aggregation Transformers for Person Re-identification

Recently, with the advance of deep Convolutional Neural Networks (CNNs),...
research
02/22/2018

Video Person Re-identification by Temporal Residual Learning

In this paper, we propose a novel feature learning framework for video p...
research
08/03/2017

Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification

Person Re-Identification (person re-id) is a crucial task as its applica...
research
07/25/2021

Spatio-Temporal Representation Factorization for Video-based Person Re-Identification

Despite much recent progress in video-based person re-identification (re...
research
05/30/2020

Complex Sequential Understanding through the Awareness of Spatial and Temporal Concepts

Understanding sequential information is a fundamental task for artificia...
research
01/02/2023

Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification

In recent years, the Transformer architecture has shown its superiority ...
