TVPR: Text-to-Video Person Retrieval and a New Benchmark

by   Fan Ni, et al.
NetEase, Inc

Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, due to the lack of dynamic information provided by isolated frames, the performance is hampered when the person is obscured in isolated frames or variable motion details are given in the textual description. In this paper, we propose a new task called Text-to-Video Person Retrieval(TVPR) which aims to effectively overcome the limitations of isolated frames. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, such as person's appearance, actions and interactions with environment, etc., termed as Text-to-Video Person Re-identification (TVPReid) dataset, which will be publicly available. To this end, a Text-to-Video Person Retrieval Network (TVPRN) is proposed. Specifically, TVPRN acquires video representations by fusing visual and motion representations of person videos, which can deal with temporal occlusion and the absence of variable motion details in isolated frames. Meanwhile, we employ the pre-trained BERT to obtain caption representations and the relationship between caption and video representations to reveal the most relevant person videos. To evaluate the effectiveness of the proposed TVPRN, extensive experiments have been conducted on TVPReid dataset. To the best of our knowledge, TVPRN is the first successful attempt to use video for text-based person retrieval task and has achieved state-of-the-art performance on TVPReid dataset. The TVPReid dataset will be publicly available to benefit future research.


page 1

page 7

page 9


DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval

Many previous methods on text-based person retrieval tasks are devoted t...

Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations

In this paper we introduce a new natural language processing dataset and...

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

In text-video retrieval, the objective is to learn a cross-modal similar...

CAIBC: Capturing All-round Information Beyond Color for Text-based Person Retrieval

Given a natural language description, text-based person retrieval aims t...

Cascade Attention Network for Person Search: Both Image and Text-Image Similarity Selection

Person search with natural language aims to retrieve the corresponding p...

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

In Actor and Observer we introduced a dataset linking the first and thir...

APES: Audiovisual Person Search in Untrimmed Video

Humans are arguably one of the most important subjects in video streams,...

Please sign up or login with your details

Forgot password? Click here to reset