Text-Visual Prompting for Efficient 2D Temporal Video Grounding

by Yimeng Zhang, et al.

In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting and ending time points of the moment described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming and demands intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both the visual inputs and the textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train the vision encoder and the language encoder in a 2D TVG model, and that it improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. The proposed prompts also compensate for the lack of spatiotemporal information in 2D CNNs for visual feature extraction. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Finally, extensive experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79 Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Code and model will be released.
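To make the grounding objective concrete, the following is a minimal sketch of the plain temporal IoU between a predicted and a ground-truth moment, each given as (start, end) in seconds. It is not the TDIoU loss proposed in the paper, whose exact form the abstract does not give. The function name `temporal_iou` and the example intervals are assumptions chosen for illustration.

```python
def temporal_iou(pred, target):
    """Intersection-over-union of two time intervals (start, end) in seconds.

    Illustrative only: this is the standard temporal IoU, not the paper's
    TDIoU loss, which the abstract names but does not define.
    """
    p_start, p_end = pred
    t_start, t_end = target
    if p_end < p_start or t_end < t_start:
        raise ValueError("each interval must satisfy start <= end")

    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection

    # Two zero-length intervals have no union; treat their IoU as 0.
    return intersection / union if union > 0 else 0.0


# Hypothetical prediction (2 s to 8 s) against a ground truth (4 s to 10 s).
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))
```

A value of 1 means the predicted moment matches the annotated one exactly, and 0 means the two do not overlap. A training loss built on this quantity would typically minimize one minus the IoU.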


Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

In this paper, we study the problem of weakly-supervised temporal ground...

Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Automatic generation of video captions is a fundamental challenge in com...

What Is the Difference Between a Mountain and a Molehill? Quantifying Semantic Labeling of Visual Features in Line Charts

Relevant language describing visual features in charts can be useful for...

Visual Encoding and Debiasing for CTR Prediction

Extracting expressive visual features is crucial for accurate Click-Thro...

Towards Visual Feature Translation

Most existing visual search systems are deployed based upon fixed kinds ...

You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

Given an untrimmed video, temporal sentence grounding (TSG) aims to loca...

Direct Visual Servoing Based on Discrete Orthogonal Moments

This paper proposes a new approach to achieve direct visual servoing (DV...
