Recent advancements in vision-language pre-training (e.g. CLIP) have sho...
Large deep learning models have achieved remarkable success in many scen...
Text-video retrieval is an important multi-modal learning task, where th...
Visual grounding, i.e., localizing objects in images according to natura...
Spatial redundancy widely exists in visual recognition tasks, i.e., disc...
Recent works have shown that the computational efficiency of video recog...
In this paper, we explore the spatial redundancy in video recognition wi...
Reusing features in deep networks through dense connectivity is an effec...