Deep learning has made significant strides in video understanding tasks,...
We introduce Sketch-based Video Object Localization (SVOL), a new task a...
Standard multi-modal models assume the use of the same modalities in tra...
Camera traps, unmanned observation devices, and deep learning-based imag...
In multi-modal action recognition, it is important to consider not only ...
We present a new paradigm named explore-and-match for video grounding, w...
Identifying relations between objects is central to understanding the sc...
In this work, we seek new insights into the underlying challenges of the...