This report introduces our novel method named STHG for the Audio-Visual
...
The rapid advancement of generative models, facilitating the creation of...
Do video-text transformers learn to model temporal relationships across
...
The task of dynamic scene graph generation (SGG) from videos is complica...
This report describes our approach for the Audio-Visual Diarization (AVD...
Active speaker detection (ASD) in videos with multiple speakers is a
cha...
We address the problem of active speaker detection through a new framewo...
It is well known that human gaze carries significant information about v...
Temporally localizing activities within untrimmed videos has been extens...
TASED-Net is a 3D fully-convolutional network architecture for video sal...
Deep neural networks have achieved impressive success in large-scale vis...