Large vision-language models (VLMs), such as CLIP, learn rich joint
imag...
We wish to automatically predict the "speediness" of moving objects in
v...
Separating mixed distributions is a long standing challenge for machine
...
Many speech segments in movies are re-recorded in a studio during
postpr...
We present a joint audio-visual model for isolating a single speech sign...
Isolating the voice of a specific person while filtering out other voice...
Speechreading is the task of inferring phonetic information from visuall...
Speechreading is a notoriously difficult task for humans to perform. In ...
While egocentric video is becoming increasingly popular, browsing it is ...