Audio-visual representation learning aims to develop systems with human-...
This work investigates the use of large-scale, pre-trained models (CLIP ...
Recent visuolinguistic pre-trained models show promising progress on var...
Data-driven speech processing models usually perform well with a large a...