Audio-visual representation learning aims to develop systems with human-...
To realize human-robot collaboration, robots need to execute actions for...
In this paper, we show that representations capturing syllabic units eme...
We investigate the emergent abilities of the recently proposed web-scale...
For the majority of the machine learning community, the expensive nature...
In this paper, we propose a simple yet powerful improvement over the rec...
We present a method for visually-grounded spoken term discovery. After t...
In this paper, we describe our submissions to the ZeroSpeech 2021 Challe...
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-...
We propose a new unsupervised model for mapping a variable-duration spee...