FUN! Fast, Universal, Non-Semantic Speech Embeddings
Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings where run-time performance is a significant bottleneck. We propose a class of lightweight universal speech embedding models based on MobileNet that are designed to run efficiently on mobile devices. These embeddings, which capture non-semantic aspects of speech and can therefore be reused across several tasks, are trained via knowledge distillation. We show that these embedding models are fast enough to run in real time on a variety of mobile devices and exhibit negligible performance degradation on most tasks in a recently published benchmark of non-semantic speech tasks. Furthermore, we demonstrate that these representations are useful for mobile health tasks such as mask detection during speech and detection of non-speech human sounds.
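The abstract describes training a small MobileNet-style student to reproduce the embeddings of a larger teacher model via knowledge distillation. The sketch below illustrates that general idea only: the architecture, dimensions, and function names are hypothetical and not taken from the paper, and the student is regressed onto frozen teacher embeddings with a simple L2 loss as one common choice of distillation objective.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a small convolutional student (standing in for a
# MobileNet-style encoder) learns to match the embeddings of a larger,
# frozen teacher speech-embedding model. All names/sizes are illustrative.

class SmallSpeechEncoder(nn.Module):
    """Toy student that maps log-mel spectrogram patches to a fixed-size embedding."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling over time/frequency
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames)
        h = self.conv(x).flatten(1)    # (batch, 64)
        return self.proj(h)            # (batch, embed_dim)

def distillation_step(student, teacher, batch, optimizer):
    """One distillation step: regress student embeddings onto frozen teacher embeddings."""
    with torch.no_grad():
        target = teacher(batch)        # teacher output, no gradients
    pred = student(batch)
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the distillation target is the teacher's embedding rather than task labels, the resulting student embedding can be reused as a feature extractor for multiple downstream non-semantic tasks.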