Automated detection of foreground speech with wearable sensing in everyday home environments: A transfer learning approach

by Dawei Liang, et al.

Acoustic sensing has proved effective as a foundation for numerous applications in health and human behavior analysis. In this work, we focus on the problem of detecting in-person social interactions in naturalistic settings from audio captured by a smartwatch. As a first step towards detecting social interactions, it is critical to distinguish the speech of the individual wearing the watch from all other sounds nearby, such as speech from other individuals and ambient sounds. This is very challenging in realistic settings, where interactions take place spontaneously and supervised models cannot be trained a priori to recognize the full complexity of dynamic social environments. In this paper, we introduce a transfer learning-based approach to detect the foreground speech of users wearing a smartwatch. A highlight of the method is that it does not depend on the collection of voice samples to build user-specific models. Instead, the approach is based on knowledge transfer from general-purpose speaker representations derived from public datasets. Our experiments demonstrate that our approach performs comparably to a fully supervised model, with 80[…] a dataset of 31 hours of smartwatch-recorded audio collected in 18 homes with a total of 39 participants performing various semi-controlled tasks.
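The abstract does not describe the model in detail. As a generic illustration of the transfer-learning pattern it names (reuse a representation learned on public speaker data and train only a small classifier on top, with no enrollment recordings from the wearer), the Python sketch below freezes a stand-in encoder and fits a logistic-regression head that labels segments as foreground or non-foreground speech. Everything here is an assumption: `frozen_encoder`, `train_head`, `predict`, and `make_synthetic_segments` are hypothetical helpers, the "pretrained" encoder is a fixed random projection rather than a real speaker-embedding network, and the data are synthetic. It is not the authors' pipeline.

```python
import numpy as np

# Hypothetical stand-in for a general-purpose speaker encoder pretrained on
# public data. A real system would use a trained neural network; here a fixed
# random projection plus tanh plays that role and is never updated.
FEATURE_DIM = 40
EMBED_DIM = 16
_rng = np.random.default_rng(0)
_W_FROZEN = _rng.normal(size=(FEATURE_DIM, EMBED_DIM)) / np.sqrt(FEATURE_DIM)


def frozen_encoder(features: np.ndarray) -> np.ndarray:
    """Map segment features of shape (n, FEATURE_DIM) to embeddings (n, EMBED_DIM)."""
    return np.tanh(features @ _W_FROZEN)


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def train_head(emb: np.ndarray, labels: np.ndarray,
               lr: float = 0.5, epochs: int = 500) -> tuple[np.ndarray, float]:
    """Fit only a logistic-regression head on the frozen embeddings (batch gradient descent)."""
    w = np.zeros(emb.shape[1])
    b = 0.0
    for _ in range(epochs):
        grad = sigmoid(emb @ w + b) - labels  # d(mean log-loss)/d(logits), per sample
        w -= lr * emb.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b


def predict(emb: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Return 1 for foreground (wearer) speech and 0 for everything else."""
    return (sigmoid(emb @ w + b) >= 0.5).astype(int)


def make_synthetic_segments(n: int, seed: int) -> tuple[np.ndarray, np.ndarray]:
    """Toy data only: 'foreground' segments have every feature shifted by +2."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    features = rng.normal(size=(n, FEATURE_DIM)) + 2.0 * labels[:, None]
    return features, labels


# Usage: train the head on one synthetic split, evaluate on another.
train_x, train_y = make_synthetic_segments(400, seed=1)
test_x, test_y = make_synthetic_segments(200, seed=2)
w, b = train_head(frozen_encoder(train_x), train_y)
accuracy = float((predict(frozen_encoder(test_x), w, b) == test_y).mean())
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

The design choice being illustrated is that only the head's 17 parameters (16 weights and a bias) are learned for the target task while the encoder is reused unchanged; this is one way a system could avoid collecting per-user voice samples. The toy accuracy says nothing about real smartwatch audio.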


Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning

Deep learning models are becoming predominant in many fields of machine ...

Read the Room: Adapting a Robot's Voice to Ambient and Social Contexts

Adapting one's voice to different ambient environments and social intera...

Unsupervised Audio-Visual Subspace Alignment for High-Stakes Deception Detection

Automated systems that detect deception in high-stakes situations can en...

Speech Tasks Relevant to Sleepiness Determined with Deep Transfer Learning

Excessive sleepiness in attention-critical contexts can lead to adverse ...

Discriminate natural versus loudspeaker emitted speech

In this work, we address a novel, but potentially emerging, problem of d...

Detecting Escalation Level from Speech with Transfer Learning and Acoustic-Lexical Information Fusion

Textual escalation detection has been widely applied to e-commerce compa...

Finding Dory in the Crowd: Detecting Social Interactions using Multi-Modal Mobile Sensing

Remembering our day-to-day social interactions is challenging even if yo...