Towards visually prompted keyword localisation for zero-resource spoken languages

10/12/2022
by Leanne Nortje, et al.

Imagine being able to show a system a visual depiction of a keyword and find spoken utterances that contain this keyword in a zero-resource speech corpus. We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect whether the keyword occurs in an utterance and predict where it occurs. To do VPKL, we propose a speech-vision model with a novel localising attention mechanism, which we train with a new keyword sampling scheme. We show that these innovations improve VPKL performance over an existing speech-vision model. We also compare to a visual bag-of-words (BoW) model in which images are automatically tagged with visual labels and paired with unlabelled speech. Although this visual BoW can be queried directly with a written keyword (while ours takes image queries), our new model still outperforms the visual BoW in both detection and localisation, giving a 16% relative improvement in localisation F1.
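The abstract does not describe the model's internals, so the following is only a rough sketch of what the VPKL task asks a system to output, not the authors' localising attention mechanism or their evaluation protocol. It assumes (hypothetically) that some image encoder and speech encoder already map the image query and each speech frame into a shared embedding space. Each frame is scored by cosine similarity; the keyword counts as detected if the best frame clears a threshold (the default of 0.5 is an arbitrary assumption), and it is localised at that frame. The `localisation_f1` helper assumes a hit means the predicted frame falls inside the keyword's ground-truth frame span.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax, used here as attention weights over frames."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visually_prompted_localise(
    image_emb: Vector,
    frame_embs: Sequence[Vector],
    threshold: float = 0.5,  # hypothetical detection threshold
) -> Tuple[bool, Optional[int], List[float]]:
    """Hypothetical VPKL inference: detect the keyword, then point at one frame.

    Returns (detected, frame_index_or_None, attention_weights).
    """
    scores = [cosine(image_emb, f) for f in frame_embs]
    weights = softmax(scores)
    best = max(range(len(scores)), key=lambda t: scores[t])  # first best frame
    detected = scores[best] >= threshold
    return detected, (best if detected else None), weights


def localisation_f1(
    predictions: Sequence[Optional[int]],
    spans: Sequence[Optional[Tuple[int, int]]],
) -> float:
    """F1 where a hit is a predicted frame inside the true keyword span.

    predictions[i] is a frame index, or None if nothing was detected.
    spans[i] is the inclusive (start, end) frame span of the keyword in
    utterance i, or None if the keyword is absent.
    """
    tp = fp = fn = 0
    for pred, span in zip(predictions, spans):
        if pred is not None and span is not None and span[0] <= pred <= span[1]:
            tp += 1
        else:
            if pred is not None:
                fp += 1  # spurious or misplaced prediction
            if span is not None:
                fn += 1  # keyword present but not correctly localised
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


if __name__ == "__main__":
    query = [1.0, 0.0]  # toy image-query embedding
    frames = [[0.0, 1.0], [0.7, 0.7], [1.0, 0.1], [0.0, 1.0]]
    print(visually_prompted_localise(query, frames)[:2])  # (True, 2)
    print(localisation_f1([2, None, 5], [(1, 3), (0, 2), None]))  # 0.5
```

Separating detection (is the keyword in the utterance?) from localisation (which frame?) mirrors the two questions in the VPKL task definition; a trained model would replace the raw cosine scores with learned attention.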



research
06/16/2021

Attention-Based Keyword Localisation in Speech using Visual Grounding

Visually grounded speech models learn from images paired with spoken cap...
research
10/29/2021

Visual Keyword Spotting with Attention

In this paper, we consider the task of spotting spoken keywords in silen...
research
12/31/2020

EfficientNet-Absolute Zero for Continuous Speech Keyword Spotting

Keyword spotting is a process of finding some specific words or phrases ...
research
02/11/2020

Phoneme Boundary Detection using Learnable Segmental Features

Phoneme boundary detection plays an essential first step for a variety o...
research
08/31/2023

PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords

This study presents a novel zero-shot user-defined keyword spotting mode...
research
12/14/2020

Towards localisation of keywords in speech using weak supervision

Developments in weakly supervised and self-supervised models could enabl...
research
08/23/2021

End-to-End Open Vocabulary Keyword Search

Recently, neural approaches to spoken content retrieval have become popu...
