Semantic query-by-example speech search using visual grounding

04/15/2019
by Herman Kamper, et al.

A number of recent studies have started to investigate how speech systems can be trained on untranscribed speech by leveraging accompanying images at training time. Examples of tasks include keyword prediction and within- and across-mode retrieval. Here we consider how such models can be used for query-by-example (QbE) search, the task of retrieving utterances relevant to a given spoken query. We are particularly interested in semantic QbE, where the task is not only to retrieve utterances containing exact instances of the query, but also utterances whose meaning is relevant to the query. We follow a segmental QbE approach where variable-duration speech segments (queries, search utterances) are mapped to fixed-dimensional embedding vectors. We show that a QbE system using an embedding function trained on visually grounded speech data outperforms a purely acoustic QbE system in terms of both exact and semantic retrieval performance.
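The segmental QbE approach described above can be illustrated with a minimal sketch. It assumes details the abstract does not state: the `embed` function is a placeholder (mean pooling over frames followed by L2 normalisation) standing in for a learned embedding function, it is not the visually grounded model, and the sliding-window search, the cosine distance, and the `win` and `hop` parameters are illustrative choices. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]  # one acoustic feature vector per frame


def embed(segment: Sequence[Frame]) -> List[float]:
    """Map a variable-duration segment to a fixed-dimensional unit vector.

    Placeholder only: mean over frames, then L2 normalisation. A trained
    embedding function would be substituted here.
    """
    dim = len(segment[0])
    mean = [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid division by zero
    return [x / norm for x in mean]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance between two vectors assumed to be unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def utterance_score(query_emb: Sequence[float], utterance: Sequence[Frame],
                    win: int, hop: int) -> float:
    """Smallest distance between the query and any window of the utterance."""
    last_start = max(len(utterance) - win, 0)
    return min(
        cosine_distance(query_emb, embed(utterance[start:start + win]))
        for start in range(0, last_start + 1, hop)
    )


def rank_utterances(query: Sequence[Frame], utterances: Sequence[Sequence[Frame]],
                    win: int = 3, hop: int = 1) -> List[int]:
    """Return utterance indices ordered from most to least relevant."""
    query_emb = embed(query)
    scores = [utterance_score(query_emb, utt, win, hop) for utt in utterances]
    return sorted(range(len(utterances)), key=lambda i: scores[i])
```

Because every segment is reduced to a vector of the same size, comparing a query to a candidate segment costs one dot product regardless of segment duration. Whether the ranking reflects exact or semantic matches depends entirely on how the embedding function was trained.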


