KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos

03/01/2019
by   Egor Lakomkin, et al.
0

In this paper, we describe KT-Speech-Crawler: an approach for automatic dataset construction for speech recognition by crawling YouTube videos. We outline several filtering and post-processing steps, which extract samples that can be used for training end-to-end neural speech recognition systems. In our experiments, we demonstrate that a single-core version of the crawler can obtain around 150 hours of transcribed speech within a day, containing an estimated 3.5 samples contain reading and spontaneous speech recorded in various conditions including background noise and music, distant microphone recordings, and a variety of accents and reverberation. When training a deep neural network on speech recognition, we observed around 40% word error rate reduction on the Wall Street Journal dataset by integrating 200 hours of the collected samples into the training set. The demo (http://emnlp-demo.lakomkin.me/) and the crawler code (https://github.com/EgorLakomkin/KTSpeechCrawler) are publicly available.

READ FULL TEXT
research
12/17/2021

JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

In this paper, we construct a new Japanese speech corpus called "JTubeSp...
research
11/02/2018

Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

Building an accurate automatic speech recognition (ASR) system requires ...
research
04/07/2021

Pushing the Limits of Non-Autoregressive Speech Recognition

We combine recent advancements in end-to-end speech recognition to non-a...
research
11/08/2019

Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

This work presents a large-scale audio-visual speech recognition system ...
research
07/13/2018

Large-Scale Visual Speech Recognition

This work presents a scalable solution to open-vocabulary visual speech ...
research
12/01/2017

Visual Features for Context-Aware Speech Recognition

Automatic transcriptions of consumer-generated multi-media content such ...
research
08/02/2018

AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies

Speech activity detection (or endpointing) is an important processing st...

Please sign up or login with your details

Forgot password? Click here to reset