Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization
Sound event detection (SED) is the task of detecting sound events in an audio recording. One challenge of SED is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled: each audio clip carries only audio tags, without the onset and offset times of the sound events. To address the weakly labelled SED problem, we investigate segment-wise and clip-wise training methods. The proposed systems are based on variants of convolutional neural networks (CNNs), including convolutional recurrent neural networks and our proposed CNN-Transformers, for audio tagging and sound event detection. Another challenge of SED is that systems predict only the presence probabilities of sound events, so thresholds are required to decide the presence or absence of each event. Previous work set these thresholds empirically, which is not optimal. To solve this problem, we propose an automatic threshold optimization method. In the first stage, the system is optimized with respect to metrics that do not depend on the thresholds, such as mean average precision (mAP). In the second stage, the thresholds are optimized with respect to the metric that does depend on them. The proposed automatic threshold optimization achieves state-of-the-art audio tagging and SED F1 scores of 0.646 and 0.584, outperforming the scores of 0.629 and 0.564 obtained with the best manually selected thresholds, respectively.
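The abstract does not specify how the second-stage threshold optimization is carried out, so as a rough illustration only, here is a minimal sketch of one plausible approach: a coordinate-wise grid search for per-class thresholds that maximize macro F1 on a held-out validation set. The function name `optimize_thresholds`, the search grid, and the sweep count are all hypothetical choices for this sketch, not the authors' algorithm.

```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19), n_sweeps=3):
    """Coordinate-wise grid search for per-class thresholds that maximize
    macro F1 on a held-out validation set (illustrative sketch).

    probs:  (n_clips, n_classes) predicted presence probabilities
    labels: (n_clips, n_classes) binary ground-truth tags
    """
    n_classes = probs.shape[1]
    thresholds = np.full(n_classes, 0.5)  # common default starting point

    for _ in range(n_sweeps):              # repeat sweeps until roughly stable
        for k in range(n_classes):         # optimize one class's threshold at a time
            best_t, best_f1 = thresholds[k], -1.0
            for t in grid:
                thresholds[k] = t
                preds = (probs >= thresholds).astype(int)
                f1 = f1_score(labels, preds, average='macro', zero_division=0)
                if f1 > best_f1:
                    best_t, best_f1 = t, f1
            thresholds[k] = best_t         # keep the best value found this sweep
    return thresholds
```

Because macro F1 is non-differentiable in the thresholds, a derivative-free search like this is one simple option; the key point the abstract makes is only that the thresholds are tuned on the evaluation metric itself rather than fixed by hand.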