Polyphonic Pitch Tracking with Deep Layered Learning
This paper presents a polyphonic pitch tracking system able to extract both framewise and note-based estimates from audio. The system uses six artificial neural networks in a deep layered learning setup. First, cascading networks are applied to a spectrogram for framewise fundamental frequency (f0) estimation. A sparse receptive field is learned by the first network and then used for weight-sharing throughout the system. The f0 activations are connected across time to extract pitch ridges. These ridges define a framework within which subsequent networks perform tone-shift-invariant onset and offset detection. The networks convolve the pitch ridges across time, taking as input, for example, variations of latent representations from the f0 estimation networks, defined as the "neural flux." Finally, incorrect tentative notes are removed one by one in an iterative procedure that allows a network to classify notes within an accurate context. The system was evaluated on four public datasets (MAPS, Bach10, TRIOS, and the MIREX Woodwind Quintet) and achieved state-of-the-art results on all four. The system performs well across all subtasks: f0, pitched onset, and pitched offset tracking.
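The step of connecting framewise f0 activations across time into pitch ridges can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a thresholded activation matrix (frames × pitch bins) and greedily links active bins in adjacent frames when they lie within a small pitch tolerance (the `threshold` and `max_jump` parameters are illustrative).

```python
import numpy as np

def extract_pitch_ridges(activations, threshold=0.5, max_jump=1):
    """Illustrative sketch: connect framewise f0 activations across time
    into pitch ridges (contiguous pitch trajectories).

    activations: (n_frames, n_bins) array of f0 activation strengths.
    A ridge is a list of (frame, bin) pairs; a ridge continues into the
    next frame if an active bin lies within `max_jump` bins of its end.
    """
    n_frames, _ = activations.shape
    active = activations >= threshold
    ridges = []        # completed ridges
    open_ridges = {}   # last bin of an ongoing ridge -> its (frame, bin) list
    for t in range(n_frames):
        new_open = {}
        for b in np.flatnonzero(active[t]):
            # extend a ridge that ended at a nearby bin in the previous frame
            match = next((pb for pb in open_ridges
                          if abs(pb - b) <= max_jump), None)
            if match is not None:
                ridge = open_ridges.pop(match)
                ridge.append((t, int(b)))
            else:
                ridge = [(t, int(b))]  # start a new ridge
            new_open[int(b)] = ridge
        # ridges not extended in this frame are closed
        ridges.extend(open_ridges.values())
        open_ridges = new_open
    ridges.extend(open_ridges.values())
    return ridges
```

For example, an activation matrix with a single trajectory that drifts up by one bin mid-way yields one ridge spanning all frames, since the one-bin jump falls within the tolerance.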