Learning Music Sequence Representation from Text Supervision

by Tianyu Chen, et al.

Music representation learning is notoriously difficult because of the complex, human-related concepts embedded in sequences of numerical signals. To extract better MUsic SEquence Representations from labeled audio, we propose a novel text-supervised pre-training method, MUSER. MUSER adopts an audio-spectrum-text tri-modal contrastive learning framework, in which the text input can be any form of metadata rendered through text templates, while the spectrum is derived from the audio sequence. Our experiments show that MUSER adapts to downstream tasks more flexibly than the current data-hungry pre-training methods, and it requires only 0.056× the data to achieve state-of-the-art performance.
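To make the tri-modal objective concrete, here is a minimal sketch of pairwise contrastive (InfoNCE-style) alignment across three modality embeddings. The function names, the temperature value, and the choice of summing all three pairwise losses are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Numerically stable softmax cross-entropy over rows of a logit matrix.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def info_nce(a, b, temperature=0.1):
    # Symmetric InfoNCE between two batches of embeddings: matching
    # (audio_i, text_i) pairs sit on the diagonal of the similarity matrix.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    targets = np.arange(len(a))
    return 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))

def tri_modal_loss(audio, spec, text):
    # One hedged way to combine the three modalities: sum the pairwise
    # contrastive losses over (audio, spectrum), (audio, text), (spectrum, text).
    return info_nce(audio, spec) + info_nce(audio, text) + info_nce(spec, text)
```

In this sketch, well-aligned embeddings (matching items close in all three spaces) yield a lower loss than unrelated random embeddings, which is the training signal a contrastive pre-training framework exploits.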


