This repository contains the code and models of the paper: Singer Identity Representation Learning using Self-Supervised Techniques.
You can find additional visualizations of the singer embeddings and supplementary material here.
Bernardo Torres, Stefan Lattner and Gaël Richard
You can download and load the pretrained models using the following command:
from singer_identity import load_model
model = load_model(model_name)
model.eval()
This will load the model using HuggingFace Hub.
You can also use load_model(model_name, torchscript=True) to load a scripted version of the model. If using a sample rate different from 44.1 kHz, you can specify it with input_sr, e.g. load_model(model_path, input_sr=16000). This will upsample the audio to 44.1 kHz before computing the embeddings. Please note that the model was trained on full-band signals, so a difference in performance can be expected.
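For example, a minimal sketch assuming the pretrained byol model listed below and 16 kHz input audio:
from singer_identity import load_model

# load_model upsamples the 16 kHz input to 44.1 kHz internally before computing embeddings
model = load_model('byol', input_sr=16000)
model.eval()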
The pretrained models are available on HuggingFace Hub:
- byol: trained with BYOL
- contrastive: trained with the decoupled contrastive loss
- contrastive-vc: trained with the decoupled contrastive loss + variance and covariance regularizations
- uniformity: trained with the uniformity loss
- vicreg: trained with the VICReg loss
Example:
from singer_identity import load_model
model = load_model('byol')
model.eval()
audio_batch = ... # Get audio from somewhere (here in 44.1 kHz), shape: (batch_size, n_samples)
embeddings = model(audio_batch) # shape: (batch_size, 1000)
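As an illustration (not part of the repository's API), the embeddings can be compared with cosine similarity to score how likely two excerpts come from the same singer; audio_batch_a and audio_batch_b below are placeholder batches of 44.1 kHz audio:
import torch.nn.functional as F

emb_a = model(audio_batch_a)  # (batch_size, 1000) embeddings of a first batch of excerpts
emb_b = model(audio_batch_b)  # (batch_size, 1000) embeddings of a second batch
similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)  # (batch_size,), higher = more similar singers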
We provide the code to train a simple model on the following SSL tasks:
- Contrastive Learning (SimCLR, COLA) [1,2]
- Uniformity-Alignment [3]
- VICReg [4]
- BYOL [5]
The default backbone is EfficientNet-B0 [6], with average pooling for temporal aggregation.
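For illustration, a minimal sketch of the temporal aggregation step (not the repository's exact code; the feature shape is hypothetical):
import torch

# the backbone produces frame-level features of shape (batch, n_frames, feat_dim);
# average pooling over the time axis yields one fixed-size embedding per excerpt
frame_features = torch.randn(8, 200, 1280)       # hypothetical EfficientNet-B0 frame features
excerpt_embeddings = frame_features.mean(dim=1)  # shape: (8, 1280)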
Our training script uses PyTorch Lightning and Lightning CLI. To train a model, use the train.py script as follows:
python train.py --config common.yaml --config model_config.yaml
See the config folder for details on the configuration file for each SSL training.
To load a model from a local path (e.g. for testing trained or finetuned models), make sure to place the model file model.pt in a folder model_folder with the corresponding hyperparams.yaml:
model = load_model('model_folder', source='/path/to/model/folder')
model.eval()
To convert a PyTorch Lightning checkpoint to an Identity Encoder model.pt, use the convert_checkpoint.py script:
python convert_checkpoint.py --checkpoint /path/to/checkpoint.ckpt --config /path/to/config.yaml --output_dir /path/to/output_dir
The default dataloader expects the training data to be organized in the following structure:
├── dataset1_name
│ ├── singer1 <- .wav files of singer 1 should be placed here, up to 3 levels of subfolders are allowed
│ │ ├── file1.wav
│ │ ├── ..
│ ├── singer2
│ ├── singer3
│ └── ...
├── dataset2_name
│ ├── groupn..
The folder labels are not used during training, as we sample audio files independently at random. In the config file used to launch training, specify the datasets to use as follows:
data:
  class_path: singer_identity.data.siamese_encoders.SiameseEncodersDataModule  # the default dataloader class
  init_args:
    dataset_dirs:
      - '/Path/to/dataset1/dataset1_name'
      - '/Path/to/dataset2/dataset2_name'
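For reference, a sketch of the equivalent programmatic instantiation (assuming only the class path and the dataset_dirs argument shown above; additional init_args may be required, see the config folder):
from singer_identity.data.siamese_encoders import SiameseEncodersDataModule

# hypothetical local paths; each directory should follow the folder structure described above
datamodule = SiameseEncodersDataModule(
    dataset_dirs=[
        '/Path/to/dataset1/dataset1_name',
        '/Path/to/dataset2/dataset2_name',
    ]
)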
Data Augmentations: Data augmentations are applied in the time domain on the fly. To set up the augmentations used in the paper, check the config folder.
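As an illustration of on-the-fly time-domain augmentation, here is a sketch using the audiomentations dependency; the transforms and parameters below are examples, not necessarily the ones used in the paper:
import numpy as np
from audiomentations import Compose, AddGaussianNoise, PitchShift

# example augmentation chain applied directly to the waveform
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
])

audio = np.random.randn(4 * 44100).astype(np.float32)  # 4 seconds of dummy audio
augmented = augment(samples=audio, sample_rate=44100)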
Visualizing Training Logs: You can visualize the training logs using TensorBoard if you wish. Install TensorBoard and run the following command: tensorboard --logdir ./logs. Replace the class_path field in the config file to use a different logger.
- PyTorch Lightning for training
- Lightning CLI for launching training using yaml config files
- nnAudio for computing mel-spectrograms on the fly
- Audiomentations for data augmentation
- Soundfile for audio loading
- Parselmouth for pitch shifting
You can use the provided environment.yml file to create a conda environment with the required dependencies.
The following steps prepare the data for evaluation as used in the paper: the audio files are cropped into non-overlapping segments of n seconds and copied to a flattened structure.
- Make sure the dataset is in the following structure:
├── dataset_root_folder
├── dataset_name
│ ├── singer1 <- .wav files of singer 1 should be placed here, up to 3 levels of subfolders are allowed
│ │ ├── file1.wav
│ │ ├── subsubdir
│ │ │ ├── file2.wav
│ ├── singer2
│ ├── singer3
│ └── ...
- Run the preprocessing script to flatten the wav files under the singer subdirectories and crop them into segments of n_seconds seconds:
python preprocess_dataset.py --dataset_root_dir root_folder --dataset_name dataset_folder --segment_length n_seconds --sample_rate sample_rate
This script extracts the wav files from the nested structure and places them in one level per singer. It duplicates the files and crops them into n_seconds segments. It also renames them to the following format: {subdir}_{subsubdir}_{filename}_0_4_{n_seconds}s.wav, where 0 and 4 are the start and end of the segment in seconds, subdir is the first folder (usually the singer name) and subsubdir the second level (a minimal cropping sketch is shown after this list). The preprocessing pipeline will create the following structure:
├── dataset_name
│   ├── singer1
│   │   ├── singer1_file1_0_4_4s.wav
│   │   ├── singer1_file1_1_4_8_4s.wav  -- if the file is longer than 4 seconds, it will be split in 4s segments
│   │   ├── singer1_subsubdir_file2_2_0_4_4s.wav
│   ├── singer2
│   │   ├── singer2_file1_0_4_4s.wav
│   │   ├── ...
│   ├── singer3
│   └── ...
- You can compute speaker pairs for EER using the preprocess/compute_speaker_pairs.py script (or use the ones provided in the metadata folder, here (VocalSet) and here (M4singer)).
Example:
python create_speaker_pairs.py -r /path/to/dataset -o /where/sample_pairs/will/be/saved -n n_singers -p n_draws
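A minimal sketch of the non-overlapping cropping step (illustrative only, using the soundfile dependency; this is not the actual preprocess_dataset.py implementation):
import soundfile as sf

def crop_segments(wav_path, segment_length=4):
    # yield (start_second, sample_rate, samples) for each non-overlapping segment
    audio, sr = sf.read(wav_path)
    n = int(segment_length * sr)
    for start in range(0, len(audio) - n + 1, n):
        yield start // sr, sr, audio[start:start + n]

# hypothetical usage, encoding start/end seconds in the output filename:
# for start_s, sr, segment in crop_segments('file1.wav', segment_length=4):
#     sf.write(f'singer1_file1_{start_s}_{start_s + 4}_4s.wav', segment, sr)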
First, computing speaker trial pairs is needed (see above). They are stored in a metadata folder (e.g. metadata/vocalset/speaker_pairs.txt, metadata/vctk/speaker_pairs.txt). The EER computation follows the one available in SUPERB.
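For illustration, a common way to compute EER from similarity scores and trial labels (a sketch using scikit-learn, not the repository's exact implementation):
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 for same-singer pairs, 0 for different-singer pairs
    # scores: similarity score of each trial pair (higher = more similar)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2  # equal error rate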
python eval.py -s seed -r root -d data -m model -meta metadata -f -cr -ce -bs batch_size
The arguments are:
- -s: random seed for reproducibility
- -r: path to the dataset root folder
- -d: list of dataset folders to test on
- -meta: path to the metadata folder
- -m: path to the model file or HuggingFace model name
- -f: whether to compute scores using the encoder feature embeddings
- -cr: whether to compute Mean Normalized Rank (MNR)
- -ce: whether to compute EER
- -bs: batch size for evaluation
Also available:
- -du: whether to downsample the signals to 16kHz and upsample them back to 44.1kHz before computing the embeddings
- -p: whether to compute scores using the projection layer
Example:
python eval.py -s 123 -r /data/datasets -d vocalset -m byol_model -meta test_scores/metadata -f True -cr True -ce True -bs 128
If you want to evaluate your own models, simply override the load_id_extractor(model_file, source) method in eval.py.
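A minimal sketch of such an override (hypothetical; the only assumptions taken from this README are the load_id_extractor(model_file, source) signature and that the returned module maps raw audio batches to embeddings):
import torch

class MyIdentityEncoder(torch.nn.Module):
    # hypothetical encoder mapping (batch_size, n_samples) audio to (batch_size, 256) embeddings
    def __init__(self, embedding_dim=256):
        super().__init__()
        self.proj = torch.nn.Linear(1, embedding_dim)

    def forward(self, audio):
        # toy example: average over time, then a linear projection
        return self.proj(audio.mean(dim=-1, keepdim=True))

def load_id_extractor(model_file, source):
    # override this method in eval.py to plug in your own model;
    # model_file and source point to your checkpoint (loading logic omitted here)
    model = MyIdentityEncoder()
    model.eval()
    return model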
- To train singer identification linear classifiers: coming soon
The model was tested with the following out-of-domain datasets:
If you find this work useful for your research, please cite the paper:
@inproceedings{torres2023singer,
title={Singer Identity Representation Learning using Self-Supervised Techniques},
author={Torres, Bernardo and Lattner, Stefan and Richard, Gael},
booktitle={International Society for Music Information Retrieval Conference (ISMIR 2023)},
year={2023}
}
[1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020.
[2] A. Saeed, D. Grangier, and N. Zeghidour, “Contrastive learning of general-purpose audio representations,” in ICASSP, 2021.
[3] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in ICML, 2020.
[4] A. Bardes, J. Ponce, and Y. LeCun, “VICReg: Variance-invariance-covariance regularization for self-supervised learning,” in ICLR, 2022.
[5] J.-B. Grill et al., “Bootstrap your own latent - A new approach to self-supervised learning,” in NeurIPS, 2020.
[6] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in ICML, 2019.