PyTorch Implementation of AlignSTS (ACL 2023): a speech-to-singing (STS) model based on modality disentanglement and cross-modal alignment.
We provide our implementation and pretrained models in this repository.
Visit our demo page for audio samples.
- May, 2023: AlignSTS released on GitHub.
- May, 2023: AlignSTS accepted to ACL 2023 Findings.
We provide an example of how you can generate high-quality samples using AlignSTS.
You can use the pretrained models we provide here. Details of each model are as follows:
| Model | Description |
|---|---|
| AlignSTS | Acoustic model (config) |
| HiFi-GAN | Neural vocoder |
A suitable conda environment named `alignsts` can be created and activated with:

```bash
conda create -n alignsts python=3.8
conda activate alignsts
conda install --yes --file requirements.txt
```
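As a quick sanity check after installation (a minimal sketch; it assumes PyTorch is among the packages in `requirements.txt`), you can verify that PyTorch imports correctly and sees your GPU:

```python
# Sanity check: confirm PyTorch is installed in the alignsts env
# and a CUDA GPU is visible (assumes torch is in requirements.txt).
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA GPU is usable
```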
We provide a mini-set of test samples to demonstrate AlignSTS here. Specifically, we provide the samples in WAV format together with the corresponding binarized statistical files, which allow faster IO. Please download the statistical files to `data/binary/speech2singing-testdata/`; the WAV files themselves are for listening.
FYI, the naming rule of the WAV files is `[spk]#[song name]#[speech/sing identifier]#[sentence index].wav`. For example, a sample named `男3号#all we know#sing#14.wav` is a singing sample of the 14th sentence of the song "all we know", sung by speaker "男3号".
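If you need these fields programmatically, the name can be split on the `#` delimiter. Below is a minimal sketch; the `parse_sample_name` helper is hypothetical and not part of this repository:

```python
from pathlib import Path

def parse_sample_name(path: str) -> dict:
    """Split a sample filename of the form
    [spk]#[song name]#[speech/sing identifier]#[sentence index].wav
    into its four fields. Hypothetical helper, not part of the repo."""
    spk, song, kind, idx = Path(path).stem.split("#")
    return {"speaker": spk, "song": song, "type": kind, "index": int(idx)}

# Example:
print(parse_sample_name("男3号#all we know#sing#14.wav"))
# {'speaker': '男3号', 'song': 'all we know', 'type': 'sing', 'index': 14}
```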
Here we provide a speech-to-singing conversion pipeline using AlignSTS:

- Prepare AlignSTS (acoustic model): download the checkpoint and place it at `checkpoints/alignsts`.
- Prepare HiFi-GAN (neural vocoder): download the checkpoint and place it at `checkpoints/hifigan`.
- Prepare the test dataset: download its statistical files to `data/binary/speech2singing-testdata`.
- Run inference:

```bash
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --exp_name alignsts --infer --hparams "gen_dir_name=test" --config configs/singing/speech2singing/alignsts.yaml --reset
```

- You will find the outputs in `checkpoints/alignsts/generated_200000_test`, where `[G]` indicates ground-truth mel results and `[P]` indicates predicted results.
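To compare ground-truth and predicted outputs side by side, a small script can pair them up. This is a minimal sketch; it only assumes the `[G]`/`[P]` tags appear in the generated filenames as described above:

```python
from pathlib import Path

# Directory produced by the inference command above.
gen_dir = Path("checkpoints/alignsts/generated_200000_test")
wavs = list(gen_dir.rglob("*.wav"))

# Group files by their name with the [G]/[P] tag stripped,
# so each ground-truth file lines up with its prediction.
ground_truth = {p.name.replace("[G]", ""): p for p in wavs if "[G]" in p.name}
predicted = {p.name.replace("[P]", ""): p for p in wavs if "[P]" in p.name}

for key in sorted(ground_truth.keys() & predicted.keys()):
    print(f"GT:   {ground_truth[key]}")
    print(f"Pred: {predicted[key]}\n")
```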
This implementation uses parts of the code from the following GitHub repos, as described in our code: NATSpeech, DiffSinger, ProDiff, and SpeechSplit2.
If you find this code useful in your research, please cite our work:
```bib
@inproceedings{li2023alignsts,
  title={AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment},
  author={Li, Ruiqi and Huang, Rongjie and Zhang, Lichao and Liu, Jinglin and Zhao, Zhou},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2023},
  year={2023}
}
```
Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's speech or singing without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.