About the pretrained model on LibriTTS #20
Comments
Thanks for the nice work! I'm looking for good zero-shot TTS models, and I hope to use StyleTTS 2 (LibriTTS version) as a baseline. I think your model is much better than YourTTS or any recent LLM-based model in zero-shot TTS, so I would like to compare against it as the state of the art. I would appreciate it if you could share your plan for the LibriTTS model 😃 |
@yl4579 Hi, I'm looking forward to the pretrained model on LibriTTS. Thank you very much! |
Thank you for your interest in this work. I’m currently attending a few workshops and I’ll be busy with midterm exams after that, so the model release will be delayed a little bit. Expect it to arrive some time in early November. |
Hey there @yl4579 , I'm hoping to test out the LibriTTS-trained StyleTTS 2 as well. Would it be possible to release the training config for the multi-speaker version so I can try and train it on my own machines before you release the pre-trained models? P.S. Thanks for the work so far; the LJ version sounds very good. |
@gigadunk Here's the configuration that I am currently using to train the LibriTTS model. The dataset is very big, so the number of epochs needs to be adjusted according to the quality of the model.

log_dir: "Models/LibriTTS"
first_stage_path: "first_stage.pth"
save_freq: 1
log_interval: 10
device: "cuda"
epochs_1st: 50 # number of epochs for first stage training (pre-training)
epochs_2nd: 30 # number of epochs for second stage training (joint training)
batch_size: 16
max_len: 300 # maximum number of frames
pretrained_model: "Models/LibriTTS/epoch_2nd_00005.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  root_path: ""
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true
  dim_in: 64
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80
  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size
  dropout: 0.2

  # config for decoder
  decoder:
    type: 'hifigan' # either hifigan or istftnet
    resblock_kernel_sizes: [3,7,11]
    upsample_rates: [10,5,3,2]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
    upsample_kernel_sizes: [20,10,6,4]

  # speech language model config
  slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2
    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
  lambda_mel: 5. # mel reconstruction loss
  lambda_gen: 1. # generator loss
  lambda_slm: 1. # slm feature matching loss
  lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
  lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
  TMA_epoch: 4 # TMA starting epoch (1st stage)
  lambda_F0: 1. # F0 reconstruction loss (2nd stage)
  lambda_norm: 1. # norm reconstruction loss (2nd stage)
  lambda_dur: 1. # duration loss (2nd stage)
  lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
  lambda_sty: 1. # style reconstruction loss (2nd stage)
  lambda_diff: 1. # score matching loss (2nd stage)
  diff_epoch: 10 # style diffusion starting epoch (2nd stage)
  joint_epoch: 15 # joint training starting epoch (2nd stage)

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.00001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 20 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling
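
As a quick sanity check on the numbers in this config, the sketch below derives a few quantities from `sr`, `hop_length`, `win_length`, `max_len`, and `upsample_rates`. It is plain arithmetic on the values quoted above, not code from the repository, and the variable names for the derived values are only illustrative.

```python
from math import prod

# Values copied from the config above.
sr = 24000                      # preprocess_params.sr
hop_length = 300                # preprocess_params.spect_params.hop_length
win_length = 1200               # preprocess_params.spect_params.win_length
max_len = 300                   # maximum number of frames
upsample_rates = [10, 5, 3, 2]  # model_params.decoder.upsample_rates

# Derived quantities (names are illustrative, not config keys).
frame_ms = 1000 * hop_length / sr         # time step between mel frames, in ms
window_ms = 1000 * win_length / sr        # analysis window length, in ms
max_clip_s = max_len * hop_length / sr    # audio covered by max_len frames, in s
total_upsampling = prod(upsample_rates)   # overall upsampling of the vocoder stack

print(frame_ms, window_ms, max_clip_s, total_upsampling)
```

With these values, one mel frame advances 12.5 ms, `max_len: 300` covers 3.75 s of audio, and the product of `upsample_rates` equals `hop_length`, so the two would likely need to be changed together if either is edited.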
|
Unfortunately, somebody found a mistake in the training code and informed me via email. I checked the quality of the model, and it sounds worse than the demo because of the mistake (wrong reference audio). I have fixed the mistake but I have to retrain the model from scratch. Now expect the model to be released by mid-November. Sorry for the delay. I believe the current code should produce working models now. |
The current model quality is not bad though, so if you need the model now, you can download it here: https://drive.google.com/drive/folders/1ApqjyugCzr4EN2NFXa5Opfr3qcoapUPV?usp=sharing, but I can probably get a better model in a couple of weeks. You only need to change the following code to run the inference:

def compute_style(path):
    # load the reference audio at 24 kHz and trim leading/trailing silence
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        # keyword arguments are required by recent librosa releases
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)

    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))

    # concatenate the acoustic style and the prosodic style
    return torch.cat([ref_s, ref_p], dim=1)

reference = "Demo/1221-135767-0014.wav"
ref_s = compute_style(reference)

with torch.no_grad():
    input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
    text_mask = length_to_mask(input_lengths).to(device)

    t_en = model.text_encoder(tokens, input_lengths, text_mask)
    bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
    d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

    # sample a style vector conditioned on the text and the reference style
    s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                     embedding=bert_dur,
                     embedding_scale=1,
                     features=ref_s,  # reference from the same speaker as the embedding
                     num_steps=10).squeeze(1)

    s = s_pred[:, 128:]
    ref = s_pred[:, :128]

    alpha = 0.3  # how much you want to mix the sampled style with the original style (acoustic part)
    beta = 0.7   # how much you want to mix the sampled style with the original style (prosodic part)
    ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
    s = beta * s + (1 - beta) * ref_s[:, 128:]

    # predict the duration of each phoneme
    d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)
    x, _ = model.predictor.lstm(d)
    duration = model.predictor.duration_proj(x)
    duration = torch.sigmoid(duration).sum(axis=-1)
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)

    # build the monotonic alignment matrix from the predicted durations
    pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
    c_frame = 0
    for i in range(pred_aln_trg.size(0)):
        pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
        c_frame += int(pred_dur[i].data)

    # encode prosody
    en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
    if model_params.decoder.type == "hifigan":  # fix weird misalignment for hifigan decoder
        asr_new = torch.zeros_like(en)
        asr_new[:, :, 0] = en[:, :, 0]
        asr_new[:, :, 1:] = en[:, :, 0:-1]
        en = asr_new

    F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

    asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
    if model_params.decoder.type == "hifigan":  # fix weird misalignment for hifigan decoder
        asr_new = torch.zeros_like(asr)
        asr_new[:, :, 0] = asr[:, :, 0]
        asr_new[:, :, 1:] = asr[:, :, 0:-1]
        asr = asr_new

    out = model.decoder(asr, F0_pred, N_pred, ref.squeeze().unsqueeze(0))

A formal inference demo, including reproducing the audio on the demo page, will come later once the better model is done. |
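
The loop that fills `pred_aln_trg` in the snippet above turns per-phoneme durations into a 0/1 alignment matrix, and the matrix product then repeats each phoneme's features for as many frames as its duration. The following is a minimal pure-Python sketch of that idea on toy data; the helper names `durations_to_alignment` and `expand` are hypothetical and do not exist in the repository.

```python
def durations_to_alignment(durations):
    """Build a (tokens x frames) 0/1 matrix in which token i covers
    durations[i] consecutive frames, in order."""
    total_frames = sum(durations)
    alignment = [[0] * total_frames for _ in durations]
    start = 0
    for i, dur in enumerate(durations):
        for t in range(start, start + dur):
            alignment[i][t] = 1
        start += dur
    return alignment


def expand(features, alignment):
    """Multiply (channels x tokens) features by the (tokens x frames)
    alignment, giving (channels x frames) frame-level features."""
    n_tokens = len(alignment)
    n_frames = len(alignment[0])
    return [
        [sum(row[i] * alignment[i][t] for i in range(n_tokens))
         for t in range(n_frames)]
        for row in features
    ]


# Toy example: three phonemes lasting 2, 1 and 3 frames, one feature channel.
aln = durations_to_alignment([2, 1, 3])
frames = expand([[10, 20, 30]], aln)
print(frames)
```

Each row of the alignment has a single contiguous run of ones, so the product simply copies the feature of each phoneme across its frames.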
I tested it on the colab and it works, so if you want to try it now you can use this link: https://colab.research.google.com/drive/1VENAg_TeKj5a1NYMJTSrbNLDlcIT30Sh |
@yl4579 |
@GUUser91 Since StyleTTS2 is already an end-to-end model, meaning it generates waveforms directly from the text, I don’t see any use of this codec anywhere unless we don’t do end-to-end training, which may degrade the quality (though it could be faster in training). |
@yl4579 Thanks for sharing the checkpoint. I'm now synthesizing speech with your model! 😀 However, I run into problems when I feed a very short reference audio to the style encoder, because of its fixed filter size. A simple trick that lets me run inference with a short reference is to replicate the audio before feeding it to the style encoder, which may resolve the issue. Could you recommend proper values of alpha and beta for LibriTTS samples?
In addition, I entirely agree that an audio codec is not required for your model. The audio quality of StyleTTS 2 is already better than recently proposed two-stage models such as Vall-E or NaturalSpeech 2 in terms of naturalness, and using an audio codec would decrease the audio quality. I also found a typo on your demo page: https://styletts2.github.io/#libri. I could not find these samples in LibriTTS; they seem to be from LibriSpeech, not LibriTTS. 😉 Thanks again |
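
The replication trick mentioned above can be sketched as follows. This is only an illustration in plain Python on a list of samples: the function name `repeat_to_min_length` and the minimum length passed to it are assumptions, not part of the StyleTTS 2 code, and the right threshold depends on the receptive field of the style encoder.

```python
def repeat_to_min_length(samples, min_len):
    """Tile a short clip until it has at least min_len samples,
    then cut it to exactly min_len. Longer clips are returned unchanged."""
    if not samples:
        raise ValueError("cannot repeat an empty clip")
    if len(samples) >= min_len:
        return list(samples)
    repeats = -(-min_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:min_len]


# Toy example: a 3-sample "clip" padded to 7 samples by repetition.
padded = repeat_to_min_length([1, 2, 3], 7)
print(padded)
```

With real audio, the same tiling would be applied to the waveform before computing the mel spectrogram of the reference.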
@sh-lee-prml Thanks for your appreciation of this work! As for your problem with inference using very short clips (less than one second), you probably have to repeat the reference until it reaches the minimum length, and this could lead to problems because there was no such data during training (clips shorter than 1 second were excluded). If you do need to do inference with very short references, you may have to retrain or fine-tune the model with shorter clips, possibly repeating them to accommodate the receptive field of the style encoder.
The alpha and beta are just factors that control diversity and similarity. The higher the alpha and beta, the closer the result is to the sampled style (and thus the less similar to the actual reference style), and vice versa. It depends on the use case, i.e., do you want more diverse samples for the same text, or do you want samples more similar to the reference? Values ranging from
The demo page indeed shows samples from LibriSpeech, because these were reference samples taken from the Vall-E and NaturalSpeech 2 demo pages. LibriTTS here refers to the model (i.e., the model trained on LibriTTS), not the testing dataset. I have marked this difference in the paper: Table 1 shows that the testing set for zero-shot experiments was LibriSpeech instead of LibriTTS. |
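
The effect of alpha and beta described above is a linear interpolation between the sampled style and the reference style, as in the lines `ref = alpha * ref + (1 - alpha) * ref_s[:, :128]` and `s = beta * s + (1 - beta) * ref_s[:, 128:]` of the inference snippet. Here is a minimal sketch on plain Python lists; `mix_styles` is a hypothetical helper used only for illustration.

```python
def mix_styles(sampled, reference, weight):
    """Interpolate element-wise between a sampled style vector and a
    reference style vector. weight=0 returns the reference style,
    weight=1 returns the sampled style."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [weight * s + (1.0 - weight) * r
            for s, r in zip(sampled, reference)]


sampled_style = [4.0, 8.0]
reference_style = [0.0, 0.0]

print(mix_styles(sampled_style, reference_style, 0.0))   # reference only
print(mix_styles(sampled_style, reference_style, 0.25))  # mostly reference
print(mix_styles(sampled_style, reference_style, 1.0))   # sampled only
```

A weight near 0 keeps the output close to the reference speaker's style, while a weight near 1 favors the diffusion sample and therefore more varied output.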
Since you are retraining... Would you be open to sharing the model weights in checkpoints instead of waiting for it to be fully trained? @yl4579 |
@pawngrubber There are multiple stages, and it is quite inconvenient to upload the checkpoints, as each one of them is around 2 GB. |
Hi, I just tried your new Colab and it works great. But I have a problem: when I changed the text to Chinese (after some Google searching I tried converting it to pinyin), the model can read it, but not well; I can't understand the output without looking at the text. Can you give me some tips for getting a better result? Thank you. |
@yl4579
|
As far as I know, this pretrained model does not support Chinese or pinyin; it only supports English phonemes. Hope this helps. |
I have started training by loading the stage 2 pretrained model. I found your mistake: this pretrained model is for stage 2, but your command is for stage 1. Hope this helps. |
@WendongGan |
I'm using the finetuned model from yl4579's StyleTTS2_libritts_debug.ipynb file. I get this error message after setting it to use the finetune model
I set the finetuned model settings to this:
Edit: Never mind again. I fixed the problem by reinstalling StyleTTS2 and fine-tuning the model again. The only thing I edited in yl4579's StyleTTS2_libritts_debug.ipynb file was the location of the .pth file. After that, I no longer got the error message.
|
@GUUser91 im having same issue trying to fine-tune |
i was just looking at that- it appears to demonstrate inference using reference audio, not fine tuning of the model |
@eschmidbauer
|
thanks @GUUser91 that is what i was looking for!! appreciate the help |
I think I now have a better model, and I will upload it to the repo. The quality is very close to the demo now. There are still some small weird issues at the end of the model for some samples (not sure what causes these), and I'm trying to investigate the issue so that maybe I can have a better model without these problems later on. |
Heya @yl4579. I'm a little confused, I thought you trained the model used for the StyleTTS2 Demo. Why are you retraining it if you already have the model used for the demo? I'm probably missing context or something. Thanks :) |
@gigadunk The reason is I want to test if the code in the repo is working. I want to reproduce the models I used for the paper with the cleaned code, as it can be a little different from the one I had for the experiments (with Jupyter notebooks). See #1 for more context. The quality is very similar now, only there's a weird pulse (only for some reference) at the end of the speech, which can be easily fixed with [:-50] (removing the last 50 samples). I believe this is a minor issue and may be caused by some preprocessing in meldataset.py that might be a little different from the one I used for LibriTTS dataset for the paper. |
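
The `[:-50]` workaround mentioned above simply drops the last 50 samples of the generated waveform. A small sketch with a plain Python list standing in for the audio array is shown below; `trim_tail` is a hypothetical helper name, and the guard for `n_tail == 0` is there because `samples[:-0]` would return an empty sequence.

```python
def trim_tail(samples, n_tail=50):
    """Return the waveform without its last n_tail samples."""
    if n_tail <= 0:
        return list(samples)
    return list(samples[:-n_tail])


sr = 24000                           # sampling rate used by the model
waveform = [0.0] * 1000              # stand-in for a generated waveform
trimmed = trim_tail(waveform)
removed_ms = 1000 * 50 / sr          # duration of the removed tail, in ms

print(len(trimmed), removed_ms)
```

At 24 kHz, 50 samples correspond to roughly 2 ms, so the trim removes the pulse without audibly shortening the speech.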
@yl4579 Thanks for the clarification :) I'm hyped to play around with the new model, when will it be on the repo? |
@gigadunk I'm making the demo now. It should be up today. |
I have pushed the demo notebook and uploaded the model. This issue should now be complete. If you find other problems with the model, please open new issues. |
could you share the checkpoint for fine-tuning? |
@eschmidbauer It’s in the README now. |
ok - i tried finetuning with the libritts model and i get a state missing error. Perhaps it's the config im using no longer works with that pretrained model |
Hello, thank you for sharing such interesting work!
May I know what your plans are for sharing the pretrained model trained on LibriTTS?
Thanks! :)