Add YourTTS VCTK recipe #2198

Edresson · 2022-12-08T13:49:59Z

No description provided.

WeberJulian

Thanks for doing this. I think adding a script that downloads and preps VCTK, and also computes the d_vectors, would make this more useful to users.

erogol · 2022-12-08T22:35:45Z

+1 for @WeberJulian's comment. You can use the downloader from https://github.com/coqui-ai/TTS/blob/dev/TTS/utils/downloaders.py

Edresson · 2022-12-09T21:23:47Z

> Thanks for doing this. I think adding a script that downloads and preps VCTK, and also computes the d_vectors, would make this more useful to users.

> +1 for @WeberJulian's comment. You can use the downloader from https://github.com/coqui-ai/TTS/blob/dev/TTS/utils/downloaders.py

Done. The recipe now automatically resamples the audio and computes the speaker embeddings :). In addition, I added all the parameters needed to enable multilingual training and the Speaker Consistency Loss (SCL), as in the paper. I guess after this recipe we will not have too many open issues about YourTTS anymore :).
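For readers unfamiliar with the Speaker Consistency Loss mentioned above, here is a minimal sketch of how I understand it from the YourTTS paper: the negative mean cosine similarity between the speaker embeddings of the ground-truth and the generated audio, scaled by a weight. The function names, the plain-list embeddings, and the default `alpha` are assumptions for illustration only; this is not the code used in the recipe.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def speaker_consistency_loss(gt_embeddings, gen_embeddings, alpha=1.0):
    """Negative mean cosine similarity over a batch, scaled by alpha.

    gt_embeddings / gen_embeddings: lists of speaker-embedding vectors
    (hypothetical plain lists here; a real model would use tensors).
    alpha: loss weight (hypothetical default).
    """
    sims = [cosine_similarity(g, h) for g, h in zip(gt_embeddings, gen_embeddings)]
    return -alpha * sum(sims) / len(sims)
```

The loss is lowest when the generated audio's embedding points in the same direction as the ground-truth one, which is what pushes the model toward preserving speaker identity.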

Edresson · 2022-12-09T21:25:40Z

@erogol Do you have any idea why the text unit test is broken? I didn't change anything that affects this part of the code.

WeberJulian

Great, now reproducing YourTTS is only a single line away :)

Edresson · 2022-12-12T13:39:04Z

> @erogol Do you have any idea why the text unit test is broken? I didn't change anything that affects this part of the code.

I rebased it and everything works fine :). @erogol, I think we can merge it.

iamkhalidbashir · 2022-12-22T06:39:09Z

Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth
gives the log output "Model restored from step 0".
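For context, the "Layer missing in the model definition" and "Layer dimention missmatch" lines in the log below are printed during partial model initialization. The following is a rough sketch of what such a step typically does, assuming the checkpoint and the model can both be described as name-to-shape mappings. The function name and the example shapes are hypothetical, and this is not the TTS implementation.

```python
def partial_init(model_shapes, checkpoint_shapes):
    """Split checkpoint entries into restorable, missing, and mismatched.

    model_shapes / checkpoint_shapes: dicts mapping layer names to shape
    tuples (hypothetical stand-ins for real state dicts).
    """
    restored, missing, mismatched = {}, [], []
    for name, shape in checkpoint_shapes.items():
        if name not in model_shapes:
            # Checkpoint has a layer the current model does not define.
            missing.append(name)
        elif model_shapes[name] != shape:
            # Same layer name, but the configured size differs.
            mismatched.append(name)
        else:
            restored[name] = shape
    return restored, missing, mismatched


# Hypothetical shapes, chosen only to trigger each branch.
model = {"text_encoder.emb.weight": (100, 192), "decoder.conv.weight": (4, 4)}
checkpoint = {
    "text_encoder.emb.weight": (100, 196),
    "decoder.conv.weight": (4, 4),
    "emb_l.weight": (4, 4),
}
restored, missing, mismatched = partial_init(model, checkpoint)
print(len(restored), "restored; missing:", missing, "mismatched:", mismatched)
```

Under this reading, every skipped layer keeps its freshly initialized weights, so a long list of such messages suggests the model definition in the config differs from the one the checkpoint was trained with.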

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters
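
The mismatch lines above, followed by "724 / 896 layers are restored", suggest a partial restore: layers whose shapes differ between the model definition and the checkpoint are skipped. A minimal sketch of that kind of filtering is below. The function name, the layer name `flow.some_layer.weight`, and all shapes are made up for illustration; this is not the TTS library's actual restore code.

```python
def split_restorable(model_shapes, checkpoint_shapes):
    """Split layer names into those that can be restored and those skipped.

    Both arguments map a layer name to its shape (a tuple of ints).
    A layer is restorable only if the checkpoint has it with the same shape.
    """
    restored, skipped = [], []
    for name, shape in model_shapes.items():
        if checkpoint_shapes.get(name) == shape:
            restored.append(name)
        else:
            skipped.append(name)
    return restored, skipped


# Toy shapes, invented for illustration only.
model_shapes = {
    "text_encoder.proj.weight": (384, 192, 1),
    "duration_predictor.pre.weight": (192, 192, 1),
    "flow.some_layer.weight": (192, 192, 5),
}
checkpoint_shapes = {
    "text_encoder.proj.weight": (384, 196, 1),
    "duration_predictor.pre.weight": (192, 196, 1),
    "flow.some_layer.weight": (192, 192, 5),
}

restored, skipped = split_restorable(model_shapes, checkpoint_shapes)
print(f"{len(restored)} / {len(model_shapes)} layers are restored.")
for name in skipped:
    print(f"skipped: {name}")
```

Comparing the shapes reported for the skipped layers against the config (for example `hidden_channels`) is probably the quickest way to see which setting differs from the released checkpoint.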

Also, when I run:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

to test, I get:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Somehow, inference cannot read the speaker embeddings.

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}

It might be because of a typo on line 114:

TTS/TTS/tts/utils/speakers.py

Lines 110 to 120 in 9e5a469

    
    if get_from_config_or_model_args_with_default(config, "use_d_vector_file", False):
        speaker_manager = SpeakerManager()
        if get_from_config_or_model_args_with_default(config, "speakers_file", None):
            speaker_manager = SpeakerManager(
                d_vectors_file_path=get_from_config_or_model_args_with_default(config, "speaker_file", None)
            )
        if get_from_config_or_model_args_with_default(config, "d_vector_file", None):
            speaker_manager = SpeakerManager(
                d_vectors_file_path=get_from_config_or_model_args_with_default(config, "d_vector_file", None)
            )
    return speaker_manager

Shouldn't it be speakers_file instead of speaker_file?
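
To illustrate why the misspelled key matters: a lookup with a default silently returns the default when the key does not exist, so no error is raised. The sketch below uses a hypothetical stand-in helper (`get_with_default`) and a toy config; it is not the real `get_from_config_or_model_args_with_default`, and whether this alone explains the empty speaker list is not confirmed here.

```python
def get_with_default(config, key, default=None):
    """Hypothetical stand-in: look up `key`, falling back to `default`."""
    return config.get(key, default)


# Toy config with only the correctly spelled key present.
config = {
    "use_d_vector_file": True,
    "speakers_file": "output/speakers.pth",
}

# The guard checks "speakers_file" (present), but the value is then read
# from "speaker_file" (absent), so the path that gets passed on is None.
guard = get_with_default(config, "speakers_file")
misspelled = get_with_default(config, "speaker_file")

print(f"guard value:      {guard!r}")
print(f"misspelled value: {misspelled!r}")
```

With the correct key, both lookups would return the same path.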

Also, after disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding, I get this error:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'
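
The traceback shows the check indexing `state["model"]["emb_g.weight"]` directly, which raises `KeyError` when the checkpoint was saved without a speaker embedding layer (as would be expected for a model trained with d-vectors). A minimal sketch of a guarded comparison is below; the function name, the status strings, and the shapes are hypothetical, and this is not a proposed patch to `vits.py`.

```python
def compare_embedding(model_shapes, checkpoint_model_state, key="emb_g.weight"):
    """Compare one layer's shape between model and checkpoint without raising.

    Returns "missing" if the checkpoint lacks the key, "mismatch" if the
    shapes differ, and "ok" otherwise. Shapes are plain tuples here.
    """
    if key not in checkpoint_model_state:
        return "missing"
    if checkpoint_model_state[key] != model_shapes[key]:
        return "mismatch"
    return "ok"


# Toy shapes, invented for illustration only.
model_shapes = {"emb_g.weight": (109, 256)}

# A checkpoint trained with d-vectors would have no "emb_g.weight" entry.
print(compare_embedding(model_shapes, {}))
print(compare_embedding(model_shapes, {"emb_g.weight": (10, 256)}))
print(compare_embedding(model_shapes, {"emb_g.weight": (109, 256)}))
```

Even with such a guard, a checkpoint trained with `use_d_vector_file` presumably cannot supply speaker embedding weights, so switching the flags only at inference time is unlikely to work.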

Guys @erogol @Edresson
Am I doing something wrong or should I create an issue?

iamkhalidbashir · 2022-12-22T08:01:47Z

Btw, here is my speakers file. It seems to contain the d-vectors of all speakers:
speakers.zip

iamkhalidbashir · 2022-12-22T08:03:14Z

Also, when restoring from /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth for 22kHz sample files,
I do get Model restored from step 1000000, but the rest of the inference errors are the same.

iamkhalidbashir · 2022-12-22T09:12:14Z

I made two hacks in the codebase to make it work.
First, I fixed the typo and changed speaker_file to speakers_file on line 114:

TTS/TTS/tts/utils/speakers.py

Lines 110 to 120 in 9e5a469

    
    if get_from_config_or_model_args_with_default(config, "use_d_vector_file", False):
        speaker_manager = SpeakerManager()
        if get_from_config_or_model_args_with_default(config, "speakers_file", None):
            speaker_manager = SpeakerManager(
                d_vectors_file_path=get_from_config_or_model_args_with_default(config, "speaker_file", None)
            )
        if get_from_config_or_model_args_with_default(config, "d_vector_file", None):
            speaker_manager = SpeakerManager(
                d_vectors_file_path=get_from_config_or_model_args_with_default(config, "d_vector_file", None)
            )
    return speaker_manager
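
To illustrate why the misspelled key matters: a lookup with a default does not fail on an unknown key, it silently returns the default, so the manager ends up being built without a d-vector file. The sketch below uses a simplified stand-in (`get_with_default`, hypothetical) rather than the real `get_from_config_or_model_args_with_default`, and the path is a placeholder:

```python
def get_with_default(config, key, default=None):
    """Simplified stand-in for a config lookup with a fallback value."""
    return config.get(key, default)


# Illustrative config; the path is a placeholder.
config = {"use_d_vector_file": True, "speakers_file": "/path/to/speakers.pth"}

# Misspelled key: silently falls back to the default.
wrong = get_with_default(config, "speaker_file", None)
# Correct key: returns the configured path.
right = get_with_default(config, "speakers_file", None)

print(wrong)
print(right)
```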

Second, I changed the save_ids_to_file call on line 419 to save_embeddings_to_file:

TTS/TTS/tts/models/base_tts.py

Lines 415 to 426 in a9167cf

    
    def on_init_start(self, trainer):
        """Save the speaker.pth and language_ids.json at the beginning of the training. Also update both paths."""
        if self.speaker_manager is not None:
            output_path = os.path.join(trainer.output_path, "speakers.pth")
            self.speaker_manager.save_ids_to_file(output_path)
            trainer.config.speakers_file = output_path
            # some models don't have `model_args` set
            if hasattr(trainer.config, "model_args"):
                trainer.config.model_args.speakers_file = output_path
            trainer.config.save_json(os.path.join(trainer.output_path, "config.json"))
            print(f" > `speakers.pth` is saved to {output_path}.")
            print(" > `speakers_file` is updated in the config.json.")

But is there a better way to do the second hack?
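
One possible alternative to replacing the call unconditionally, sketched under assumptions: pick the save method based on whether d-vectors are in use, so that ID-based setups keep their current behavior. `StubSpeakerManager` and `save_speakers` are hypothetical names used only for this illustration; only the two method names come from the snippets above.

```python
class StubSpeakerManager:
    """Hypothetical stand-in that records which save method was called."""

    def __init__(self):
        self.calls = []

    def save_ids_to_file(self, path):
        self.calls.append(("ids", path))

    def save_embeddings_to_file(self, path):
        self.calls.append(("embeddings", path))


def save_speakers(manager, output_path, use_d_vector_file):
    """Save embeddings when d-vectors are in use, otherwise save the ID mapping."""
    if use_d_vector_file:
        manager.save_embeddings_to_file(output_path)
    else:
        manager.save_ids_to_file(output_path)


manager = StubSpeakerManager()
save_speakers(manager, "speakers.pth", use_d_vector_file=True)
save_speakers(manager, "speakers.pth", use_d_vector_file=False)
print(manager.calls)
```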

iamkhalidbashir · 2022-12-22T10:20:28Z

Created a PR for the fix: #2234
It does not fix the "Model restored from step 0" bug when the multilingual vits vctk checkpoint is used as the restore path
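
For context on the "Layer missing in the model definition" and "Layer dimention missmatch" lines in the restore log: partial initialization generally keeps only the checkpoint entries whose name and shape match the current model. The sketch below illustrates that idea with plain dictionaries mapping layer names to shape tuples. It is not the library's implementation; `partial_init` is a hypothetical name, the shapes are made up, and the layer `dec.conv_pre.weight` is a made-up example of an entry that matches.

```python
def partial_init(model_state, checkpoint_state):
    """Keep only checkpoint entries whose name and shape match the model.

    Both arguments map layer names to shape tuples (stand-ins for tensors).
    Returns (kept, missing_in_model, shape_mismatch).
    """
    kept, missing, mismatch = {}, [], []
    for name, shape in checkpoint_state.items():
        if name not in model_state:
            missing.append(name)
        elif model_state[name] != shape:
            mismatch.append(name)
        else:
            kept[name] = shape
    return kept, missing, mismatch


# Illustrative shapes only; the first two layer names are taken from the log.
model_state = {
    "text_encoder.encoder.attn_layers.0.conv_q.weight": (192, 192, 1),
    "dec.conv_pre.weight": (512, 192, 7),
}
checkpoint_state = {
    "emb_l.weight": (4, 4),
    "text_encoder.encoder.attn_layers.0.conv_q.weight": (196, 196, 1),
    "dec.conv_pre.weight": (512, 192, 7),
}

kept, missing, mismatch = partial_init(model_state, checkpoint_state)
print(sorted(kept), missing, mismatch)
```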

Edresson · 2022-12-22T11:01:40Z

Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth gives a log output of Model restored from step 0

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters
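
One way to see why only 724 of 896 layers were restored is to compare tensor shapes between the checkpoint and the freshly built model. The sketch below is a hypothetical helper, not part of the TTS codebase. It works on plain name-to-shape dicts so it runs without PyTorch, and the example shapes are invented for illustration.

```python
def diff_shapes(model_shapes, ckpt_shapes):
    """Split checkpoint entries into restorable, mismatched and unknown.

    Both arguments map a parameter name to its shape (a tuple of ints).
    """
    restorable, mismatched, unknown = [], [], []
    for name, ckpt_shape in ckpt_shapes.items():
        if name not in model_shapes:
            # Corresponds to "Layer missing in the model definition".
            unknown.append(name)
        elif tuple(model_shapes[name]) != tuple(ckpt_shape):
            # Corresponds to a dimension mismatch between model and checkpoint.
            mismatched.append((name, tuple(ckpt_shape), tuple(model_shapes[name])))
        else:
            restorable.append(name)
    return restorable, mismatched, unknown


# Invented shapes, for illustration only.
model_shapes = {
    "text_encoder.proj.weight": (384, 192, 1),
    "dec.conv_pre.bias": (512,),
}
ckpt_shapes = {
    "text_encoder.proj.weight": (384, 196, 1),
    "dec.conv_pre.bias": (512,),
    "speaker_encoder.conv1.weight": (32, 1, 3, 3),
}

restorable, mismatched, unknown = diff_shapes(model_shapes, ckpt_shapes)
for name, ckpt_shape, model_shape in mismatched:
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With PyTorch available, the two dicts could be built with `{k: tuple(v.shape) for k, v in state_dict.items()}`, assuming the checkpoint stores its weights under a `"model"` key as the traceback further down suggests.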

Also, when I run:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

to test it, I get:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Somehow, inference cannot read the speaker embeddings.

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}
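
The speaker settings appear twice in this config, once at the top level and once under `model_args`. A minimal sketch for checking that the two copies agree, assuming the config is loaded as a plain dict; `speaker_settings` and `inconsistent_keys` are hypothetical helpers, not TTS functions, and the excerpt is trimmed from the config above.

```python
import json

# Speaker-related keys that the config above stores both at the top level
# and inside "model_args".
SPEAKER_KEYS = (
    "use_speaker_embedding",
    "num_speakers",
    "speakers_file",
    "use_d_vector_file",
    "d_vector_file",
    "d_vector_dim",
)


def speaker_settings(config):
    """Collect each speaker key from the top level and from model_args."""
    model_args = config.get("model_args", {})
    return {
        key: {"top_level": config.get(key), "model_args": model_args.get(key)}
        for key in SPEAKER_KEYS
    }


def inconsistent_keys(config):
    """Return the speaker keys whose two copies differ."""
    return [
        key
        for key, values in speaker_settings(config).items()
        if values["top_level"] != values["model_args"]
    ]


# Trimmed excerpt of the config above. Python's json.loads accepts the
# non-standard Infinity literal by default.
excerpt = json.loads("""
{
    "max_text_len": Infinity,
    "model_args": {
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "use_d_vector_file": true,
        "d_vector_file": ["/workspace/project/VCTK/speakers.pth"],
        "d_vector_dim": 512
    },
    "use_speaker_embedding": false,
    "num_speakers": 0,
    "use_d_vector_file": true,
    "d_vector_file": ["/workspace/project/VCTK/speakers.pth"],
    "d_vector_dim": 512
}
""")
print(inconsistent_keys(excerpt))
```

This only compares the two copies; it does not tell which copy a given part of the library reads first.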

It might be because of a typo on line 114:

TTS/TTS/tts/utils/speakers.py

Lines 110 to 120 in 9e5a469

    if get_from_config_or_model_args_with_default(config, "use_d_vector_file", False):
        speaker_manager = SpeakerManager()
        if get_from_config_or_model_args_with_default(config, "speakers_file", None):
            speaker_manager = SpeakerManager(
                d_vectors_file_path=get_from_config_or_model_args_with_default(config, "speaker_file", None)
            )
        if get_from_config_or_model_args_with_default(config, "d_vector_file", None):
            speaker_manager = SpeakerManager(
                d_vectors_file_path=get_from_config_or_model_args_with_default(config, "d_vector_file", None)
            )
    return speaker_manager

Shouldn't it be speakers_file instead of speaker_file?
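
To illustrate what a misspelled key does in a lookup with a default, here is a minimal sketch. `lookup_with_default` is a hypothetical stand-in written for this example on plain dicts, not the library's `get_from_config_or_model_args_with_default`, and the file name is a placeholder.

```python
def lookup_with_default(config, key, default):
    """Hypothetical stand-in: check model_args first, then the top level."""
    model_args = config.get("model_args", {})
    if key in model_args:
        return model_args[key]
    return config.get(key, default)


# Placeholder path, for illustration only.
config = {"model_args": {"speakers_file": "speakers.pth"}}

# The misspelled key is absent, so the default comes back silently.
misspelled = lookup_with_default(config, "speaker_file", None)
correct = lookup_with_default(config, "speakers_file", None)
print(misspelled, correct)
```

Note that in the quoted snippet the later `d_vector_file` branch reassigns `speaker_manager`, so when `d_vector_file` is set, as in the config above, the misspelled key may not be what leaves the speaker list empty.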
Also, after disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding, I get this error:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'
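
The traceback shows `load_checkpoint` indexing `state["model"]["emb_g.weight"]` whenever the model defines `emb_g`. A checkpoint trained with `use_speaker_embedding` set to false, as in the config above, presumably contains no such tensor. The sketch below is a hypothetical pre-check on a plain dict, not the library's code, that reports the disagreement instead of raising `KeyError`.

```python
def emb_g_status(state, model_has_emb_g):
    """Describe whether checkpoint and model agree on the emb_g layer.

    `state` is the loaded checkpoint dict; weights are assumed to live
    under state["model"], as in the traceback.
    """
    in_checkpoint = "emb_g.weight" in state.get("model", {})
    if model_has_emb_g and not in_checkpoint:
        return "model expects emb_g but the checkpoint has none"
    if in_checkpoint and not model_has_emb_g:
        return "checkpoint has emb_g but the model does not define it"
    return "consistent"


# Invented checkpoint contents, for illustration only.
dvector_checkpoint = {"model": {"text_encoder.proj.weight": None}}
status = emb_g_status(dvector_checkpoint, model_has_emb_g=True)
print(status)
```

If this is indeed the cause, the inference config needs to keep the same speaker-conditioning flags that were used for training.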

@erogol @Edresson Am I doing something wrong, or should I create an issue?

Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth Gives a log output of Model restored from step 0

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters
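
For readers puzzling over the "Layer dimention missmatch" lines above, here is a minimal sketch of how a partial restore of this kind can work: only checkpoint entries whose shape matches the current model definition are kept, and the rest are reported. This is an illustration, not the TTS library's actual restore code. The function name `filter_restorable` and all shapes below are hypothetical. One possible cause of such mismatches is that the `model_args` in the training config differ from those the checkpoint was trained with.

```python
def filter_restorable(model_shapes, checkpoint_shapes):
    """Split checkpoint entries into restorable and mismatched ones by shape.

    Both arguments map a parameter name to its shape tuple. This is a
    hypothetical helper for illustration, not a function from the TTS code base.
    """
    restorable, mismatched = {}, []
    for name, shape in checkpoint_shapes.items():
        if name not in model_shapes:
            continue  # parameter does not exist in the current model definition
        if model_shapes[name] == shape:
            restorable[name] = shape
        else:
            mismatched.append(name)
    return restorable, mismatched


# Made-up shapes: two parameters differ between the model and the checkpoint.
model_shapes = {
    "text_encoder.proj.weight": (384, 196, 1),
    "duration_predictor.pre.weight": (192, 196, 3),
    "flow.pre.weight": (192, 96, 1),
}
checkpoint_shapes = {
    "text_encoder.proj.weight": (384, 192, 1),
    "duration_predictor.pre.weight": (192, 192, 3),
    "flow.pre.weight": (192, 96, 1),
}

restorable, mismatched = filter_restorable(model_shapes, checkpoint_shapes)
for name in mismatched:
    print(f"Shape mismatch between model definition and checkpoint: {name}")
print(f"{len(restorable)} / {len(model_shapes)} layers are restored.")
```

Parameters that are skipped this way keep their fresh initialization, so a long list of mismatches means a large part of the model starts from scratch.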

Also, when I run:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

to test, I get:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Somehow, inference cannot read the speaker embeddings.

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}

It might be because of a typo on line 114:

TTS/TTS/tts/utils/speakers.py

Lines 110 to 120 in 9e5a469

    
    if get_from_config_or_model_args_with_default(config, "use_d_vector_file", False):
        speaker_manager = SpeakerManager()
        if get_from_config_or_model_args_with_default(config, "speakers_file", None):
            speaker_manager = SpeakerManager(
                d_vectors_file_path=get_from_config_or_model_args_with_default(config, "speaker_file", None)
            )
        if get_from_config_or_model_args_with_default(config, "d_vector_file", None):
            speaker_manager = SpeakerManager(
                d_vectors_file_path=get_from_config_or_model_args_with_default(config, "d_vector_file", None)
            )
    return speaker_manager

Shouldn't it be speakers_file instead of speaker_file?
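
To illustrate why the misspelled key would matter, here is a minimal sketch with a simplified stand-in for the lookup helper. `get_with_default` is hypothetical and works on plain dicts; it is not the library's `get_from_config_or_model_args_with_default`, and the lookup order is an assumption. The point is only that a key that does not exist silently falls back to the default.

```python
def get_with_default(config, key, default=None):
    """Look up `key` in config["model_args"] first, then at the top level.

    Hypothetical stand-in for illustration; the real helper may behave differently.
    """
    model_args = config.get("model_args", {})
    if key in model_args:
        return model_args[key]
    return config.get(key, default)


# A trimmed-down config using the key names from the config posted above.
config = {
    "speakers_file": "/workspace/project/output/run/speakers.pth",
    "model_args": {"use_d_vector_file": True},
}

wrong = get_with_default(config, "speaker_file")   # misspelled key
right = get_with_default(config, "speakers_file")  # key that exists in the config

print("speaker_file  ->", wrong)
print("speakers_file ->", right)
```

With the misspelled key, the manager would be built with `d_vectors_file_path=None` even though the guarding `if` found a value under `speakers_file`.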
Also, after disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding, I get this error:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'
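
A possible reading of this traceback, offered as an assumption: the checkpoint was trained with d-vectors, so it has no `emb_g.weight` entry, while enabling `use_speaker_embedding` at inference makes the model create an `emb_g` layer. The sketch below shows a guarded version of the shape check using plain dicts and shape tuples. The function name `emb_g_shape_differs` is hypothetical, and this is not the fix that was applied in the library.

```python
def emb_g_shape_differs(state, model_emb_shape):
    """Return True only if the checkpoint holds an emb_g table of a different shape.

    `state` mimics a loaded checkpoint: {"model": {param_name: shape_tuple}}.
    Hypothetical helper; shapes stand in for tensor shapes.
    """
    saved_shape = state.get("model", {}).get("emb_g.weight")
    if saved_shape is None:
        # The checkpoint has no speaker-embedding table, so there is nothing to compare.
        return False
    return saved_shape != model_emb_shape


# Checkpoint trained with d-vectors: no "emb_g.weight" entry (made-up contents).
d_vector_checkpoint = {"model": {"text_encoder.proj.weight": (384, 192, 1)}}
# Checkpoint trained with a speaker-embedding table of a different size.
embedding_checkpoint = {"model": {"emb_g.weight": (100, 256)}}

print(emb_g_shape_differs(d_vector_checkpoint, (109, 256)))
print(emb_g_shape_differs(embedding_checkpoint, (109, 256)))
```

Whether a missing table should be ignored or reported as a config mismatch is a design choice; the sketch only avoids the `KeyError`.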

@erogol @Edresson, am I doing something wrong, or should I create an issue?

Hi @iamkhalidbashir,
Could you please create an issue with the information you have added here? This PR is not the proper place for it, and an issue will be more useful: even after we fix it, users on an older TTS version can still run into this problem.

iamkhalidbashir · 2022-12-22T11:42:00Z

Sure, I'll create a new issue.

* Add YourTTS VCTK recipe
* Fix lint
* Add compute_embeddings and resample_files functions to be able to reuse it
* Add automatic download and speaker embedding computation for YourTTS VCTK recipe
* Add parameter for eval metadata file on compute embeddings function

Edresson force-pushed the dev-yourtts-rec branch from 978fe08 to c63c073 Compare December 8, 2022 13:52

Edresson requested review from erogol and WeberJulian December 8, 2022 13:53

WeberJulian approved these changes Dec 8, 2022


Edresson force-pushed the dev-yourtts-rec branch from 93bf6cf to f771cd1 Compare December 9, 2022 21:19

Edresson requested a review from WeberJulian December 9, 2022 21:20

Edresson force-pushed the dev-yourtts-rec branch from 4d4c96d to 8ca3e78 Compare December 10, 2022 20:05

Edresson added 5 commits December 12, 2022 09:23

Add YourTTS VCTK recipe

e87bbde

Fix lint

5d925ea

Add compute_embeddings and resample_files functions to be able to reuse it

0bf2746

Add automatic download and speaker embedding computation for YourTTS VCTK recipe

d7c2a8e

Add parameter for eval metadata file on compute embeddings function

a066e14

Edresson force-pushed the dev-yourtts-rec branch from 8ca3e78 to a066e14 Compare December 12, 2022 12:23

WeberJulian approved these changes Dec 12, 2022


erogol merged commit 3b1a28f into dev Dec 12, 2022

erogol deleted the dev-yourtts-rec branch December 12, 2022 15:14

iamkhalidbashir mentioned this pull request Dec 22, 2022

Fixed bug related to yourtts speaker embeddings issue #2234

Merged
