Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add YourTTS VCTK recipe #2198

Merged
merged 5 commits into from
Dec 12, 2022
Merged

Add YourTTS VCTK recipe #2198

merged 5 commits into from
Dec 12, 2022

Conversation

Edresson
Copy link
Contributor

@Edresson Edresson commented Dec 8, 2022

No description provided.

Copy link
Contributor

@WeberJulian WeberJulian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, I think adding a script that downloads and prep VCTK, and also compute the d_vectors could help make this more useful to the users.

@erogol
Copy link
Member

erogol commented Dec 8, 2022

+1 for @WeberJulian's comment. You can use the downloader from https://github.com/coqui-ai/TTS/blob/dev/TTS/utils/downloaders.py

@Edresson
Copy link
Contributor Author

Edresson commented Dec 9, 2022

Thanks for doing this, I think adding a script that downloads and prep VCTK, and also compute the d_vectors could help make this more useful to the users.

+1 for @WeberJulian's comment. You can use the downloader from https://github.com/coqui-ai/TTS/blob/dev/TTS/utils/downloaders.py

Done I automatically resampled the audio and computed the speaker embeddings on the recipe :). In addition, I added all the useful parameters to enable multilingual training and Speaker Consistency Loss (SCL) like the paper. I guess after this recipe we will not have too many open issues about YourTTS anymore :).

@Edresson
Copy link
Contributor Author

Edresson commented Dec 9, 2022

@erogol Do you have any idea why the text unit test is broken? I did change nothing that affect this part of the code.

Copy link
Contributor

@WeberJulian WeberJulian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, now reproducing YourTTS is only a single line away :)

@Edresson
Copy link
Contributor Author

Edresson commented Dec 12, 2022

@erogol Do you have any idea why the text unit test is broken? I did change nothing that affect this part of the code.

I rebased it and everything works fine :). @erogol I think we can merge it

@erogol erogol merged commit 3b1a28f into dev Dec 12, 2022
@erogol erogol deleted the dev-yourtts-rec branch December 12, 2022 15:14
@iamkhalidbashir
Copy link
Contributor

Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth
Gives a log output of Model restored from step 0

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters

Also When I run:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

to test then I get

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Some how the interference cannot read speaker embeddings

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}

It might be because of a typo on line #114:-

if get_from_config_or_model_args_with_default(config, "use_d_vector_file", False):
speaker_manager = SpeakerManager()
if get_from_config_or_model_args_with_default(config, "speakers_file", None):
speaker_manager = SpeakerManager(
d_vectors_file_path=get_from_config_or_model_args_with_default(config, "speaker_file", None)
)
if get_from_config_or_model_args_with_default(config, "d_vector_file", None):
speaker_manager = SpeakerManager(
d_vectors_file_path=get_from_config_or_model_args_with_default(config, "d_vector_file", None)
)
return speaker_manager

Where it should be speakers_file instead of speaker_file?

Also, After disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding
I get this error:-

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'

Guys @erogol @Edresson
Am I doing something wrong or should I create an issue?

@iamkhalidbashir
Copy link
Contributor

Btw here is my speakers file, it seems that it has d vectors of all speakers in it
speakers.zip

@iamkhalidbashir
Copy link
Contributor

iamkhalidbashir commented Dec 22, 2022

Also when restoring from /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth for 22kHz sample files
I do get Model restored from step 1000000 but rest of the interference errors are the same

@iamkhalidbashir
Copy link
Contributor

2 hacks I made in codebase to made it work:
First, I fixed the type and changed speaker_file to speakers_file on line 114:

if get_from_config_or_model_args_with_default(config, "use_d_vector_file", False):
speaker_manager = SpeakerManager()
if get_from_config_or_model_args_with_default(config, "speakers_file", None):
speaker_manager = SpeakerManager(
d_vectors_file_path=get_from_config_or_model_args_with_default(config, "speaker_file", None)
)
if get_from_config_or_model_args_with_default(config, "d_vector_file", None):
speaker_manager = SpeakerManager(
d_vectors_file_path=get_from_config_or_model_args_with_default(config, "d_vector_file", None)
)
return speaker_manager

Second, I changed save_ids_to_file method on line 419 to save_embeddings_to_file

def on_init_start(self, trainer):
"""Save the speaker.pth and language_ids.json at the beginning of the training. Also update both paths."""
if self.speaker_manager is not None:
output_path = os.path.join(trainer.output_path, "speakers.pth")
self.speaker_manager.save_ids_to_file(output_path)
trainer.config.speakers_file = output_path
# some models don't have `model_args` set
if hasattr(trainer.config, "model_args"):
trainer.config.model_args.speakers_file = output_path
trainer.config.save_json(os.path.join(trainer.output_path, "config.json"))
print(f" > `speakers.pth` is saved to {output_path}.")
print(" > `speakers_file` is updated in the config.json.")

But there could be a better way for the second hack?

@iamkhalidbashir
Copy link
Contributor

iamkhalidbashir commented Dec 22, 2022

Created PR for the fix: #2234
It does not solve the bug "Model restored from step 0" for multilingual vits vctk as a restore path

@Edresson
Copy link
Contributor Author

Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth Gives a log output of Model restored from step 0

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters

Also When I run:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

to test then I get

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Some how the interference cannot read speaker embeddings

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}

It might be because of a typo on line #114:-

if get_from_config_or_model_args_with_default(config, "use_d_vector_file", False):
speaker_manager = SpeakerManager()
if get_from_config_or_model_args_with_default(config, "speakers_file", None):
speaker_manager = SpeakerManager(
d_vectors_file_path=get_from_config_or_model_args_with_default(config, "speaker_file", None)
)
if get_from_config_or_model_args_with_default(config, "d_vector_file", None):
speaker_manager = SpeakerManager(
d_vectors_file_path=get_from_config_or_model_args_with_default(config, "d_vector_file", None)
)
return speaker_manager

Where it should be speakers_file instead of speaker_file?
Also, After disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding I get this error:-

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'

Guys @erogol @Edresson Am I doing something wrong or should I create an issue?

Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth Gives a log output of Model restored from step 0

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters

Also When I run:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

to test then I get

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Some how the interference cannot read speaker embeddings

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}

It might be because of a typo on line #114:-

if get_from_config_or_model_args_with_default(config, "use_d_vector_file", False):
speaker_manager = SpeakerManager()
if get_from_config_or_model_args_with_default(config, "speakers_file", None):
speaker_manager = SpeakerManager(
d_vectors_file_path=get_from_config_or_model_args_with_default(config, "speaker_file", None)
)
if get_from_config_or_model_args_with_default(config, "d_vector_file", None):
speaker_manager = SpeakerManager(
d_vectors_file_path=get_from_config_or_model_args_with_default(config, "d_vector_file", None)
)
return speaker_manager

Where it should be speakers_file instead of speaker_file?
Also, After disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding I get this error:-

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'

Guys @erogol @Edresson Am I doing something wrong or should I create an issue?

Hi @iamkhalidbashir,
Could you please create an issue with the information that you have added here? Here is not the proper place and it will be better because even after we fix it users can have this issue if they are using an old TTS version.

@iamkhalidbashir
Copy link
Contributor

Sure I'll create a new issue

shivammehta25 pushed a commit to shivammehta25/TTS that referenced this pull request Jan 5, 2023
* Add YourTTS VCTK recipe

* Fix lint

* Add compute_embeddings and resample_files functions to be able to reuse it

* Add automatic download and speaker embedding computation for YourTTS VCTK recipe

* Add parameter for eval metadata file on compute embeddings function
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants