Fine-Tuning with a small dataset #296
@OscarVanL Hi, great idea :D. I have some guidance for you on customizing a voice via fine-tuning below:
@ZDisket can you share some of your experience fine-tuning a voice from female -> male on your small dataset :D?
@OscarVanL @dathudeptrai
Wow, thank you both for the detailed replies. That's really helpful! @dathudeptrai Thank you for offering to make a PR to help train selected layers. @ZDisket It's great to hear your success even with a limited dataset. Fortunately we have much more than 80 seconds of audio even in the worst cases. Could you explain the idea of a universal vocoder to me? How is it possible to get a customised voice using a universal vocoder without fine-tuning? This is all very new to me, but very exciting.
@OscarVanL Conventional text2speech works with a text2mel model, which converts text to spectrograms, and a vocoder, which turns spectrograms into audio. Training a vocoder on many, many different voices can achieve a "universal vocoder" which can adapt to almost any speaker. I know the owner of vo.codes uses a (MelGAN) universal vocoder. You'll still have to fine-tune the text2mel though.
Thank you for the explanation. So my understanding is that I will have to train a FastSpeech2 text2mel model to create patient-specific mel spectrograms. This will involve me taking a LJSpeech pretrained model, then fine-tuning as described by @dathudeptrai with patient voice data. After this, are there pre-trained MelGAN Universal Vocoders available to download that have already been trained on many voices, or is this something I would need to do myself? Finally, are Universal Vocoders tied to a specific text2mel architecture (Tacotron, FastSpeech, etc), or can a Universal Vocoder take any mel spectrogram generated by any text2mel architecture?
There are three MelGANs: regular MelGAN (lowest quality), ditto + STFT loss (somewhat better), and Multi-Band (best quality and faster inference); you can hear the differences on the demo page. There's also ParallelWaveGAN, but it's too slow on CPU to consider. As for pretrained models, there are none trained natively with this repo on large multispeaker datasets (I have two trained on about 200 speakers, one 32KHz and the other 48KHz, but they don't work well outside of those speakers), but there are notebooks to convert trained models from kan-bayashi's repo: https://github.com/kan-bayashi/ParallelWaveGAN (which has a lot) to this one's format. I forgot where they were, so you'll have to ask @dathudeptrai.
A mel spectrogram is a mel spectrogram no matter where it comes from, so yes, as long as the text2mel and vocoder's data is processed the same (same normalization method, mel frequency range, etc).
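(A quick way to sanity-check that "processed the same" condition is to compare the audio/feature settings of the two preprocessing configs. A minimal sketch; the YAML key names follow the repo's preprocessing configs but are assumptions here, and the file paths are placeholders.)

    import yaml

    # Feature-extraction settings that must match between text2mel and vocoder.
    KEYS = ["sampling_rate", "hop_size", "fft_size", "num_mels", "fmin", "fmax"]

    with open("fastspeech2_preprocess.yaml") as f:   # hypothetical config paths
        text2mel_cfg = yaml.safe_load(f)
    with open("mb_melgan_preprocess.yaml") as f:
        vocoder_cfg = yaml.safe_load(f)

    for key in KEYS:
        if text2mel_cfg.get(key) != vocoder_cfg.get(key):
            print(f"mismatch on {key}: {text2mel_cfg.get(key)} vs {vocoder_cfg.get(key)}")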
Thank you once again for helping with my noob questions! I'll definitely check out that resource with trained models.
That's an interesting subject,
@OscarVanL I just made a PR for custom trainable layers here (#299). @Zak-SA You can try to train a universal vocoder, or load the weights from the pretrained model list and then train as normal (follow the README).
Amazing, thank you to both of you for going above and beyond to help! A few more questions, as I didn't see any documentation on preparing the dataset and I'm looking to prepare some data for fine-tuning. Do I need to strip punctuation from the text? E.g.: ()`';"- Are there any other similar cases I should consider when preparing the transcriptions? Does the audio filetype matter? I have 44100Hz Signed 16-bit PCM WAVs. (Edit: These files produced no errors during preprocessing/normalisation, but they should be mono, not stereo)
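(For later readers, a small sketch of the mono conversion mentioned in the edit above, assuming the soundfile package; the file paths are made up.)

    import soundfile as sf

    audio, sr = sf.read("recordings/sample_0001.wav")  # hypothetical path
    if audio.ndim > 1:                                  # stereo -> average channels to mono
        audio = audio.mean(axis=1)
    sf.write("mono/sample_0001.wav", audio, sr, subtype="PCM_16")
    print(f"sample rate: {sr} Hz, duration: {len(audio) / sr:.1f} s")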
Some early observations going through the steps in
Hi, I've begun fine-tuning with the guidance given by @dathudeptrai :) I've taken the LJSpeech pretrained model "fastspeech2.v1" to fine-tune. I took the
Here you can see the TensorBoard results for training the embedding layers... Using the
At 5000 steps: audio, spectrogram
At 15000 steps: audio, spectrogram
At 80000 steps: audio, spectrogram
Obviously, this sounds bad because I have only trained the embedding layers. I would now like to add some FC layers at the end, as you suggested, but am not sure how to do this. Based on my TensorBoard results, how many steps do you think I should tune the embedding layers before I stop and begin to train the FC layers? Do you advise making any changes to the hyperparameters in
@OscarVanL can you try to train the whole network?
@dathudeptrai Here's my tensorboard with 120k steps with
OK, I can see what the problem with your dataset is :D. I want you to try to train the model with
I do not know if it will work or not, because the pretrained model you are using is character-based; you should find a phoneme-based pretrained model, then you do not need to fine-tune the phoneme embeddings. @ZDisket do you have any FS2 phoneme pretrained model?
Ok I will try that now 👍 Thanks!
After that, maybe you should try a hop_size of 240 for 24k audio and try again. MFA uses a 10ms frame shift to calculate durations, so the hop_size should be 240 to match exactly the durations extracted from MFA; if we use 300 or 256, then we have to round the durations and they are no longer precise :D.
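(Spelling out that arithmetic, purely as an illustration:)

    sample_rate = 24000               # Hz, after downsampling
    mfa_frame_shift = 0.010           # MFA reports durations in 10 ms frames
    hop_size = int(sample_rate * mfa_frame_shift)
    print(hop_size)                   # 240 -> one MFA frame is exactly one hop
    # With hop_size 256 or 300, 10 ms is not a whole number of hops,
    # so the MFA durations have to be rounded and become slightly imprecise.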
I wanted to ask a question about mfa duration... My recordings are 44100Hz. For
I think it is 44100 but we may need to ask @machineko
Either method should work, you just need to later change the sample rate used for the calculation in preprocessing (downsampling first should work better, but the results shouldn't be noticeably different, as a small difference in durations shouldn't affect FS2 according to the paper).
1 vote for downsampling first :))) @OscarVanL
I agree. I think downsampling first will avoid any confusion or mistakes.
@dathudeptrai I have two phoneme LJSpeeches, 22KHz and (upsampled) 24KHz with LibriTTS preprocessing settings like in kan-bayashi's repo. But the phoneme IDs might differ
OK, I have downsampled to 24000Hz, redone all of the mfa extraction, preprocessing, normalisation, and changed
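(A rough sketch of that downsampling step, assuming librosa and soundfile are installed; the folder names are placeholders.)

    from pathlib import Path

    import librosa
    import soundfile as sf

    TARGET_SR = 24000
    src_dir, dst_dir = Path("wavs_44k"), Path("wavs_24k")  # hypothetical folders
    dst_dir.mkdir(exist_ok=True)

    for wav_path in src_dir.glob("*.wav"):
        # Resample to 24 kHz and force mono in one go.
        audio, _ = librosa.load(str(wav_path), sr=TARGET_SR, mono=True)
        sf.write(dst_dir / wav_path.name, audio, TARGET_SR, subtype="PCM_16")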
@dathudeptrai Here's my TensorBoard for that last attempt.
The model overfits too much. In this case, I think you should pretrain your model on the LibriTTS dataset, then you do not need to retrain the embedding layers. It seems that in your validation data there are many words/phonemes that the model has not seen in the training data (you can check this statement), which is why the validation loss increases while the training loss decreases.
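(One way to check that statement, as a quick sketch; it assumes the train/validation transcripts are plain text files with one utterance per line, and the file names are made up.)

    def token_set(path):
        with open(path, encoding="utf-8") as f:
            return {tok for line in f for tok in line.strip().split()}

    train_tokens = token_set("train_transcripts.txt")   # hypothetical files
    valid_tokens = token_set("valid_transcripts.txt")

    unseen = valid_tokens - train_tokens
    print(f"{len(unseen)} of {len(valid_tokens)} validation tokens never appear in training")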
Yes that would probably help. I will have a look at which British speaker corpuses are available. I see "M-AILABS Queen's English corpus", however the Queen's English likely does not represent how normal British people actually speak 😆 |
Hi there, quick question about your speech inference. Where do you pass in the speaker_id? I am going through the colab and fastspeech2 notebooks and I can't see reference to it anywhere... |
@GavinStein1

    mel_before, mel_outputs, duration_outputs, _, _ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32)
    )

See the part speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32); that number is changed to the speaker you desire. After processing your dataset (assuming you use LibriTTS), you will see
I see what you are referring to, however when processing libriTTS, this
Maybe you're looking in the wrong folder, the one in Instead, when you run the preprocessing stages, a new
Now I feel embarrassed for not seeing that earlier... Thank you
Happens to the best of us :)
@ronggong I tried tuning only these layers, but now the model does not converge. After 3 hours there is no improvement and the model still sounds American. (Red is the tuning job, grey is the model I am tuning from, at 110k steps.) I think I will retrain the LJSpeech model with a mixture of LibriTTS and some British speaker dataset (as machineko said); hopefully this will help it generalise to my British speaker better.
@machineko's suggestion to use more British speakers was a great one, because it led me to find the LibriVox accents table and British Readers on LibriVox. As LibriTTS is based on LibriVox, I found nearly all these speakers within the
Training on just these speakers did not give a great FS2 model, I think 17 hours may be too little, so I'll add in a 50:50 split of British and American LibriTTS speakers to match the good results I had with 34 hours of speech. Hopefully, this will make my speaker sound much better.
I am really happy with the British models I am getting from the dataset; I feel like I am getting some models I am really satisfied with now! 🤟 Fine-tuning the vocoder definitely helped reduce the buzzing but didn't eliminate it; at this point, I am happy with the results. It took 3 days 9 hours to reach 1M steps though 😴 Now to fine-tune it all over again with my British speakers 😅
@OscarVanL Well done! Great work. Are you in a position to share or publish your base British dataset, or trained model please?
@vocajon I probably should not share my model because of academic integrity (this is for a University project), but I was just writing a Blog post about compiling the British speaker corpus I used. It is based entirely on speakers taken from the LibriTTS dataset, so in theory, it should be open source but I must check this :) I will let you know once it's published 😄
Here's the blog post. I created a repo with my LibriTTS British dataset. I only used the
Edit: It looks like GitHub LFS is unsuitable for this purpose as it imposes bandwidth limits. I will have to look for alternatives.
Great, thank you very much. Just curious, the purpose of your work is for patients before losing their voices, right? What led you to exclude the Welsh/Scottish/Irish accents? Do you think it is unlikely any will appear as a patient in England? Or do you think it is better to maintain 4 models (if enough data can be found for the others) and tune using the closest model?
At the moment it's only a proof-of-concept phase. In regards to 1 vs 4 models, I suppose it would be a matter of experimenting to see what works. I have been training with a 50:50 split of English and American accents and it still allows me to clone a British voice well, so who knows, maybe a single mixed-accent model would work if there was enough of each accent in the training data.
Hi @OscarVanL, Can I ask what model you used for pretraining? And did you end up training all layers or just some specific ones? And what processor did you end up using for inference after your PR?
@GavinStein1 I trained all layers. I used this processor:
You could also use the
So if the processor is used only for the phoneme IDs, how do you know which speaker ID is yours when it is mixed in with the LibriTTS speakers? Also, how many speakers/hours of speech did you find worked best for you?
Edit: When I use that fastspeech.v1 model as a pretrained model, I cannot load weights on the following layers due to a mismatch in the number of weights:
Did you get this issue? And did you just ignore it as you wanted to retrain those layers anyway?
Sorry if my reply was confusing, I meant the Processor only uses the phoneme IDs from the mapper json (as the processor is used for mapping text to phoneme IDs). You will of course also need to check your mapping for the correct speaker ID at inference :) I used 120 speakers, with 17 hours of British speakers and 17 hours of American speakers. My results were best when I picked the top 100 speakers by duration in the
I got these errors loading the layers too, and I also asked about this problem but got no reply. I just ignored the error and the model trained fine, but maybe you should ask one of the maintainers about this.
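(A rough sketch of that speaker-selection step, assuming a LibriTTS-style layout with one folder of WAV files per speaker; the dataset path is a placeholder.)

    from pathlib import Path

    import soundfile as sf

    root = Path("libritts_subset")   # hypothetical dataset root

    def speaker_duration(speaker_dir):
        # Total audio length (seconds) across all of a speaker's WAV files.
        return sum(sf.info(str(wav)).duration for wav in speaker_dir.rglob("*.wav"))

    durations = {d.name: speaker_duration(d) for d in root.iterdir() if d.is_dir()}
    top_100 = sorted(durations, key=durations.get, reverse=True)[:100]
    print(top_100[:10])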
You can load the weights like this: model.load_weights(path, by_name=True, skip_mismatch=True). The f0/energy/duration predictors must be retrained.
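(Putting that together, a sketch of loading the character-based LJSpeech checkpoint into a new model while skipping mismatched layers; the config and checkpoint paths are placeholders, and the exact class/method names may differ between repo versions.)

    import yaml

    from tensorflow_tts.configs import FastSpeech2Config
    from tensorflow_tts.models import TFFastSpeech2

    with open("conf/fastspeech2.v1.yaml") as f:            # hypothetical config path
        config = FastSpeech2Config(**yaml.safe_load(f)["fastspeech2_params"])

    model = TFFastSpeech2(config=config)
    model._build()                                         # create variables before loading
    # Mismatched layers (embeddings, f0/energy/duration predictors) are skipped
    # and will be trained from scratch.
    model.load_weights("fastspeech2.v1-150000.h5", by_name=True, skip_mismatch=True)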
I'm closing this, as with all the help I received I managed to train a good model! 😄 To give a tl;dr of this thread, I found the best way to tune with a small speaker dataset is to merge it with a larger multi-speaker dataset (I used LibriTTS) and pass in the speaker ID at inference for the speaker I wished to clone. Some other tricks that helped:
Hi, thanks for this extremely useful thread. I am very new to TTS and want to train on a small dataset as discussed above. I had a few clarifications (they might be very basic/naive).
I wanted to know how you combine FastSpeech2 with MB-MelGAN: what is the default vocoder used in the examples, and how do I change it? Can you share a pipeline script if possible @OscarVanL? Thanks a lot.
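(For reference, a minimal sketch of chaining the two models; it uses the repo's AutoProcessor/TFAutoModel inference helpers and pretrained-model identifiers from the README, which are assumptions here rather than something confirmed in this thread.)

    import soundfile as sf
    import tensorflow as tf
    from tensorflow_tts.inference import AutoProcessor, TFAutoModel

    # Hypothetical pretrained identifiers; swap in your own fine-tuned checkpoints.
    processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

    input_ids = processor.text_to_sequence("Hello, this is a test.")
    _, mel_after, _, _, _ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )
    audio = mb_melgan.inference(mel_after)[0, :, 0]        # mel -> waveform
    sf.write("out.wav", audio.numpy(), 22050)              # sample rate must match training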
@OscarVanL thanks for this great thread! You mentioned the aim of your project was to deploy onto low-end hardware, did you end up doing this? If so, what method did you use and how many MB was the model in the end?
Hello!
I'm trying to evaluate ways to achieve TTS for individuals that have lost their ability to speak; the idea is to allow them to regain speech via TTS, but using the voice they had prior to losing it. This could happen from various causes such as cancer of the larynx, motor neurone disease, etc.
These patients have recorded voice banks, a small dataset of phrases recorded prior to losing their ability to speak.
Conceptually, I wanted to take a pre-trained model and fine-tune it with the individual's voice bank data.
I'd love some guidance.
There are a few constraints:
I'd love your guidance on the steps required to achieve this, and any recommendations on which choices would give good results...
Do you have any tutorials or examples that show how to achieve a customised voice via fine-tuning?