
Adding OverFlow #2183

Merged
merged 32 commits into from
Dec 12, 2022

Conversation

shivammehta25
Collaborator

@shivammehta25 shivammehta25 commented Dec 4, 2022

This is the model from the paper: https://arxiv.org/abs/2211.06892
Audio samples: https://shivammehta25.github.io/OverFlow/

@CLAassistant

CLAassistant commented Dec 4, 2022

CLA assistant check
All committers have signed the CLA.

@king-dahmanus

Good idea, from the samples I've heard the thing is quite good. What about adding neural HMM, or is it the same thing but just upgraded?

@shivammehta25
Collaborator Author

shivammehta25 commented Dec 5, 2022

It shares neural HMM as its core instead of attention. The benefits of neural HMM TTS are that it has almost half the number of parameters and it works very well even in a low-resource setting, i.e. when we don't have enough data to train on. Once we merge this, it would require very little change to add neural HMM TTS into the system, which I plan to do as well.

@king-dahmanus

That's nice. How fast is it on the CPU? I previously suggested improving its speed for screen readers, but I didn't realize how foolish that was until recently. So how fast is it, or how much faster/slower, compared to Tacotron2 with HiFi-GAN, or to VITS?

@shivammehta25 shivammehta25 changed the title [WIP] Adding OverFlow Adding OverFlow Dec 9, 2022
@shivammehta25 shivammehta25 marked this pull request as ready for review December 9, 2022 09:52
@shivammehta25 shivammehta25 requested a review from erogol December 9, 2022 10:11
@erogol
Member

erogol commented Dec 9, 2022

Cool, the PR is ready. I'll first try the LJSpeech recipe and let you know how it goes.

# Process Autoregression
h_memory, c_memory = self._process_ar_timestep(t, ar_inputs, h_memory, c_memory)
# Get mean, std and transition vector from decoder for this timestep
# Note: Gradient checkpointing currently doesn't work with multiple GPUs inside a loop
Member

If this is a blocker to using multi-GPU, we should explain it in the model docstring and the docs too.

Member

Is this a model-specific issue, or is it rooted in torch?

Collaborator Author

@shivammehta25 shivammehta25 Dec 10, 2022

This is a torch issue: gradient checkpointing in a loop is currently not supported in DDP. It works fine for multi-GPU if we turn off the flag with use_grad_checkpointing=False, but that will significantly increase memory usage while training. This is because, to compute the actual data likelihood (not an approximation using MAS/Viterbi), we must use all the states at the previous time step during the forward pass to decide the probability mass at the current step.
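The all-states dependency described above is just the HMM forward recursion; it can be sketched with a toy forward algorithm (a hypothetical stand-alone illustration in plain Python, not the OverFlow code — names and numbers are made up):

```python
# Toy HMM forward algorithm. The alphas at step t depend on *all* alphas
# at step t-1, which is why every timestep's activations must be kept
# (or recomputed via gradient checkpointing) during backprop.
# Hypothetical 3-state left-to-right example, not the actual OverFlow code.

def forward_likelihood(trans, emit_probs):
    """trans[i][j]: P(next state j | state i); emit_probs[t][j]: P(obs_t | state j)."""
    n_states = len(trans)
    # Flat start: uniform initial state distribution.
    alpha = [emit_probs[0][j] / n_states for j in range(n_states)]
    for t in range(1, len(emit_probs)):
        # Each new alpha[j] sums over every previous state i.
        alpha = [
            emit_probs[t][j] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)  # total data likelihood P(o_1..o_T)

trans = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
emits = [[0.9, 0.1, 0.1],
         [0.2, 0.8, 0.1],
         [0.1, 0.2, 0.9]]
likelihood = forward_likelihood(trans, emits)
```

Because `alpha` at step `t` is consumed whole at step `t + 1`, there is no per-timestep independence for DDP to exploit, unlike an MAS/Viterbi approximation that commits to a single alignment path.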

Collaborator Author

I have added some information; please take a look and see if it needs any more explanation.

@shivammehta25 shivammehta25 requested a review from erogol December 10, 2022 09:44
@Edresson
Contributor

Great PR @shivammehta25! Thanks for the contribution :).

@shivammehta25 shivammehta25 requested review from Edresson and erogol and removed request for erogol and Edresson December 10, 2022 16:58
@shivammehta25
Collaborator Author

shivammehta25 commented Dec 10, 2022

Oh my bad! I clicked the request-for-review button one too many times. Sorry for the spam xP.
And @Edresson Thank you very much :D Glad you liked it!

@erogol
Member

erogol commented Dec 12, 2022

@shivammehta25 how do you compute lj_parameters.pt?

@shivammehta25
Collaborator Author

Inside TTS/tts/layers/overflow/common_layers.py there is OverflowUtils.get_data_parameters_for_flat_start, which is called in on_init_start of the model. It loads the training data and computes the mean and std (a single scalar each) over the whole training set, to standardise the data during training (the model has mean and std as registered buffers). These are used during initialization in OutputNet, where the weights of the last layer are set to zeros and the biases are set to 0 and 1 (a cool hack from fixup initialization). It also computes an average transition probability to start with an "ideal diagonal alignment". I sort of "cache" these lj_parameters.pt near the cached phonemes.
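The flat-start computation described above can be sketched roughly like this (a simplified, hypothetical stand-alone illustration; the function and variable names are made up and this is not the actual OverflowUtils code):

```python
# Hypothetical sketch of a flat-start parameter computation, not the
# actual OverflowUtils.get_data_parameters_for_flat_start.

def flat_start_parameters(mel_values, total_text_len, total_mel_len):
    """mel_values: flat list of all mel-spectrogram values in the training set.
    Returns (mean, std, transition_prob): scalars to standardise the data and
    to initialise the HMM with an 'ideal diagonal alignment'."""
    n = len(mel_values)
    mean = sum(mel_values) / n
    var = sum((x - mean) ** 2 for x in mel_values) / n
    std = var ** 0.5
    # On a diagonal alignment, each of the total_text_len states emits
    # total_mel_len / total_text_len frames on average, so the chance of
    # leaving a state at any given frame is text_len / mel_len.
    transition_prob = total_text_len / total_mel_len
    return mean, std, transition_prob

mean, std, p = flat_start_parameters([0.0, 2.0, 4.0],
                                     total_text_len=50, total_mel_len=200)
```

Standardising with a single global mean/std (rather than per-feature statistics) matches the "single scalar" wording above; the transition probability just encodes the average number of frames per state.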

@erogol
Member

erogol commented Dec 12, 2022

Ok thanks. I missed it for some reason. I also trained an LJSpeech model. It works great.

But I think we also need to train a vocoder, which we can do separately.

Merging it now 👍

@erogol erogol merged commit 3b8b105 into coqui-ai:dev Dec 12, 2022
@shivammehta25
Collaborator Author

I tried synthesising waveforms with the universal HiFi-GAN vocoder and it works pretty well!
