[TTS] Add config and modules for 22khz and 44khz audio codec #10107

Merged: 3 commits merged on Aug 28, 2024

Conversation

Collaborator

@rlangman rlangman commented Aug 9, 2024

What does this PR do?

Add config files and corresponding modules for optimized audio codec training.

Collection: [TTS]

Changelog

  • Add config files for training 22 kHz and 44 kHz audio codecs.
  • Add an inverted HiFi-GAN audio encoder.
  • Add the multi-resolution STFT discriminator from DAC. This does not make much difference by itself, but I made the change in preparation for mixed-bandwidth training.
  • Added a "half snake" activation to the decoder, which uses half snake and half leaky ReLU activations, allowing the model to jointly model both periodic and non-periodic information. When tuning hyperparameters, I found this to be more stable (and to use less memory) than training with all snake activations. It also performed better than the existing snake alpha and performed similarly to the snake beta variants from BigVGAN 1 and 2. I did not add snake activation to the encoder, as doing so significantly reduced TTS performance on the codec.
  • Replaced the tanh activation on the output with clamping, as in BigVGAN 2, as this seemed to reduce artifacts early in training.
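
To make the "half snake" item concrete, here is a minimal framework-free sketch of the arithmetic. It is not the code from this PR: the function names, the list-of-lists layout, the choice to apply snake to the first half of the channels and leaky ReLU to the second half, and the 0.01 negative slope are all assumptions made for illustration. The snake formula itself is the standard x + sin²(αx)/α.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + sin^2(alpha * x) / alpha (periodic bias)."""
    return x + math.sin(alpha * x) ** 2 / alpha


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: identity for x >= 0, a small linear slope otherwise."""
    return x if x >= 0 else negative_slope * x


def half_snake(channels, alpha=1.0, negative_slope=0.01):
    """Apply snake to the first half of the channels and leaky ReLU to the rest.

    `channels` is a list of per-channel sample lists (shape [C][T]).
    Which half gets which activation is an assumption of this sketch.
    """
    split = len(channels) // 2
    out = []
    for index, samples in enumerate(channels):
        if index < split:
            # Periodic half: snake on every sample of this channel.
            out.append([snake(v, alpha) for v in samples])
        else:
            # Non-periodic half: plain leaky ReLU.
            out.append([leaky_relu(v, negative_slope) for v in samples])
    return out


# Two channels, two samples each: channel 0 gets snake, channel 1 leaky ReLU.
activations = half_snake([[0.0, 1.0], [-1.0, 2.0]])
```

In a real module the same split would presumably be done on tensors along the channel dimension, with alpha as a learnable per-channel parameter.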

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Edresson previously approved these changes Aug 12, 2024
examples/tts/conf/audio_codec/audio_codec_22050.yaml (outdated)
@@ -0,0 +1,194 @@
# This config contains the default values for training 44.1kHz audio codec model which encodes mel spectrogram
# instead of raw audio.
Collaborator

Is this mel spectrogram codec or raw audio codec?

Collaborator Author

Changed description to raw audio codec.

@@ -15,7 +15,7 @@
import torch
import torch.nn as nn

-__all__ = ['Swish', 'Snake']
+__all__ = ['Swish', 'Snake', 'HalfSnake']
Collaborator

I wonder whether Snake and HalfSnake should move to the audio collection instead of asr.

Collaborator

Maybe we should move the whole thing to common, so we can import it from anywhere without ending up with circular imports, etc?

Collaborator Author

If I add it to common, should it be in the activation_registry? The registry only seems useful if the activation does not require input arguments.

Otherwise, it seems roundabout to have something like:

if activation in ["snake", "half_snake"]:
  self.activation = activation_registry[activation](channels)
else:
  self.activation = activation_registry[activation]()
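
One way to avoid that branch, sketched here purely as an illustration and not as NeMo's actual activation_registry, is to register factories that all accept the same optional `channels` argument and ignore it when it is not needed. Every name below (`Identity`, `Snake`, `ACTIVATION_FACTORIES`, `build_activation`) is a hypothetical stand-in.

```python
from typing import Callable, Dict, Optional


class Identity:
    """Stand-in for an activation that takes no constructor arguments."""

    def __init__(self) -> None:
        self.channels = None


class Snake:
    """Stand-in for an activation that needs the channel count."""

    def __init__(self, channels: int) -> None:
        self.channels = channels


def _require_channels(channels: Optional[int], name: str) -> int:
    """Fail loudly when a channel-dependent activation gets no channel count."""
    if channels is None:
        raise ValueError(f"activation '{name}' requires 'channels'")
    return channels


# Hypothetical registry: every entry is a factory with the same signature,
# so callers never branch on the activation name.
ACTIVATION_FACTORIES: Dict[str, Callable[[Optional[int]], object]] = {
    "identity": lambda channels=None: Identity(),
    "snake": lambda channels=None: Snake(_require_channels(channels, "snake")),
}


def build_activation(name: str, channels: Optional[int] = None) -> object:
    """Look up the factory by name and always pass `channels` through."""
    return ACTIVATION_FACTORIES[name](channels)


# The caller uses one code path for both kinds of activation.
snake_act = build_activation("snake", channels=8)
plain_act = build_activation("identity", channels=8)
```

The trade-off is that argument-free activations now receive a parameter they silently ignore, which may or may not be preferable to the explicit branch.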

@@ -322,6 +324,152 @@ def forward(self, audio_real, audio_gen):
return scores_real, scores_gen, fmaps_real, fmaps_gen


class DiscriminatorSTFT(NeuralModule):
Collaborator

Docs missing

Collaborator Author

Added

Collaborator

Please add information about the arguments. It is missing.

Collaborator Author

Added

return scores, fmap


class MultiBandDiscriminatorSTFT(NeuralModule):
Collaborator

Docs missing

Collaborator Author

Added

Collaborator

Please add information about the arguments.

Collaborator Author

Added

@@ -868,6 +1028,108 @@ def forward(self, inputs, input_len):
return out


class HiFiGANEncoder(NeuralModule):
"""
Encoder created by inverting the HiFi-GAN decoder
Collaborator

Add information on arguments

Collaborator Author

Added

Collaborator

@nithinraok nithinraok left a comment

LGTM

rlangman and others added 3 commits August 28, 2024 08:26
Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Signed-off-by: Ryan <rlangman@nvidia.com>
@anteju anteju self-requested a review August 28, 2024 18:54
@rlangman rlangman merged commit a860e6b into main Aug 28, 2024
131 of 132 checks passed
@rlangman rlangman deleted the codec_audio branch August 28, 2024 21:20
adityavavre pushed a commit to adityavavre/NeMo that referenced this pull request Sep 15, 2024
…10107)

* [TTS] Add config and modules for 22khz and 44khz audio codec

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* [TTS] Add argument docstring to new modules

Signed-off-by: Ryan <rlangman@nvidia.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Co-authored-by: rlangman <rlangman@users.noreply.github.com>
Signed-off-by: adityavavre <aditya.vavre@gmail.com>
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
* [TTS] Add config and modules for 22khz and 44khz audio codec

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* [TTS] Add argument docstring to new modules

Signed-off-by: Ryan <rlangman@nvidia.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Co-authored-by: rlangman <rlangman@users.noreply.github.com>
tomlifu pushed a commit to tomlifu/NeMo that referenced this pull request Oct 25, 2024
…10107)

* [TTS] Add config and modules for 22khz and 44khz audio codec

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* [TTS] Add argument docstring to new modules

Signed-off-by: Ryan <rlangman@nvidia.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Co-authored-by: rlangman <rlangman@users.noreply.github.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
…10107)

* [TTS] Add config and modules for 22khz and 44khz audio codec

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* [TTS] Add argument docstring to new modules

Signed-off-by: Ryan <rlangman@nvidia.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Co-authored-by: rlangman <rlangman@users.noreply.github.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>
4 participants