[TTS] Add config and modules for 22khz and 44khz audio codec #10107

Merged

rlangman merged 3 commits into main from codec_audio

Aug 28, 2024

Collaborator

rlangman commented Aug 9, 2024

What does this PR do?

Add config files and corresponding modules for optimized audio codec training.

Collection: [TTS]

Changelog

Add config files for training 22kHz and 44kHz audio codecs.
Add an inverted HiFi-GAN audio encoder.
Add the multi-resolution STFT discriminator from DAC. This does not make much difference by itself, but I made the change in preparation for mixed-bandwidth training.
Add a "half snake" activation to the decoder, which uses half snake and half leaky ReLU activations, allowing the model to jointly model both periodic and non-periodic information. When tuning hyperparameters, I found this to be more stable (and to use less memory) than training with all snake activations. It also performed better than the existing snake alpha variant and performed similarly to the snake beta variants from BigVGAN 1 and 2. I did not add snake activation to the encoder, as doing so significantly reduced TTS performance on the codec.
Replace the tanh activation on the output with clamping, as in BigVGAN 2, as this seemed to reduce artifacts early in training.
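
To make the "half snake" and output-clamping items above more concrete, here is a minimal, dependency-free sketch. It is not the code from this PR. It assumes the split is channel-wise (snake on the first half of the channels, leaky ReLU on the rest), uses a fixed `alpha` where a real module would typically learn it, and borrows 0.01 as the leaky ReLU slope. All function names are illustrative.

```python
import math
from typing import List, Sequence


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + sin^2(alpha * x) / alpha (periodic inductive bias)."""
    return x + math.sin(alpha * x) ** 2 / alpha


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: identity for x >= 0, small linear slope otherwise."""
    return x if x >= 0.0 else negative_slope * x


def half_snake(channels: Sequence[float], alpha: float = 1.0) -> List[float]:
    """Assumed layout: snake on the first half of the channels, leaky ReLU on the rest."""
    split = len(channels) // 2
    periodic = [snake(v, alpha) for v in channels[:split]]
    non_periodic = [leaky_relu(v) for v in channels[split:]]
    return periodic + non_periodic


def clamp_output(sample: float, low: float = -1.0, high: float = 1.0) -> float:
    """Hard-limit a waveform sample to [low, high] instead of squashing it with tanh."""
    return max(low, min(high, sample))


# One time step with four channels: two go through snake, two through leaky ReLU.
activated = half_snake([0.0, math.pi / 2, -1.0, 2.0])

# Inside the valid range, clamping leaves the sample untouched; tanh compresses it.
in_range_clamped = clamp_output(0.9)
in_range_tanh = math.tanh(0.9)
```

Under these assumptions only half of the channels carry the extra snake computation, and clamping acts as the identity inside [-1, 1], whereas tanh compresses every sample.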

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

rlangman requested review from KunalDhawan, nithinraok, Edresson and anteju

August 9, 2024 22:04

github-actions bot added TTS ASR labels

anteju added the Run CICD label

Edresson previously approved these changes

nithinraok requested changes

examples/tts/conf/audio_codec/audio_codec_22050.yaml Resolved

examples/tts/conf/audio_codec/audio_codec_22050.yaml Outdated Resolved

examples/tts/conf/audio_codec/audio_codec_44100.yaml Outdated

		@@ -0,0 +1,194 @@
		# This config contains the default values for training 44.1kHz audio codec model which encodes mel spectrogram
		# instead of raw audio.

Collaborator

nithinraok Aug 12, 2024

Is this mel spectrogram codec or raw audio codec?

Collaborator Author

rlangman Aug 13, 2024

Changed description to raw audio codec.

nemo/collections/asr/parts/utils/activations.py Outdated

		@@ -15,7 +15,7 @@
		 import torch
		 import torch.nn as nn
		-__all__ = ['Swish', 'Snake']
		+__all__ = ['Swish', 'Snake', 'HalfSnake']

Collaborator

nithinraok Aug 12, 2024

Wondering if we should move Snake and HalfSnake to audio collection instead of asr

Collaborator

anteju Aug 12, 2024

Maybe we should move the whole thing to common, so we can import it from anywhere without ending up with circular imports, etc?

Collaborator Author

rlangman Aug 12, 2024

If I add it to common, should it be in the activation_registry? The registry only seems useful if the activation does not require input arguments.

Otherwise, it seems roundabout to have something like:

if activation in ["snake", "half_snake"]:
  self.activation = activation_registry[activation](channels)
else:
  self.activation = activation_registry[activation]()
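
For illustration only, one hypothetical way to avoid the branch above is to register factories that all share the same signature, so that parameter-free activations simply ignore the `channels` argument. The sketch below is self-contained and does not use NeMo's actual `activation_registry`. The names `activation_factories` and `build_activation` and the simplified `Swish` and `Snake` classes are assumptions made for this example.

```python
import math
from typing import Callable, Dict, List


class Swish:
    """Parameter-free activation: x * sigmoid(x)."""

    def __call__(self, x: float) -> float:
        return x / (1.0 + math.exp(-x))


class Snake:
    """Channel-dependent activation: keeps one alpha per channel."""

    def __init__(self, channels: int):
        self.alpha: List[float] = [1.0] * channels

    def __call__(self, x: float, channel: int = 0) -> float:
        a = self.alpha[channel]
        return x + math.sin(a * x) ** 2 / a


# Hypothetical registry: every entry is a factory taking `channels`,
# whether or not the activation actually needs it.
activation_factories: Dict[str, Callable[[int], object]] = {
    "swish": lambda channels: Swish(),
    "snake": lambda channels: Snake(channels),
}


def build_activation(name: str, channels: int) -> object:
    """Look up the factory by name and build the activation with a uniform call."""
    return activation_factories[name](channels)


swish = build_activation("swish", channels=8)
snake = build_activation("snake", channels=8)
```

The trade-off is that callers must always pass `channels`, even for activations that do not use it.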

nemo/collections/tts/modules/audio_codec_modules.py

		@@ -322,6 +324,152 @@ def forward(self, audio_real, audio_gen):
		return scores_real, scores_gen, fmaps_real, fmaps_gen


		class DiscriminatorSTFT(NeuralModule):

Collaborator

nithinraok Aug 12, 2024

Docs missing

Collaborator Author

rlangman Aug 13, 2024

Added

Collaborator

nithinraok Aug 21, 2024

pls add info of arguments. They are missing.

Collaborator Author

rlangman Aug 21, 2024

Added

nemo/collections/tts/modules/audio_codec_modules.py

		return scores, fmap


		class MultiBandDiscriminatorSTFT(NeuralModule):

Collaborator

nithinraok Aug 12, 2024

Docs missing

Collaborator Author

rlangman Aug 13, 2024

Added

Collaborator

nithinraok Aug 21, 2024

pls add information on arguments.

Collaborator Author

rlangman Aug 21, 2024

Added
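
As background for this module: a DAC-style multi-band STFT discriminator splits each spectrogram along the frequency axis and scores the bands separately, repeating this at several STFT resolutions. The sketch below shows only the band-to-bin bookkeeping. The fractional band edges and the FFT sizes are illustrative assumptions rather than values taken from this PR's configs, and `band_bin_ranges` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

# Assumed fractional band edges (fraction of the one-sided spectrum).
ASSUMED_BANDS: List[Tuple[float, float]] = [
    (0.0, 0.1),
    (0.1, 0.25),
    (0.25, 0.5),
    (0.5, 0.75),
    (0.75, 1.0),
]

# Assumed STFT resolutions for the multi-resolution setup.
ASSUMED_FFT_SIZES = [512, 1024, 2048]


def band_bin_ranges(n_fft: int, bands: Sequence[Tuple[float, float]]) -> List[Tuple[int, int]]:
    """Map fractional band edges to [start, end) frequency-bin indices."""
    n_bins = n_fft // 2 + 1  # number of bins in a one-sided STFT
    return [(int(lo * n_bins), int(hi * n_bins)) for lo, hi in bands]


# One list of bin ranges per resolution; each band would get its own sub-discriminator.
ranges_per_resolution = {
    n_fft: band_bin_ranges(n_fft, ASSUMED_BANDS) for n_fft in ASSUMED_FFT_SIZES
}
```

Because each band's end index is computed the same way as the next band's start index, the bands tile the spectrum without gaps or overlap.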

anteju reviewed

examples/tts/conf/audio_codec/audio_codec_22050.yaml Outdated Resolved

anteju reviewed

examples/tts/conf/audio_codec/audio_codec_22050.yaml Outdated Resolved

anteju reviewed

nemo/collections/tts/modules/audio_codec_modules.py Outdated Resolved

anteju reviewed

nemo/collections/tts/modules/audio_codec_modules.py Outdated Resolved

anteju reviewed

nemo/collections/tts/modules/audio_codec_modules.py Resolved

rlangman dismissed Edresson’s stale review via

67553ef

August 13, 2024 20:37

rlangman force-pushed the codec_audio branch from 4dd0d9e to 67553ef Compare

August 13, 2024 20:37

github-actions bot added common and removed ASR labels

rlangman requested a review from nithinraok

August 19, 2024 16:37

nithinraok reviewed

nemo/collections/tts/modules/audio_codec_modules.py Outdated

		@@ -868,6 +1028,108 @@ def forward(self, inputs, input_len):
		         return out
		+class HiFiGANEncoder(NeuralModule):
		+    """
		+    Encoder created by inverting the HiFi-GAN decoder

Collaborator

nithinraok Aug 21, 2024

Add information on arguments

Collaborator Author

rlangman Aug 21, 2024

Added
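
One way to read "inverting the HiFi-GAN decoder", sketched under assumptions: where the decoder upsamples through a series of rates, the encoder downsamples through a matching series of strides, so each encoded frame covers the product of those rates in audio samples. The rates below are hypothetical and are not taken from this PR's configs. The ceiling division assumes the input is padded up to a whole number of frames.

```python
import math
from typing import Sequence

# Hypothetical stride schedule; the real configs may use different values.
ASSUMED_DOWN_SAMPLE_RATES = (2, 4, 8, 8)


def samples_per_frame(down_sample_rates: Sequence[int]) -> int:
    """Total downsampling factor: the product of the per-layer strides."""
    return math.prod(down_sample_rates)


def num_frames(num_samples: int, down_sample_rates: Sequence[int]) -> int:
    """Encoded length, assuming the audio is padded to a multiple of the hop."""
    hop = samples_per_frame(down_sample_rates)
    return -(-num_samples // hop)  # ceiling division


hop = samples_per_frame(ASSUMED_DOWN_SAMPLE_RATES)

# Frames per second of audio at the two sample rates targeted by this PR.
frame_rate_22k = 22050 / hop
frame_rate_44k = 44100 / hop
```

With the same stride schedule, doubling the sample rate doubles the frame rate, which is one reason the two sample rates may call for separate configs.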

rlangman force-pushed the codec_audio branch from 5d3f712 to 16e5779 Compare

August 21, 2024 22:00

Edresson approved these changes

rlangman requested a review from nithinraok

August 28, 2024 14:51

nithinraok approved these changes

Collaborator

nithinraok left a comment

LGTM

rlangman added Run CICD and removed Run CICD labels

rlangman and others added 3 commits

August 28, 2024 08:26


          [TTS] Add config and modules for 22khz and 44khz audio codec

1e3d00f

Signed-off-by: Ryan <rlangman@nvidia.com>


          Apply isort and black reformatting

6e016fe

Signed-off-by: rlangman <rlangman@users.noreply.github.com>


          [TTS] Add argument docstring to new modules

9a8dd30

Signed-off-by: Ryan <rlangman@nvidia.com>

rlangman force-pushed the codec_audio branch from 16e5779 to 9a8dd30 Compare

August 28, 2024 15:26

rlangman added Run CICD and removed Run CICD labels

anteju self-requested a review

August 28, 2024 18:54

anteju approved these changes

rlangman merged commit a860e6b into main

131 of 132 checks passed

rlangman deleted the codec_audio branch

August 28, 2024 21:20

adityavavre pushed a commit to adityavavre/NeMo that referenced this pull request


          [TTS] Add config and modules for 22khz and 44khz audio codec (NVIDIA#…

6e0c591

…10107)

* [TTS] Add config and modules for 22khz and 44khz audio codec

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* [TTS] Add argument docstring to new modules

Signed-off-by: Ryan <rlangman@nvidia.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Co-authored-by: rlangman <rlangman@users.noreply.github.com>
Signed-off-by: adityavavre <aditya.vavre@gmail.com>

monica-sekoyan pushed a commit that referenced this pull request


          [TTS] Add config and modules for 22khz and 44khz audio codec (#10107)

0c6caeb

tomlifu pushed a commit to tomlifu/NeMo that referenced this pull request


          [TTS] Add config and modules for 22khz and 44khz audio codec (NVIDIA#…

a84cc57

…10107)

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

tomlifu pushed a commit to tomlifu/NeMo that referenced this pull request


          [TTS] Add config and modules for 22khz and 44khz audio codec (NVIDIA#…

a740925

…10107)

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request


          [TTS] Add config and modules for 22khz and 44khz audio codec (NVIDIA#…

11769f1

…10107)

Signed-off-by: Hainan Xu <hainanx@nvidia.com>

Labels

common Run CICD TTS