Voice100 includes neural TTS/ASR models. Inference with Voice100 is low cost because its models are tiny and depend only on CNNs, with no recurrence.
- Does not depend on non-commercially licensed datasets
- Small enough to run on ordinary PCs, Raspberry Pi and smartphones
- Sample synthesis 1: "beginnings are apt to be determinative and when reinforced by continuous applications of similar influence"
- Sample synthesis 2: "which had restored the courage of noirtier for ever since he had conversed with the priest his violent despair had yielded to a calm resignation which surprised all who knew his excessive affection"
- Sample synthesis 1 (Japanese): また、東寺のように五大明王と呼ばれる主要な明王の中央に配されることも多い。 ("Also, as at Tō-ji, it is often placed at the center of the principal Myōō known as the Five Great Wisdom Kings.")
- Sample synthesis 2 (Japanese): ニューイングランド風は牛乳をベースとした白いクリームスープでありボストンクラムチャウダーとも呼ばれる ("New England style is a white cream soup made with a milk base, also called Boston clam chowder.")
The TTS model is divided into two sub-models, the align model and the audio model. The align model predicts text alignments given a text. An aligned text is generated from the text and the text alignments. The audio model predicts WORLD features (F0, spectral envelope, coded aperiodicity) given the aligned text.
Alignment network
graph TD
A[Input text] -->|hello| B(Embedding)
B --> C(1D inverted residual x4)
C --> D(Convolution)
D -->|h:0,1 e:0,2 l:1,1 l:1,1 o:1,2| E[Alignment]
Audio network
graph TD
A[Aligned text] -->|_hee_l_l_oo| B(Embedding)
B --> C(1D inverted residual x4)
C --> D(1D transpose convolution)
D --> E(1D inverted residual x3)
E --> F(Convolution)
F --> G[WORLD parameters]
Align model summary:

| Name | Type | Params
-----------------------------------------
0 | embedding | Embedding | 14.8 K
1 | layers | Sequential | 8.6 M
-----------------------------------------
8.6 M Trainable params
0 Non-trainable params
8.6 M Total params
17.137 Total estimated model params size (MB)
Audio model summary:

| Name | Type | Params
-------------------------------------------
0 | embedding | Embedding | 14.8 K
1 | decoder | VoiceDecoder | 11.0 M
2 | norm | WORLDNorm | 518
3 | criterion | WORLDLoss | 0
-------------------------------------------
11.1 M Trainable params
518 Non-trainable params
11.1 M Total params
22.120 Total estimated model params size (MB)
The input of the align model is a sequence of tokens from the input text. The input text is lowercased, tokenized into characters and encoded by the text encoder. The text encoder has a vocabulary of 28 characters: the lowercase alphabet, a space and an apostrophe. Characters not found in the vocabulary are removed.
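As a rough sketch of this pre-processing (the actual token ordering in Voice100 may differ; the index assignment below is an assumption for illustration only):

```python
# 28-character vocabulary: a space, the lowercase alphabet and an apostrophe.
# The ordering here is an assumption, not Voice100's actual one.
CHARS = " abcdefghijklmnopqrstuvwxyz'"

def encode_text(text: str) -> list[int]:
    """Lowercase the text, drop out-of-vocabulary characters
    and map the remaining characters to integer token ids."""
    return [CHARS.index(c) for c in text.lower() if c in CHARS]

print(encode_text("Hello, World!"))  # the comma and "!" are removed
```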
The output of the align model is a sequence of timing pairs whose length is the same as the number of input tokens. A pair has two values: the number of frames before the token and the number of frames for the token. One frame is 20 ms. An aligned text is generated from the input text and the timing pairs. The length of the aligned text is the total number of frames for the audio.
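The expansion from timing pairs to an aligned text can be sketched as follows. The timing values are made up to reproduce the "hello" example in the diagrams above, and `_` is assumed to be the spacing token:

```python
def make_aligned_text(text, timings, blank="_"):
    """Expand a text into an aligned text: each timing pair is
    (frames before the token, frames for the token), with one
    output character per 20 ms frame."""
    return "".join(blank * before + ch * dur
                   for ch, (before, dur) in zip(text, timings))

aligned = make_aligned_text("hello", [(1, 1), (0, 2), (1, 1), (1, 1), (1, 2)])
print(aligned)              # -> _hee_l_l_oo
print(len(aligned) * 0.02)  # total audio length in seconds
```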
The input of the audio model is the encoded aligned text, encoded in the same way as in the align model pre-processing, except that the vocabulary has one extra token used for spacing between the tokens of the original text.
The output of the audio model is a sequence of F0, F0 existence, log spectral envelope and coded aperiodicity values. An F0 existence is a boolean value, which is true when F0 is available and false otherwise. F0 is forced to 0 when the F0 existence is false. One frame is 10 ms, so the length of the output is twice the length of the input.
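For example, masking F0 with the existence flag can be sketched as follows (the values are illustrative, not real model outputs):

```python
# Illustrative model outputs for four 10 ms frames.
f0 = [120.0, 118.5, 0.7, 119.2]
f0_existence = [True, True, False, True]

# Force F0 to 0 wherever the F0 existence flag is false.
f0 = [f if e else 0.0 for f, e in zip(f0, f0_existence)]
print(f0)  # -> [120.0, 118.5, 0.0, 119.2]
```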
The ASR model is a 9-layer MobileNet-like inverted residual network trained with CTC loss.
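Because the model is trained with CTC, its frame-level predictions can be turned into text with standard CTC greedy decoding: collapse consecutive repeats, then drop blanks. A minimal sketch (the blank symbol here is illustrative):

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Standard CTC greedy decoding: collapse consecutive repeated
    labels, then remove blank symbols."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

print(ctc_greedy_decode("hh-e-ll-lo-"))  # -> hello
```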
ASR network
graph TD
A[Mel spectrogram] --> B(1D inverted residual x 12)
B --> C(Convolution)
C --> G[Logits of aligned text]
| Name | Type | Params
----------------------------------------------------------------
0 | encoder | ConvVoiceEncoder | 11.6 M
1 | decoder | LinearCharDecoder | 14.9 K
2 | loss_fn | CTCLoss | 0
3 | batch_augment | BatchSpectrogramAugumentation | 0
----------------------------------------------------------------
11.6 M Trainable params
0 Non-trainable params
11.6 M Total params
23.243 Total estimated model params size (MB)
The align model is a 2-layer bi-directional LSTM trained to predict aligned texts from MFCC audio features. It is used to prepare the aligned texts of the dataset for training the TTS models.
| Name | Type | Params
----------------------------------------------------------------
0 | conv | Conv1d | 24.7 K
1 | lstm | LSTM | 659 K
2 | dense | Linear | 7.5 K
3 | loss_fn | CTCLoss | 0
4 | batch_augment | BatchSpectrogramAugumentation | 0
----------------------------------------------------------------
691 K Trainable params
0 Non-trainable params
691 K Total params
1.383 Total estimated model params size (MB)
voice100-prepare-dataset \
--dataset ljspeech \
--language en \
--use_phone
voice100-prepare-dataset \
--dataset librispeech \
--language en \
--use_phone
voice100 fit \
--config config/asr_en_phone_base.yaml \
--trainer.accelerator gpu \
--trainer.devices 1 \
--trainer.precision 16 \
--trainer.default_root_dir ./outputs/asr_en_phone_base
This generates the aligned text as data/${DATASET}-phone-align.txt.
voice100-align-text \
--batch_size 4 \
--dataset ljspeech \
--language en \
--use_phone \
--checkpoint asr_en_phone_small-20230309.ckpt
voice100 fit --config voice100/config/align_en_phone_base.yaml \
--trainer.accelerator gpu \
--trainer.devices 1 \
--trainer.precision 16 \
--trainer.default_root_dir=./outputs/align_en_phone_base
This generates the statistics as data/${DATASET}-stat.pt.
voice100-calc-stat \
--dataset ljspeech \
--language en \
--output data/audio-stat.pt
voice100 fit --config voice100/config/tts_en_phone_base.yaml \
--trainer.accelerator gpu \
--trainer.devices 1 \
--trainer.precision 16 \
--trainer.default_root_dir=./outputs/tts_en_phone_base
voice100-export-onnx \
--checkpoint model/${MODEL}/lightning_logs/version_0/checkpoints/last.ckpt
CMU models are models that use the output of g2p_en as the text representation instead of raw text.
The following commands convert the texts in the dataset into ./data/[dataset]-phone-[split].txt. Then run voice100-train-[model] with --use_phone.
voice100-prepare-dataset \
--dataset ljspeech
voice100-prepare-dataset \
--dataset librispeech \
--split train
voice100-prepare-dataset \
--dataset librispeech \
--split val
The CMU multitask model is a variant of the TTS audio model whose input is an aligned text and whose outputs are WORLD vocoder parameters and CMU phonemes. To train the CMU multitask model, we need alignment data for both English text and CMU phonemes:
./data/ljspeech-align-train.txt
./data/ljspeech-phone-align-train.txt
Then run
MODEL=ttsaudio_en_mt_conv_base
voice100-train-ttsaudio-mt \
--gpus 1 \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--batch_size 32 \
--precision 16 \
--max_epochs 150 \
--default_root_dir ./model/${MODEL}
Use the Voice100 runtime with the exported ONNX files below.
Name | Model Class | Dataset | Download |
---|---|---|---|
asr_en_small-20230225 | AudioToAlignText | LibriSpeech, LJ Speech 1.1 | download |
asr_en_base-20230319 | AudioToAlignText | LibriSpeech, LJ Speech 1.1 | download |
asr_en_phone_small-20230309 | AudioToAlignText | LibriSpeech, LJ Speech 1.1 | download |
asr_en_phone_base-20230314 | AudioToAlignText | LibriSpeech, LJ Speech 1.1 | download |
asr_ja_phone_small-20230104 | AudioToAlignText | Common Voice 12.0 ja | download |
asr_ja_phone_base-20230104 | AudioToAlignText | Common Voice 12.0 ja | download |
align_en_base-20230401 | TextToAlignText | LJ Speech 1.1 | download |
tts_en_base-20230407 | AlignTextToAudio | LJ Speech 1.1 | download |
align_en_phone_base-20230407 | TextToAlignText | LJ Speech 1.1 | download |
tts_en_phone_base-20230401 | AlignTextToAudio | LJ Speech 1.1 | download |
align_ja_phone_base-20230203 | TextToAlignText | Kokoro Speech v1.2 large | download |
tts_ja_phone_base-20230204 | AlignTextToAudio | Kokoro Speech v1.2 large | download |
asr_en_base-20210628 (deprecated) | AudioAlignCTC | LJ Speech 1.1 | download |
align_en_lstm_base_ctc-20210628 (deprecated) | AudioAlignCTC | LJ Speech 1.1 | download |
align_en_phone_lstm_base_ctc-20220103 (deprecated) | AudioAlignCTC | LJ Speech 1.1 | download |
align_ja_lstm_base_ctc-20211116 (deprecated) | AudioAlignCTC | Kokoro Speech v1.1 small | download |
align_ja_phone_lstm_base_ctc-20221230 (deprecated) | AudioAlignCTC | Kokoro Speech v1.1 small | download |
ttsalign_en_conv_base-20220409 (deprecated) | TextToAlignTextModel | LJ Speech 1.1 | download |
ttsalign_en_phone_conv_base-20220409 (deprecated) | TextToAlignTextModel | LJ Speech 1.1 | download |
ttsalign_ja_conv_base-20220411 (deprecated) | TextToAlignTextModel | Kokoro Speech v1.1 small | download |
ttsaudio_en_conv_base-20220107 (deprecated) | AlignTextToAudioModel | Kokoro Speech v1.1 small | download |
ttsaudio_en_phone_conv_base-20220105 (deprecated) | AlignTextToAudioModel | LJ Speech 1.1 | download |
ttsaudio_ja_conv_base-20220416 (deprecated) | AlignTextToAudioModel | Kokoro Speech v1.1 small | download |
ttsaudio_en_mt_conv_base-20220316 (deprecated) | AlignTextToAudioMultiTaskModel | LJ Speech 1.1 | download |
asr_en_conv_base_ctc-20220126 (deprecated) | AudioToTextCTC | LibriSpeech | download |
asr_en_phone_conv_base_ctc-20220107 (deprecated) | AudioToTextCTC | LibriSpeech | download |
stt_ja_conv_base_ctc-20211127 (deprecated) | AudioToTextCTC | Common Voice 6.1 ja | download |
asr_ja_phone_conv_base_ctc-20221225 (deprecated) | AudioToTextCTC | Common Voice 6.1 ja | download |