kaiidams/Kokoro-Align


Kokoro-Align

Kokoro-Align is a PyTorch speech-transcript alignment tool for LibriVox. It splits audio files at silent positions and finds the CTC best path to align the transcript text with the audio. Kokoro-Align is used to build the Kokoro Speech Dataset.
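To make "splitting at silent positions" concrete, here is a minimal sketch of the general idea, not the code Kokoro-Align actually runs. It assumes one energy value per frame is already available; `find_silences`, `split_points`, the threshold, and the minimum run length are all hypothetical names and values chosen for illustration.

```python
from typing import List, Tuple


def find_silences(energies: List[float], threshold: float,
                  min_frames: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges where the energy stays below
    `threshold` for at least `min_frames` consecutive frames."""
    silences = []
    start = None  # first frame of the silent run currently being tracked
    for i, energy in enumerate(energies):
        if energy < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                silences.append((start, i))
            start = None
    # A silent run may extend to the end of the signal.
    if start is not None and len(energies) - start >= min_frames:
        silences.append((start, len(energies)))
    return silences


def split_points(silences: List[Tuple[int, int]]) -> List[int]:
    """Pick the midpoint of each silent run as a cut position."""
    return [(start + end) // 2 for start, end in silences]


# Toy per-frame energies (made-up numbers, not real audio).
energies = [0.9, 0.8, 0.01, 0.02, 0.01, 0.7, 0.6, 0.02, 0.9,
            0.01, 0.01, 0.01, 0.01]
silences = find_silences(energies, threshold=0.05, min_frames=3)
print(silences)                # [(2, 5), (9, 13)]
print(split_points(silences))  # [3, 11]
```

The single quiet frame at index 7 is ignored because it is shorter than `min_frames`, which is the usual reason for requiring a minimum run length: short dips inside a word should not become cut points.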

Objectives

  • Do not depend on non-commercially licensed datasets

How to train CTC model

The CTC model predicts phonemes from MFCC audio features. You can download the pretrained model checkpoint and skip this step.
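The per-frame phoneme scores that such a model outputs are what a "CTC best path" search works on. The following is a minimal pure-Python sketch of one common reading of that term: a Viterbi search over the CTC lattice for a known phoneme sequence. It is an illustration under assumptions, not the implementation in this repository; `ctc_best_path`, the blank index 0, and the toy probabilities are all hypothetical.

```python
import math
from typing import List

NEG_INF = float("-inf")


def ctc_best_path(log_probs: List[List[float]], targets: List[int],
                  blank: int = 0) -> List[int]:
    """For every frame, return the label (a target token or blank) on the
    most probable CTC path that collapses to `targets`."""
    if not targets:
        return [blank] * len(log_probs)
    if not log_probs:
        raise ValueError("no frames to align")
    # Extended sequence: blank, t1, blank, t2, ..., blank
    ext = [blank]
    for token in targets:
        ext += [token, blank]
    T, S = len(log_probs), len(ext)
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start on the leading blank or on the first token.
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance by one, or skip the blank
            # between two different tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == NEG_INF:
                continue  # state not reachable at this frame
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A path must end on the last token or on the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG_INF:
        raise ValueError("too few frames for this transcript")
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path


# Toy example: 5 frames, vocabulary {0: blank, 1: "a", 2: "b"}.
probs = [[0.1, 0.8, 0.1],
         [0.1, 0.7, 0.2],
         [0.6, 0.2, 0.2],
         [0.1, 0.1, 0.8],
         [0.7, 0.1, 0.2]]
log_probs = [[math.log(p) for p in frame] for frame in probs]
print(ctc_best_path(log_probs, [1, 2]))  # [1, 1, 0, 2, 0]
```

The returned list gives one label per frame, so the first and last frame carrying each token can be read off as that phoneme's time span. In a real pipeline, `log_probs` would come from the trained model rather than from hand-written numbers.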

Download data

Get the Kokoro Speech Dataset and extract it under ./data. The transcript data should then be at ./data/kokoro-speech-v1_2-small/metadata.csv.

Preprocessing

Run this to preprocess the Kokoro corpus.

$ python -m kokoro_align.prepare \
    --dataset kokoro \
    --data data/kokoro-speech-v1_2-small

This generates two files: data/kokoro-text.npz contains the phonemes and data/kokoro-audio.npz contains the MFCC features.
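To check what preprocessing produced, the two archives can be opened with NumPy. This is a small sketch only: the array names inside the files are not documented here, so the helper below (`describe_npz`, a hypothetical name) simply lists whatever it finds instead of assuming particular keys.

```python
import os

import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


if __name__ == "__main__":
    # Paths taken from the preprocessing step; skipped if not generated yet.
    for path in ("data/kokoro-text.npz", "data/kokoro-audio.npz"):
        if os.path.exists(path):
            print(path, describe_npz(path))
```

If an archive stores object arrays, `np.load` refuses to read them unless `allow_pickle=True` is passed; whether that applies to these files is not stated in this README.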

Run training

Run this to train the CTC model.

$ python -m kokoro_align.train \
    --train --dataset kokoro --model-dir model/ctc

It achieves a loss similar to this after 100 epochs.

train epoch 199: 100% 65/65 [00:20<00:00,  3.16it/s, loss=0.156]
train epoch 199: 100% 8/8 [00:00<00:00, 18.63it/s]
Avg loss: 0.306335

How to build Kokoro-Speech-Dataset

You can use the model trained in the above process or use the pretrained model.

Download audio data

$ mkdir -p data
$ (cd data && curl -LO http://archive.org/download/gongitsune_um_librivox/gongitsune_um_librivox_64kb_mp3.zip)
$ unzip data/gongitsune_um_librivox_64kb_mp3.zip -d data/gongitsune-by-nankichi-niimi
$ ls data/gongitsune-by-nankichi-niimi/*.mp3 | sort > data/gongitsune_audio_files.txt
$ sed -e 's/\.mp3$/.plain.txt/' data/gongitsune_audio_files.txt > data/gongitsune_original_text_files.txt
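For readers without `ls`, `sort`, and `sed` at hand, the last two commands above can be approximated in Python. This is a sketch of the same idea, not part of the repository: `build_file_lists` and `write_list` are hypothetical helpers, and Python's `sorted` orders by code point, which may differ from a locale-aware `sort`.

```python
from pathlib import Path
from typing import List, Tuple


def build_file_lists(audio_dir: str) -> Tuple[List[str], List[str]]:
    """Return (sorted .mp3 paths, matching .plain.txt paths)."""
    audio = sorted(str(p) for p in Path(audio_dir).glob("*.mp3"))
    # Same substitution as the sed command: trailing .mp3 -> .plain.txt
    text = [a[: -len(".mp3")] + ".plain.txt" for a in audio]
    return audio, text


def write_list(lines: List[str], path: str) -> None:
    """Write one entry per line, like shell redirection would."""
    Path(path).write_text("".join(line + "\n" for line in lines),
                          encoding="utf-8")


if __name__ == "__main__":
    audio_dir = "data/gongitsune-by-nankichi-niimi"
    if Path(audio_dir).is_dir():
        audio, text = build_file_lists(audio_dir)
        write_list(audio, "data/gongitsune_audio_files.txt")
        write_list(text, "data/gongitsune_original_text_files.txt")
```

As with the `sed` command, the second list only contains the expected transcript file names; it does not check that those files exist.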

You can see a shell script that downloads the data by running:

$ python run_example.py --download --dataset gongitsune-by-nankichi-niimi

Make metadata

$ python run_example.py

Copy index

You can use the output directory to make datasets with the Kokoro Speech Dataset.

$ python run_example.py --copy-index

Dataset
