TrainingSpeech is an initiative to provide an open and freely reusable dataset of voices for training speech-to-text models on non-English languages, using already available data (such as audio-books).
Right now, data is extracted exclusively from French-language audio-books. If you are interested in contributing, let me know by creating an issue.
TrainingSpeech comes with a CLI that automates and simplifies:
- transcript extraction
- forced-alignment (using aeneas)
- validation and correction
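
To give an idea of what the forced-alignment step involves, here is a minimal sketch that calls aeneas directly from Python. It is not taken from the TrainingSpeech code base: the function names `build_aeneas_config` and `align`, the plain-text input (one fragment per line) and the JSON output format are assumptions made for illustration, and the CLI may configure aeneas differently.

```python
import os


def build_aeneas_config(language="fra"):
    """Build an aeneas task configuration string.

    Assumption: plain text input (one fragment per line) and a JSON sync map.
    "fra" is the language code aeneas uses for French.
    """
    return "task_language={}|is_text_type=plain|os_task_file_format=json".format(language)


def align(audio_path, text_path, sync_map_path, language="fra"):
    """Align a transcript with its audio and write the resulting sync map."""
    # Imported here so the helper above stays usable without aeneas installed.
    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    task = Task(config_string=build_aeneas_config(language))
    # aeneas expects absolute paths for its input and output files.
    task.audio_file_path_absolute = os.path.abspath(audio_path)
    task.text_file_path_absolute = os.path.abspath(text_path)
    task.sync_map_file_path_absolute = os.path.abspath(sync_map_path)

    ExecuteTask(task).execute()   # compute the alignment
    task.output_sync_map_file()   # write it to sync_map_path
    return sync_map_path
```

Each fragment of the resulting sync map carries a begin and an end time, which is the information the validation and correction step then reviews.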

To validate an existing source:
- pick a source that has NOT been validated yet: see `python manage.py stats` and `./sources.json` for more info
- download assets (i.e. epub and mp3 files): `python manage.py download -s <SOURCE_NAME>`
- check alignment: `python manage.py check-alignment <SOURCE_NAME>` (may require multiple iterations)
- send a pull request with the generated transcript and alignment

To add a new source:
- retrieve the epub and the corresponding mp3 file and store them in `./data/epubs` and `./data/mp3` (respectively)
- create a new source in `./sources.json` (NB: all fields are mandatory)
- generate the initial transcript using `python manage.py build-transcript <SOURCE_NAME>`
- upload the epub and mp3 files to S3: `python manage.py upload -s <SOURCE_NAME>`
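
The `manage.py` commands mentioned in the steps above can be chained from a small script. The sketch below is only a convenience wrapper: the helper names (`manage`, `validation_steps`, `new_source_steps`, `run_all`) are hypothetical, it assumes it is run from the repository root inside the project environment, and it only uses the sub-commands quoted in this README.

```python
import subprocess
import sys


def manage(*args):
    """Build the argument list for one `python manage.py ...` call."""
    return [sys.executable, "manage.py", *args]


def validation_steps(source_name):
    """Commands used when checking an existing source."""
    return [
        manage("download", "-s", source_name),
        manage("check-alignment", source_name),
    ]


def new_source_steps(source_name):
    """Commands used after a new entry has been added to ./sources.json."""
    return [
        manage("build-transcript", source_name),
        manage("upload", "-s", source_name),
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # "my_source" is a placeholder, not a real entry of sources.json.
    run_all(new_source_steps("my_source"))
```

The dry-run default only prints the commands, which is useful because checking an alignment may take several manual iterations.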
$ sudo apt-get install -y ffmpeg espeak libespeak-dev python3-numpy python-numpy libncurses-dev libncursesw5-dev sox libsqlite3-dev
$ git clone git@gitlab.com:nicolaspanel/TrainingSpeech.git
$ pip3 install --user pipenv
$ cd TrainingSpeech
$ pipenv install --python=3.6.6
$ pipenv sync
$ pipenv shell
$ pytest
Releases are ready-to-use zip archives containing:
- short 16 kHz, 16-bit wav audio speeches (0-15 s)
- a single `data.csv` file with the following columns:
  - `path`: path to the audio file inside the archive
  - `duration`: audio duration in seconds
  - `text`: transcript
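
As an illustration of the layout described above, the following sketch reads `data.csv` straight from a release archive and sums the durations. It assumes that `data.csv` sits at the root of the archive, is comma-separated, UTF-8 encoded and starts with a header row naming the three columns; the function name `summarize_release` is hypothetical.

```python
import csv
import io
import zipfile


def summarize_release(zip_path):
    """Return (number of speeches, total duration in seconds) for a release archive."""
    count = 0
    total = 0.0
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: data.csv is at the archive root and has a header row.
        with archive.open("data.csv") as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            for row in reader:
                count += 1
                total += float(row["duration"])
    return count, total
```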
Name | # speeches | # speakers | Total Duration (hh:mm:ss) | Language |
---|---|---|---|---|
2018-11-24_fr_FR (latest) | 67577 | 4 | 95:27:21 | fr_FR |
2018-10-03_fr_FR | 67670 | 4 | 95:28:42 | fr_FR |
2018-10-02_fr_FR | 62657 | 4 | 87:23:34 | fr_FR |
2018-09-28_fr_FR | 61664 | 4 | 86:23:05 | fr_FR |
2018-09-27_fr_FR | 61658 | 4 | 86:22:43 | fr_FR |
2018-09-18_fr_FR | 44439 | 4 | 69:20:14 | fr_FR |
2018-09-05_fr_FR | 10292 | 3 | 15:55:12 | fr_FR |
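
The figures in the table can be cross-checked against the 0-15 s speech length stated for releases. The sketch below assumes the Total Duration column reads hours:minutes:seconds; the helper names are hypothetical.

```python
def duration_to_seconds(value):
    """Convert an 'hh:mm:ss' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mean_speech_length(total_duration, speeches):
    """Average length of one speech, in seconds."""
    return duration_to_seconds(total_duration) / speeches


# Values taken from the 2018-11-24_fr_FR row of the table.
print(round(mean_speech_length("95:27:21", 67577), 2))  # 5.09
```

An average of about five seconds per speech is consistent with clips ranging from 0 to 15 seconds.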