Sample Texts for Mozilla's Common Voice
The main GitHub repository includes all the source files for the text corpus, the iOS and Android apps, and the server that runs the service. There, all sample texts are located in the server/data directory.
This repository, which originally started as a GitHub gist to count word occurrences in the Common Voice corpus, lists all Common Voice texts which are
A full list of languages is available on the Common Voice website. Note that not all languages shown in this repository have been officially launched, either because of localization problems or a lack of text corpus.
To run the scripts, make sure that you already have a copy of the Common Voice repository in the same directory where you will put or clone the common-text directory. For simplicity, I recommend placing both under your home directory.
./
|-common-text/
| |-scripts/
| | |-cv-count-latin.sh // Script
| |-stats/
| | |-(Locale)/
| | | |-... // Copy host
| |-...
|-voice-web/
| |-android/
| |-common/
| |-docker/
| |-docs/
| |-ios/
| |-locales/
| |-nubis/
| |-scripts/
| |-server/
| | |-data/
| | | |-(Locale)/
| | | | |-... // Copy target
| | |-src/
| | |-...
| |-web/
| |-...
|-...
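Before running anything, it can help to confirm that both repositories really sit side by side as in the layout above. The following is only a minimal sketch: the function name check_layout is hypothetical and not part of this repository, and it checks just the two directories named in the tree (common-text/scripts and voice-web/server/data).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): verify that
# common-text/ and voice-web/ are siblings under a base directory.
# The base directory defaults to $HOME, as recommended above.
check_layout() {
  local base="${1:-$HOME}"
  local dir
  local missing=0
  for dir in "common-text/scripts" "voice-web/server/data"; do
    if [ ! -d "$base/$dir" ]; then
      # Report each missing directory on stderr and remember the failure.
      echo "missing: $base/$dir" >&2
      missing=1
    fi
  done
  # Returns 0 when both directories exist, 1 otherwise.
  return "$missing"
}
```

For example, `check_layout "$HOME" && echo "layout looks fine"` prints the message only when both directories are found.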
I welcome any pull requests that improve the extraction scripts. As of now, they are implemented in bash (Linux) and do not work for non-Latin scripts (e.g. Arabic, Chinese).
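For anyone considering a contribution, the general idea of counting word occurrences in Latin-script text can be sketched with standard tools. This is an illustration only and is not the actual content of cv-count-latin.sh; the function name count_words is hypothetical. It also hints at the limitation mentioned above: GNU tr works byte by byte, so accented and non-Latin characters are not treated as letters by this pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the repository's script): count word
# occurrences in Latin-script text read from standard input.
count_words() {
  tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" match
    tr -cs '[:alpha:]' '\n' |    # turn every run of non-letters into one newline
    sed '/^$/d' |                # drop empty lines
    sort |                       # group identical words together
    uniq -c |                    # prefix each word with its count
    sort -k1,1nr -k2,2           # most frequent first, ties in alphabetical order
}
```

A call such as `count_words < sentences.txt` (the file name is a placeholder) would list one word per line, preceded by its count.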
If you would like to contribute more sample texts to this repository, please visit the Common Voice Sentence Collector. Any direct contributions to the sample texts will be overwritten by the texts hosted in Common Voice.
To learn more about this project, or start contributing, visit voice.mozilla.org.
This project is licensed under the Mozilla Public License 2.0. See the LICENSE file or https://mozilla.org/MPL/2.0/ for license details.
In accordance with the Common Voice database license requirements, sample texts (located under the stats//raw/ directory) must be released into the public domain (or under a similar license such as CC0, Unlicense, or WTFPL).