A Kinyarwanda-based end-to-end DeepSpeech API with speech-to-text and text-to-speech services!
Welcome to the Kinyarwanda DeepSpeech API repository! This guide describes an end-to-end solution for speech processing in Kinyarwanda. With the DeepSpeech API, you can convert spoken Kinyarwanda into text and turn text into natural-sounding Kinyarwanda speech.
Introduction
In today's digital age, seamless communication across diverse languages is crucial. Our DeepSpeech API for Kinyarwanda bridges language barriers by offering robust speech-to-text and text-to-speech capabilities tailored specifically for the Kinyarwanda language. Whether you are building interactive voice applications, transcribing audio content, or enhancing accessibility features, our API empowers you to achieve your goals with ease.
Key Features
Accurate Speech-to-Text Conversion: Transcribe spoken Kinyarwanda into written text with a deep learning model trained on around 2000 hours of Kinyarwanda speech data.
Natural Text-to-Speech Synthesis: Generate lifelike Kinyarwanda speech from textual input. Our text-to-speech engine produces natural intonation, rhythm, and pronunciation, creating a seamless and engaging user experience.
End-to-End Processing: Perform both speech-to-text and text-to-speech operations within a single API, streamlining your workflow and saving development time.
Customization: Fine-tune our models to adapt them to specific accents, dialects, or domains, ensuring optimal performance for your unique use case.
Scalability: Our API is designed to handle a high volume of requests, making it suitable for applications ranging from small-scale projects to large-scale enterprise solutions.
The speech-to-text model transcribes speech into the lowercase Latin alphabet, plus spaces and apostrophes, and was trained by Nvidia on around 2000 hours of Kinyarwanda speech data. It is a non-autoregressive "large" variant of Conformer, with around 120 million parameters. See the model architecture and NeMo documentation for complete architecture details.
The text-to-speech model is an end-to-end, deep-learning-based Kinyarwanda Text-to-Speech (TTS) model developed by Digital Umuganda. Thanks to its zero-shot learning capabilities, new voices can be introduced with about one minute of speech. The model was trained with the Coqui TTS library and the YourTTS[1] architecture, on 67 hours of Kinyarwanda Bible data, for 100 epochs.
This is a simple implementation that requires only a few lines of code to run.
It is highly recommended to run the application in a Docker container to avoid dependency errors, but it is also possible to run it without Docker. The minimum specifications are:
- With Docker:
  - Disk space >= 10 GB
  - RAM >= 2 GB
- Without Docker:
  - RAM >= 2 GB free/spare
Follow the steps below to set up the project on a server or machine running Docker.
- Clone the repo
git clone https://github.com/agent87/RW-DEEPSPEECH-API.git
- Pull the large files with Git LFS from inside the cloned repository directory. Make sure Git LFS is installed, or refer to the Git LFS documentation for installation instructions
git lfs pull
- Create an environment file named ".env" in the root directory of the project (for example with "touch .env") and paste in the variables below
MONGO_INITDB_ROOT_USERNAME="admin"
MONGO_INITDB_ROOT_PASSWORD="Bingo123"
MONGO_HOST="mongo"
MONGO_PORT=27017
MONGO_INITDB_DATABASE="Inference"
MONGO_STT_COLLECTION="STT_INFERENCE_LOGS"
MONGO_TTS_COLLECTION="TTS_INFERENCE_LOGS"
MAX_SPEECH_AUDIO_FILE_SIZE=1000
TTS_MAX_TXT_LEN=1000
LOG_LEVEL="INFO"
PYTHONUNBUFFERED=1
DOMAIN=<Replace your DOMAIN here>
SERVER_IP_ADDRESS=<Replace your SERVER_IP_ADDRESS here>
NOTE: For security purposes, make sure to change the variables above, especially the username and password!
- Build the Docker image
Note: if you have an older Docker version without the "docker compose" subcommand, use "docker-compose build" instead
docker compose build
- Start the Docker containers and let the magic begin
docker compose up
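Before running the steps above, a small preflight check can catch a missing tool or a missing variable early. The following is a minimal sketch, not part of the repository: the function names `missing_commands` and `missing_env_keys` are hypothetical, and the list of required keys is simply copied from the example ".env" file in this README.

```shell
# Preflight sketch for the Docker setup steps (helper names are hypothetical).

# Keys copied from the example .env file in this README.
REQUIRED_ENV_KEYS="MONGO_INITDB_ROOT_USERNAME MONGO_INITDB_ROOT_PASSWORD MONGO_HOST MONGO_PORT MONGO_INITDB_DATABASE MONGO_STT_COLLECTION MONGO_TTS_COLLECTION MAX_SPEECH_AUDIO_FILE_SIZE TTS_MAX_TXT_LEN LOG_LEVEL PYTHONUNBUFFERED DOMAIN SERVER_IP_ADDRESS"

# Print the commands from the argument list that are not found on PATH.
missing_commands() {
  mc_missing=""
  for mc_cmd in "$@"; do
    if ! command -v "$mc_cmd" >/dev/null 2>&1; then
      mc_missing="${mc_missing}${mc_missing:+ }${mc_cmd}"
    fi
  done
  printf '%s\n' "$mc_missing"
}

# Print the required keys that have no KEY=... line in the given env file.
missing_env_keys() {
  me_file="$1"
  me_missing=""
  for me_key in $REQUIRED_ENV_KEYS; do
    if ! grep -q "^${me_key}=" "$me_file" 2>/dev/null; then
      me_missing="${me_missing}${me_missing:+ }${me_key}"
    fi
  done
  printf '%s\n' "$me_missing"
}

# Example usage from the repository root:
#   missing_commands git git-lfs docker
#   missing_env_keys .env
```

Both functions print an empty line when nothing is missing, so the output can be tested with `[ -z "$(missing_env_keys .env)" ]` before calling "docker compose build". The check only confirms that each key is present; it does not verify that the default credentials were changed.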
If you do not have specialized hardware (GPU), you can run the application on Google Colab. Use the following link to open the notebook and follow the instructions in the notebook to run the application.
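Speech-to-text: send an audio file to the /stt endpoint (replace server_url and the file path with your own values).
curl -X POST "http://server_url/stt" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "file=@/path/to/audio/file"
Text-to-speech: send the text to synthesize as JSON to the /tts endpoint.
curl -X POST "http://server_url/tts" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"text\":\"string\"}"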
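The two requests above can be wrapped in small shell functions so that the quoting of the JSON body is handled in one place. This is a minimal sketch under stated assumptions, not part of the repository: the names `json_escape`, `tts_payload`, `stt_request`, and `tts_request` are hypothetical, the base URL is passed as the first argument, and `json_escape` only handles backslashes and double quotes (not newlines or other control characters). The response format is not documented here, so the functions simply print whatever the server returns.

```shell
# Client sketch for the /stt and /tts endpoints (function names are hypothetical).

# Escape backslashes and double quotes so the text can sit inside a JSON string.
# Assumption: the input contains no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON body for /tts, using the {"text": ...} shape shown above.
tts_payload() {
  printf '{"text":"%s"}' "$(json_escape "$1")"
}

# stt_request BASE_URL AUDIO_FILE
# With -F, curl generates the multipart Content-Type header (with boundary) itself.
stt_request() {
  curl -sS -X POST "$1/stt" \
    -H "accept: application/json" \
    -F "file=@$2"
}

# tts_request BASE_URL TEXT
tts_request() {
  curl -sS -X POST "$1/tts" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "$(tts_payload "$2")"
}

# Example usage (server_url is a placeholder, as in the commands above):
#   stt_request "http://server_url" /path/to/audio/file
#   tts_request "http://server_url" "Muraho"
```

Building the body with `tts_payload` avoids hand-escaping quotes on the command line, which is the most common source of malformed requests with the raw curl form.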
- Add database
- Add Authentication
- Testing
- CI/CD Setup tutorial
- Automated audio conversion
- OpenAPI Documentation/ Swagger
- Incorporate usage feedback into the README.md
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the GNU GENERAL PUBLIC LICENSE. See LICENSE.txt for more information.
Arnaud Kayonga - @kayarn - arnauldkayonga1@gmail.com
Project Link: https://github.com/agent87/RW-DEEPSPEECH-API