A Ray Serve-based microservice providing:
- Language Identification for 4,017 languages (via a quantized version of MMS-LID).
- Speech-to-Text for 1,162 languages (via MMS-1B-ALL).
- Translation for 200 languages (via NLLB-200-distilled-600M).
- Conversational Agents (coming soon)
This repository is inspired by Meta’s Massively Multilingual Speech (MMS) and No Language Left Behind (NLLB) initiatives. It combines these open-source models to surface language technologies for diverse and low-resource languages.
- Features
- Architecture Overview
- Requirements
- Installation
- Usage
- Smoke Testing
- License
- References and Acknowledgments
## Features

### Language Identification

- Identifies the language of an audio clip.
- Uses a quantized ONNX version of MMS-LID, supporting over 4,000 language IDs (see the sketch below).
- TODO: ensure CUDA execution on ONNX
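As a rough illustration of what the underlying model does, here is a minimal sketch using the plain `facebook/mms-lid-4017` checkpoint via Hugging Face transformers. The repository itself serves a quantized ONNX export, so treat this as a reference, not the actual serving code:

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

MODEL = "facebook/mms-lid-4017"
extractor = AutoFeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL)

def identify_language(waveform, sampling_rate=16_000):
    """waveform: 1-D float array of mono audio at 16 kHz."""
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # The label set maps class indices to ISO 639-3 codes, e.g. "fra".
    return model.config.id2label[int(logits.argmax(-1))]
```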
### Speech-to-Text

- Transcribes audio into text using Facebook's MMS-1B-ALL (see the sketch below).
- Supports over 1,000 languages, given the appropriate language adapter.
- TODO: ensure CUDA execution on ONNX
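The adapter mechanism is the key detail: one base model plus small per-language adapter weights. A minimal transformers sketch follows, again as a reference implementation rather than the repository's actual `TranscriptionDeployment` code:

```python
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

MODEL = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL)

def transcribe(waveform, lang="fra", sampling_rate=16_000):
    # Swap in the vocabulary and adapter weights for the target language.
    processor.tokenizer.set_target_lang(lang)
    model.load_adapter(lang)
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(ids)
```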
### Text Translation

- Translates text between 200 languages using NLLB-200-distilled-600M (see the sketch below).
- Supports ISO 639-3 language codes with script codes (e.g., `fra_Latn`, `eng_Latn`, etc.).
- TODO: Add ONNX implementation
- TODO: ensure CUDA execution
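For reference, translation with this model via transformers looks roughly like the sketch below; the repository's `NLLBDeployment` presumably wraps something similar, but the exact code may differ:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def translate(text, tgt_lang="fra_Latn"):
    inputs = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target-language token.
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=256,
    )
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

print(translate("Hello, world!"))  # e.g. "Bonjour, le monde !"
```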
## Architecture Overview

- Provides a FastAPI-based interface served by Ray Serve.
- Automatic scaling of replicas (GPU or CPU usage can be specified).
- The `LangIdDeployment` handles language identification (`mms-lid-4017`).
- The `TranscriptionDeployment` handles audio transcription (`mms-1b-all`).
- The `NLLBDeployment` handles text translation (`nllb-200-distilled-600M`).
- A single `App` deployment defines the FastAPI routes and ties the deployments together (see the sketch after this list).
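The composition pattern looks roughly like the following sketch. The deployment and route names mirror the list above, but the bodies are illustrative placeholders, not the repository's actual code:

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment  # set num_replicas / ray_actor_options={"num_gpus": 1} as needed
@serve.ingress(app)
class App:
    def __init__(self, langid_handle, transcription_handle, nllb_handle):
        # Handles to the other deployments, injected by Ray Serve at bind time.
        self.langid = langid_handle
        self.transcription = transcription_handle
        self.nllb = nllb_handle

    @app.get("/")
    async def root(self):
        return {"status": "ok"}

# Compose the graph: each .bind() creates a deployment node.
# entrypoint = App.bind(
#     LangIdDeployment.bind(),
#     TranscriptionDeployment.bind(),
#     NLLBDeployment.bind(),
# )
# serve.run(entrypoint)
```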
## Requirements

- Python 3.10 (recommended)
- Conda for installing dependencies
- FFmpeg (for audio processing)
- GPU optional but recommended for faster inference (CUDA >= 12.4)
## Installation

- Clone the repository:

  ```sh
  git clone https://github.com/klebster2/mms-conversational-ai
  cd mms-conversational-ai
  ```
- Set up a Python environment:

  ```sh
  conda env create -f environment.yml
  conda activate <env-name>  # the environment name is defined in environment.yml
  ```
- Start the Ray cluster (optionally in a separate terminal):

  ```sh
  ray start --head
  ```

  or, with the dashboard exposed and one GPU:

  ```sh
  ray start --head --port=6379 --include-dashboard=true --dashboard-host=0.0.0.0 --dashboard-port=8265 --num-gpus=1
  ```

  Alternatively, simply let Ray start automatically in local mode when you run the script.
- Run the main script:

  ```sh
  python api.py
  ```

  This will:

  - Initialize Ray
  - Deploy the `LangIdDeployment`, `TranscriptionDeployment`, and `NLLBDeployment`
  - Start a FastAPI server with the endpoints defined on the `app = FastAPI()` application
  - Print logs as it runs any built-in smoke tests (if configured)

  By default, the service is available at http://127.0.0.1:8000. Visit http://127.0.0.1:8000/docs for an auto-generated Swagger UI.
## Usage

- Root – `GET /`
  - Redirects to `/docs` for the Swagger UI.
- Language Identification – `POST /audio/languageidentification`
  - Accepts an audio file (form data).
  - Returns a JSON object with the detected language code (ISO 639-3) and its autonym.
- Speech-to-Text – `POST /audio/transcription`
  - Accepts an audio file (form data) and an optional `language` query parameter.
  - If `language` is not provided, the service first calls the language identification endpoint to guess the language.
  - Returns the transcription and the language code used.
- Translation – `POST /text/translation`
  - Accepts a JSON body of the form `{ "text": "...", "src_lang": "...", "tgt_lang": "..." }`.
  - Returns the translated text using the NLLB-200-distilled model.
Using `curl` from the command line, here are some basic examples:

- Language Identification:

  ```sh
  curl -X POST "http://127.0.0.1:8000/audio/languageidentification" \
    -H "accept: application/json" \
    -F "audio=@/path/to/your_audio.mp3"
  ```
- Transcription (with automatic language detection):

  ```sh
  curl -X POST "http://127.0.0.1:8000/audio/transcription" \
    -H "accept: application/json" \
    -F "audio=@/path/to/your_audio.wav"
  ```
- Transcription (specifying a language):

  ```sh
  curl -X POST "http://127.0.0.1:8000/audio/transcription?language=fra" \
    -H "accept: application/json" \
    -F "audio=@/path/to/french_audio.wav"
  ```
- Translation:

  ```sh
  curl -X POST "http://127.0.0.1:8000/text/translation" \
    -H "Content-Type: application/json" \
    -d '{"text":"Hello, world!", "src_lang":"eng", "tgt_lang":"fra"}'
  ```
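Equivalently, from Python (a small client sketch using the third-party `requests` library; the response fields depend on the service's JSON schema):

```python
import requests

BASE = "http://127.0.0.1:8000"

# Language identification: multipart upload of an audio file.
with open("/path/to/your_audio.mp3", "rb") as f:
    r = requests.post(f"{BASE}/audio/languageidentification", files={"audio": f})
print(r.json())

# Transcription with an explicit ISO 639-3 language code.
with open("/path/to/french_audio.wav", "rb") as f:
    r = requests.post(
        f"{BASE}/audio/transcription", params={"language": "fra"}, files={"audio": f}
    )
print(r.json())

# Translation: JSON body with source and target languages.
r = requests.post(
    f"{BASE}/text/translation",
    json={"text": "Hello, world!", "src_lang": "eng", "tgt_lang": "fra"},
)
print(r.json())
```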
## Smoke Testing

The script includes `run_smoke_test_audio` and `run_smoke_test_text` helper functions that download test files and call the local endpoints. When you run `python api.py`, it performs a few sample calls:

- French “merci” (should detect `fra`)
- Buriat sample (expected `bxm`)
- Gettysburg Address (English) for transcription
- Yoruba audio sample for transcription
- Simple English-to-French translation

You can customize or remove these tests in the `__main__` section.
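For orientation, a smoke-test helper of this kind might look like the sketch below; the actual helpers in `api.py` may differ in names, URLs, and the response fields they check:

```python
import requests

def run_smoke_test_audio(url, endpoint, expected_lang=None):
    """Download a test clip from `url` and POST it to a local endpoint."""
    audio_bytes = requests.get(url, timeout=30).content
    resp = requests.post(
        f"http://127.0.0.1:8000{endpoint}",
        files={"audio": ("test_audio", audio_bytes)},
        timeout=120,
    )
    resp.raise_for_status()
    result = resp.json()
    # Hypothetical response field; adjust to the service's actual schema.
    if expected_lang is not None:
        assert result.get("language") == expected_lang, result
    return result
```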
## License

- Code: CC0 1.0 – you are free to use and adapt it without restriction.
- Models: the pretrained models from Meta (MMS-LID, MMS-1B-ALL, and NLLB) are released under the CC-BY-NC-4.0 license.
- Please consult each model's license (see its Hugging Face model page) for usage terms and attribution requirements.
- Important: commercial usage may be restricted under the CC-BY-NC-4.0 license.
## References and Acknowledgments

- Meta AI: Massively Multilingual Speech (MMS)
- Meta AI: No Language Left Behind (NLLB)
- Hugging Face model repositories
- Ray Serve, for serving the FastAPI application at scale
- ISO 639-3 language codes
- UNICEF: Why Mother-Tongue Education Holds the Key to Unlocking Every Child's Potential
- Letter to the UK Parliament by Lucy Crompton-Reid (Chief Executive, Wikimedia UK)
- UNESCO: The World Atlas of Languages
- UNESCO: World Atlas of Languages - Summary Document
- Atlas of the World's Languages in Danger
- CIA World Factbook: Languages
- UCLA Phonetics Lab Data
- Criticism of ISO 639-3
If you use or build upon this repository, please consider citing or mentioning the original models and referencing Meta’s relevant research.
Enjoy building massively multilingual conversational AI!