Massively Multilingual Conversational AI

A Ray Serve-based microservice providing:

  1. Language Identification for 4,017 languages (via a quantized version of MMS-LID).
  2. Speech-to-Text for 1,162 languages (via MMS-1B-ALL).
  3. Translation for 200 languages (via NLLB-200-distilled-600M).
  4. Conversational Agents (coming soon).

This repository is inspired by Meta’s Massively Multilingual Speech (MMS) and No Language Left Behind (NLLB) initiatives. By combining these open-source models, it aims to make language technologies available for diverse and low-resource languages.
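To give a concrete sense of what the translation component does, here is a minimal sketch that calls NLLB-200-distilled-600M directly through the Hugging Face transformers pipeline, independently of the microservice (it assumes transformers and torch are installed):

    from transformers import pipeline

    # Load the distilled 600M-parameter NLLB checkpoint from Hugging Face.
    translator = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",
        src_lang="eng_Latn",  # ISO 639-3 code plus script, as NLLB expects
        tgt_lang="fra_Latn",
    )

    # Prints the French translation of the input text.
    print(translator("Hello, world!")[0]["translation_text"])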


Table of Contents

  • Features
  • Architecture Overview
  • Requirements
  • Installation
  • Usage
  • Endpoints
  • Example Requests
  • Smoke Testing
  • License
  • References and Acknowledgments


Features

  1. Language Identification

    • Identifies the language of an audio clip.
    • Uses a quantized ONNX version of MMS-LID, supporting over 4,000 language IDs.
    • TODO: ensure CUDA execution on ONNX
  2. Speech-to-Text

    • Transcribes audio into text, using Facebook MMS-1B-ALL.
    • Over 1,000 languages supported with the appropriate language adapter.
    • TODO: ensure CUDA execution on ONNX
  3. Text Translation

    • Translates text between 200 languages using NLLB-200-distilled-600M.
    • Supports ISO 639-3 language codes with script codes (e.g., fra_Latn, eng_Latn).
    • TODO: Add ONNX implementation
    • TODO: ensure CUDA execution
  4. Ray Serve Microservice

    • Provides a FastAPI-based interface served by Ray.
    • Automatic scaling of replicas (GPU or CPU usage can be specified).

Architecture Overview

  • The LangIdDeployment handles language identification (mms-lid-4017).
  • The TranscriptionDeployment handles audio transcription (mms-1b-all).
  • The NLLBDeployment handles text translation (nllb-200-distilled-600M).
  • A single App deployment includes the FastAPI routes and ties them all together (sketched below).
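The wiring looks roughly like the sketch below. The class names match the deployments above, but the method names, resource settings, and autoscaling values are illustrative assumptions rather than the repository’s exact code (LangIdDeployment is omitted for brevity):

    from fastapi import FastAPI
    from ray import serve

    app = FastAPI()

    @serve.deployment(ray_actor_options={"num_gpus": 1})
    class TranscriptionDeployment:
        def transcribe(self, audio_bytes: bytes, language: str) -> str:
            ...  # run MMS-1B-ALL inference here

    @serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 4})
    class NLLBDeployment:
        def translate(self, text: str, src_lang: str, tgt_lang: str) -> str:
            ...  # run NLLB-200-distilled-600M inference here

    @serve.deployment
    @serve.ingress(app)  # attaches the FastAPI routes to this deployment
    class App:
        def __init__(self, transcriber, translator):
            # DeploymentHandles let the ingress call the other deployments.
            self.transcriber = transcriber
            self.translator = translator

        @app.post("/text/translation")
        async def translation(self, body: dict):
            return await self.translator.translate.remote(
                body["text"], body["src_lang"], body["tgt_lang"]
            )

    serve.run(App.bind(TranscriptionDeployment.bind(), NLLBDeployment.bind()))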

Requirements

  • Python 3.10 (recommended)
  • Conda for installing dependencies
  • FFmpeg (for audio processing)
  • GPU is optional but recommended for faster inference (CUDA >= 12.4 required for GPU execution)
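A quick way to check these prerequisites from Python (a convenience sketch, not part of the repository):

    import shutil
    import sys

    if sys.version_info[:2] != (3, 10):
        print(f"Warning: Python 3.10 is recommended, found {sys.version.split()[0]}")
    if shutil.which("ffmpeg") is None:
        print("Warning: FFmpeg was not found on PATH")
    try:
        import torch
        print("CUDA available:", torch.cuda.is_available())
    except ImportError:
        print("torch is not installed yet; skipping the GPU check")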

Installation

  1. Clone the repository:

    git clone https://github.com/klebster2/mms-conversational-ai
    cd mms-conversational-ai
  2. Set up a Python environment:

    conda env create -f environment.yml
    conda activate <env-name>  # use the environment name defined in environment.yml

Usage

Running the Service

  1. Start the Ray cluster (optionally in a separate terminal):

    ray start --head

    or

    ray start --head --port=6379 --include-dashboard=true --dashboard-host=0.0.0.0 --dashboard-port=8265 --num-gpus=1

    Or simply let Ray automatically start in local mode when you run the script.

  2. Run the main script:

    python api.py

    This will:

    • Initialize Ray
    • Deploy the LangIdDeployment, TranscriptionDeployment, and NLLBDeployment
    • Start a FastAPI server with the endpoints defined in app = FastAPI()
    • Print logs as it runs any built-in smoke tests (if configured)

By default, the service will be available at http://127.0.0.1:8000. Visit http://127.0.0.1:8000/docs for an auto-generated Swagger UI.

Endpoints

  1. Root: GET /

    • Redirects to /docs for the Swagger UI.
  2. Language Identification: POST /audio/languageidentification

    • Accepts an audio file (Form data).
    • Returns a JSON object with the detected language code (ISO 639-3) and the autonym.
  3. Speech-to-Text: POST /audio/transcription

    • Accepts an audio file (Form data) and an optional language query parameter.
    • If language is not provided, it will first call the language identification endpoint to guess the language (sketched after this list).
    • Returns the transcription and the language code used.
  4. Translation: POST /text/translation

    • Accepts a JSON body with { "text": "...", "src_lang": "...", "tgt_lang": "..." }.
    • Returns the translated text using the NLLB-200-distilled model.
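The language fallback in endpoint 3 works roughly as follows, continuing the illustrative App class from the Architecture Overview sketch; the handle and method names (self.langid, identify, transcribe) are assumptions, not the repository’s exact identifiers:

    from fastapi import UploadFile

    # Inside the illustrative App class:
    @app.post("/audio/transcription")
    async def transcription(self, audio: UploadFile, language: str | None = None):
        audio_bytes = await audio.read()
        if language is None:
            # No language supplied: ask the language-identification
            # deployment to guess it before transcribing.
            language = await self.langid.identify.remote(audio_bytes)
        text = await self.transcriber.transcribe.remote(audio_bytes, language)
        return {"language": language, "transcription": text}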

Example Requests

Using curl from the command line, here are some basic examples:

  1. Language Identification:

    curl -X POST "http://127.0.0.1:8000/audio/languageidentification" \
         -H "accept: application/json" \
         -F "audio=@/path/to/your_audio.mp3"
  2. Transcription (with automatic language detection):

    curl -X POST "http://127.0.0.1:8000/audio/transcription" \
         -H "accept: application/json" \
         -F "audio=@/path/to/your_audio.wav"
  3. Transcription (specifying a language):

    curl -X POST "http://127.0.0.1:8000/audio/transcription?language=fra" \
         -H "accept: application/json" \
         -F "audio=@/path/to/french_audio.wav"
  4. Translation:

    curl -X POST "http://127.0.0.1:8000/text/translation" \
         -H "Content-Type: application/json" \
         -d '{"text":"Hello, world!", "src_lang":"eng", "tgt_lang":"fra"}'

Smoke Testing

The script includes run_smoke_test_audio and run_smoke_test_text helper functions that download test files and call the local endpoints. When you run python api.py, it performs a few sample calls:

  • French “merci” (should detect fra)
  • Buriat sample (expected bxm)
  • Gettysburg address (English) for transcription
  • Yoruba audio sample for transcription
  • Simple English-to-French translation

You can customize or remove these tests in the __main__ section.


License

  • Code: CC0 1.0 – You’re free to use and adapt without restriction.
  • Models: The pretrained models from Meta (MMS-LID, MMS-1B-ALL, and NLLB) are released under a CC-BY-NC-4.0 license.
    • Please consult each model’s license (see their Hugging Face model pages) for usage terms and attribution requirements.
    • Important: Commercial usage may be restricted under the CC-BY-NC-4.0 license.

References and Acknowledgments

Also see:

  • MMS-LID: https://huggingface.co/facebook/mms-lid-4017
  • MMS-1B-ALL: https://huggingface.co/facebook/mms-1b-all
  • NLLB-200-distilled-600M: https://huggingface.co/facebook/nllb-200-distilled-600M

If you use or build upon this repository, please consider citing or mentioning the original models and referencing Meta’s relevant research.


Enjoy building massively multilingual conversational AI!