RFC: chat completions endpoint with support for receiving and generating audio #231

fedirz · 2025-01-12T03:46:33Z

This is what I'm referring to: Audio generation

OpenAI has recently added the gpt-4o-audio-preview model, which supports the following input/output combinations:

text in → text + audio out
audio in → text + audio out
audio in → text out
text + audio in → text + audio out
text + audio in → text out

I want to create a POST /v1/chat/completions endpoint emulating this functionality. This project will not turn into another LLM runtime like Ollama or vLLM; instead, I'm thinking of speaches acting as a proxy. Here's the flow I have in mind for text + audio in → text + audio out:

Receive the audio inputs:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this recording?"
      },
      {
        "type": "input_audio",
        "input_audio": {
          "data": "...",
          "format": "wav"
        }
      }
    ]
  }
]

Transcribe the input audio and transform the messages into something a regular LLM can work with:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this recording?"
      },
      {
        "type": "text",
        "text": "A rainbow is an optical phenomenon caused by refraction, internal reflection and dispersion of light in water droplets resulting in a continuous spectrum of ..."
      }
    ]
  }
]

Forward the transformed messages to an OpenAI-compatible endpoint the user specified in the config.
Receive the response from the LLM, generate speech from it, and send both back to the user.

transformed response with LLM text response + generated speech
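
The transcription step in the flow above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the speaches implementation: `transform_messages` and the `transcribe` callable are hypothetical names, and the real transcription backend is left abstract.

```python
from typing import Any, Callable

# Hypothetical signature: takes the base64 audio payload and its format,
# returns the transcript. The real backend is not specified in this RFC.
Transcriber = Callable[[str, str], str]


def transform_messages(
    messages: list[dict[str, Any]], transcribe: Transcriber
) -> list[dict[str, Any]]:
    """Return a copy of `messages` with every `input_audio` content part
    replaced by a `text` part holding its transcript."""
    transformed = []
    for message in messages:
        content = message.get("content")
        # Plain string content (or no content) has no audio parts to replace.
        if not isinstance(content, list):
            transformed.append(dict(message))
            continue
        new_parts = []
        for part in content:
            if part.get("type") == "input_audio":
                audio = part["input_audio"]
                transcript = transcribe(audio["data"], audio["format"])
                new_parts.append({"type": "text", "text": transcript})
            else:
                # Text (and any other) parts pass through unchanged.
                new_parts.append(part)
        transformed.append({**message, "content": new_parts})
    return transformed
```

The resulting list contains only parts a text-only LLM understands, so it can be forwarded as-is to the configured OpenAI-compatible endpoint.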

Comments and suggestions are welcome!


silvacarl2 · 2025-01-14T15:53:45Z

I think one more component to include in this picture is an LLM. We are looking at Llama-3.2-3B-Instruct, but fine-tuning it for answering phone calls for a doctor's office.

fedirz · 2025-01-22T03:39:34Z

Request caching must be implemented to support multi-turn conversations.
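
One way to read this requirement: chat completion requests are stateless, so a multi-turn client resends earlier audio parts on every turn, and without a cache each of them would be transcribed again. A minimal sketch of caching transcriptions by content hash, assuming a hypothetical `transcribe(data, audio_format)` callable (the class name and in-memory dict are also illustrative, not part of speaches):

```python
import hashlib
from typing import Callable


class TranscriptionCache:
    """Wraps a transcriber and reuses results for audio already seen."""

    def __init__(self, transcribe: Callable[[str, str], str]) -> None:
        self._transcribe = transcribe
        self._cache: dict[str, str] = {}
        self.misses = 0  # number of times the real transcriber was called

    def __call__(self, data: str, audio_format: str) -> str:
        # Key on a hash of the payload so large base64 strings are not
        # kept as dictionary keys.
        key = hashlib.sha256(f"{audio_format}:{data}".encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._transcribe(data, audio_format)
        return self._cache[key]
```

An in-memory dict grows without bound; a real deployment would likely need eviction or an external store.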

fedirz · 2025-01-31T01:46:37Z

@silvacarl2 just wanted to let you know that a mix of audio, text, and image inputs is supported in v0.7.0

jadams777 · 2025-01-31T05:19:42Z

> Request caching must be implemented to support multi-turn conversations.

Is that part of the v0.7.0 release?

fedirz · 2025-01-31T14:57:59Z

No, I haven't added request caching for transcriptions yet. See https://speaches.ai/usage/voice-chat/#limitations

jadams777 · 2025-01-31T15:14:19Z

Ok, that link is still down. Thanks.


fedirz · 2025-02-01T03:04:59Z

> Ok, that link is still down. Thanks.

Fixed in #304. Sorry about that.

fedirz added the enhancement New feature or request label Jan 12, 2025

fedirz added this to the v0.7.0 milestone Jan 12, 2025

fedirz self-assigned this Jan 12, 2025

fedirz changed the title ~~Add chat completions endpoint with support for receiving and generating audio~~ RFC: chat completions endpoint with support for receiving and generating audio Jan 12, 2025

fedirz pinned this issue Jan 12, 2025

fedirz mentioned this issue Jan 14, 2025

Live Transcription using websocket doest not work. #111

Open

fedirz pushed a commit that referenced this issue Jan 26, 2025

feat: audio chat (#231)

56168ea

fedirz mentioned this issue Jan 26, 2025

feat/audio chat 231 #280

Merged

fedirz pushed a commit that referenced this issue Jan 26, 2025

feat: audio chat (#231)

c308555

fedirz pushed a commit that referenced this issue Jan 26, 2025

feat: audio chat (#231)

3afee5f

fedirz closed this as completed Jan 30, 2025

fedirz unpinned this issue Jan 31, 2025
