RFC: chat completions endpoint with support for receiving and generating audio #231

Closed
fedirz opened this issue Jan 12, 2025 · 7 comments

Comments

@fedirz
Collaborator

fedirz commented Jan 12, 2025

This is what I'm referring to: Audio generation

OpenAI recently added the gpt-4o-audio-preview model, which supports the following input/output combinations:

  • text in → text + audio out
  • audio in → text + audio out
  • audio in → text out
  • text + audio in → text + audio out
  • text + audio in → text out
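
For reference, here is a minimal sketch of how a client might build a request for each of these combinations. It assumes the endpoint mirrors OpenAI's request shape (`modalities`, `audio`, and `input_audio` content parts). The helper name `build_request` and the voice and output format values are illustrative assumptions, not part of speaches.

```python
import base64
from typing import Any, Optional


def build_request(
    text: Optional[str] = None,
    audio_wav: Optional[bytes] = None,
    want_audio: bool = True,
    model: str = "gpt-4o-audio-preview",
) -> dict[str, Any]:
    """Build a chat completions request body for one of the combinations above."""
    if text is None and audio_wav is None:
        raise ValueError("provide text, audio, or both")

    content: list[dict[str, Any]] = []
    if text is not None:
        content.append({"type": "text", "text": text})
    if audio_wav is not None:
        # Audio travels inline as base64, as in the `input_audio` content part.
        encoded = base64.b64encode(audio_wav).decode("ascii")
        content.append(
            {"type": "input_audio", "input_audio": {"data": encoded, "format": "wav"}}
        )

    body: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "modalities": ["text", "audio"] if want_audio else ["text"],
    }
    if want_audio:
        # Voice and output format are assumed example values.
        body["audio"] = {"voice": "alloy", "format": "wav"}
    return body
```

Calling `build_request(text="What is in this recording?", audio_wav=wav_bytes)` would produce the "text + audio in → text + audio out" case, and passing `want_audio=False` gives the text-only output variants.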

I want to create a POST /v1/chat/completions endpoint emulating this functionality. This project will not turn into another LLM runtime like Ollama or vLLM. Instead, I'm thinking of speaches acting as a proxy. Here's the flow I have in mind for text + audio in → text + audio out:

  • Receive the audio inputs
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this recording?"
      },
      {
        "type": "input_audio",
        "input_audio": {
          "data": "...",
          "format": "wav"
        }
      }
    ]
  }
]
  • Transcribe the input audio and transform the messages into something a regular LLM can work with
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this recording?"
      },
      {
        "type": "text",
        "text": "A rainbow is an optical phenomenon caused by refraction, internal reflection and dispersion of light in water droplets resulting in a continuous spectrum of ..."
      }
    ]
  }
]
  • Forward the transformed messages to an OpenAI-compatible endpoint the user specified in the config.
  • Receive the response from an LLM, generate speech from it, and send it back to the user
transformed response with LLM text response + generated speech
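
The flow above could be sketched roughly as follows. This is not the actual speaches implementation. `transcribe`, `complete`, and `synthesize` are hypothetical callables standing in for the speech-to-text model, the configured OpenAI-compatible endpoint, and the text-to-speech model, and the response layout is an assumption.

```python
import copy
from typing import Any, Callable

Message = dict[str, Any]


def replace_audio_with_text(
    messages: list[Message],
    transcribe: Callable[[str, str], str],
) -> list[Message]:
    """Return a copy of `messages` where every `input_audio` part is a `text` part.

    `transcribe(data, format)` is a hypothetical callable that takes the
    base64 audio payload and its format and returns the transcript.
    """
    transformed = copy.deepcopy(messages)
    for message in transformed:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content has no audio parts
        for i, part in enumerate(content):
            if part.get("type") == "input_audio":
                audio = part["input_audio"]
                content[i] = {
                    "type": "text",
                    "text": transcribe(audio["data"], audio["format"]),
                }
    return transformed


def handle_chat_completion(
    body: dict[str, Any],
    transcribe: Callable[[str, str], str],
    complete: Callable[[dict[str, Any]], str],
    synthesize: Callable[[str], str],
) -> dict[str, Any]:
    """Sketch of the proxy flow: transcribe, forward, then synthesize speech."""
    upstream_body = {
        **body,
        "messages": replace_audio_with_text(body["messages"], transcribe),
    }
    # The upstream LLM is text-only, so drop the audio-specific fields.
    upstream_body.pop("modalities", None)
    upstream_body.pop("audio", None)

    reply_text = complete(upstream_body)
    message: dict[str, Any] = {"role": "assistant", "content": reply_text}
    if "audio" in body.get("modalities", []):
        # Assumed response layout: base64 speech plus its transcript.
        message["audio"] = {"data": synthesize(reply_text), "transcript": reply_text}
    return {"choices": [{"index": 0, "message": message}]}
```

Keeping the message transformation separate from the request handling means the same function can be reused for the "audio in → text out" combinations, where no speech is generated.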

Comments and suggestions are welcome!

@fedirz fedirz added the enhancement New feature or request label Jan 12, 2025
@fedirz fedirz added this to the v0.7.0 milestone Jan 12, 2025
@fedirz fedirz self-assigned this Jan 12, 2025
@fedirz fedirz changed the title Add chat completions endpoint with support for receiving and generating audio RFC: chat completions endpoint with support for receiving and generating audio Jan 12, 2025
@fedirz fedirz pinned this issue Jan 12, 2025
@silvacarl2

I think one more component to include in this picture is an LLM. We are looking at Llama-3.2-3B-Instruct, but fine-tuning it to answer phone calls for a doctor's office.

@fedirz
Collaborator Author

fedirz commented Jan 22, 2025

Request caching must be implemented to support multi-turn conversations.
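
One reading of this requirement: in a multi-turn conversation the client resends the whole message history, so audio from earlier turns would be transcribed again on every request. A minimal sketch of a cache that avoids this, assuming transcripts can be keyed by a hash of the audio payload. `TranscriptionCache` and the wrapped `transcribe` callable are hypothetical names, not existing speaches APIs.

```python
import hashlib
from typing import Callable


class TranscriptionCache:
    """In-memory cache of transcripts keyed by a hash of the audio payload."""

    def __init__(self, transcribe: Callable[[str, str], str]) -> None:
        self._transcribe = transcribe
        self._cache: dict[str, str] = {}

    def __call__(self, data: str, audio_format: str) -> str:
        # Identical audio resent in a later turn maps to the same key.
        key = hashlib.sha256(f"{audio_format}:{data}".encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._transcribe(data, audio_format)
        return self._cache[key]
```

A real implementation would likely also need eviction (size or time based), since this sketch keeps every transcript for the lifetime of the process.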

fedirz pushed a commit that referenced this issue Jan 26, 2025
fedirz pushed a commit that referenced this issue Jan 26, 2025
fedirz pushed a commit that referenced this issue Jan 26, 2025
@fedirz fedirz closed this as completed Jan 30, 2025
@fedirz
Collaborator Author

fedirz commented Jan 31, 2025

@silvacarl2 just wanted to let you know that a mix of audio, text, and image inputs is supported in v0.7.0

@fedirz fedirz unpinned this issue Jan 31, 2025
@jadams777

> Request caching must be implemented to support multi-turn conversations.

Is that part of the v0.7.0 release?

@fedirz
Collaborator Author

fedirz commented Jan 31, 2025

No, I haven't added request caching for transcriptions yet. See https://speaches.ai/usage/voice-chat/#limitations

@jadams777

jadams777 commented Jan 31, 2025 via email

@fedirz
Collaborator Author

fedirz commented Feb 1, 2025

> Ok, that link is still down. Thanks.

Fixed in #304. Sorry about that
