
Telegram Bot and Coqui Improvements #144

Closed · wants to merge 11 commits

Conversation

Contributor

@zaptrem zaptrem commented May 24, 2023

Note: This PR description was generated in part (with lots of re-prompting/editing) with Bing-GPT4! It would be neat to automate this.

This pull request introduces a voice-to-voice Telegram bot that shows off Coqui TTS's prompt-to-voice and (soon) audio-to-voice models. The pull request consists of two main parts:

1. Coqui Synthesizer Changes

The CoquiSynthesizer class now supports asynchronous parallel synthesis of large audio segments via a new async_synthesize method, which is also added to the BaseSynthesizer class. The change adds one new dependency (aiohttp) to enable non-blocking HTTP calls, plus one (SpeechRecognition) that the library already uses but was missing from the poetry dependencies.

The async_synthesize method works as follows:

  • It splits a long text into smaller chunks of less than 250 characters, which is the maximum length that Coqui TTS can handle at once.
  • It creates a task for each chunk using asyncio.create_task(), which schedules the chunk's coroutine and returns a Task object that can be awaited later.
  • It waits for all tasks to complete using asyncio.gather(), which returns a list of results in the same order as the tasks.
  • It concatenates and returns the results as an AudioSegment object.

An example of using the async_synthesize method is:

from vocode.turn_based.synthesizer.coqui_synthesizer import CoquiSynthesizer
from pydub import AudioSegment
import asyncio

synth = CoquiSynthesizer()
text = "Insert a long message that needs to be synthesized in chunks of fewer than 250 characters."
audio = asyncio.run(synth.async_synthesize(text)) # type: AudioSegment
audio.export("output.wav", format="wav")
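The split-then-gather pattern described above can be sketched in miniature. This is not the PR's implementation: synthesize_chunk below stands in for the real aiohttp POST to Coqui, and bytes stand in for pydub AudioSegments, but the task creation and order-preserving gather have the same shape.

```python
import asyncio
from typing import List

MAX_TEXT_LENGTH = 250  # Coqui's per-request limit, per the PR description


def split_text(text: str, max_len: int = MAX_TEXT_LENGTH) -> List[str]:
    """Greedily pack whole words into chunks of at most max_len characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        proposed = f"{current} {word}".strip()
        if len(proposed) > max_len:
            chunks.append(current)
            current = word  # note: a single word longer than max_len still overflows
        else:
            current = proposed
    if current:
        chunks.append(current)
    return chunks


async def synthesize_chunk(chunk: str) -> bytes:
    # Stand-in for the real non-blocking HTTP call to the TTS API.
    await asyncio.sleep(0)
    return chunk.encode()


async def async_synthesize(text: str) -> bytes:
    chunks = split_text(text)
    # One task per chunk so the requests run concurrently.
    tasks = [asyncio.create_task(synthesize_chunk(c)) for c in chunks]
    results = await asyncio.gather(*tasks)  # results keep task order
    return b" ".join(results)  # the real code concatenates AudioSegments instead
```

Because asyncio.gather preserves task order, the concatenated output matches the original text order even though the per-chunk requests overlap in time.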

2. Prompt-To-Voice Telegram Bot

A demonstration of how Coqui TTS can be integrated with Vocode and Telegram to create engaging voice applications.

The bot (based on albertwujj's work) uses the python-telegram-bot library to handle user messages and commands, the WhisperTranscriber class to transcribe users' voice messages, and the ChatGPTAgent class to generate text responses from a system prompt and the user input. The system prompt is customized with the name and description of the current voice. The bot also lets the user select or create different voices using Coqui TTS's voice creation APIs.

The bot supports the following commands for the user to interact with it:

  • /start: Initializes the user data and sends a welcome message.
  • /voice <voice_id>: Changes the current voice to the one with the given id and resets the conversation. The voice id must be an integer corresponding to one of the available voices.
  • /create <voice_description>: Creates a new Coqui TTS voice from a text prompt and switches to it. The voice description must be a string that describes how the voice should sound.
  • /list: Lists all the available voices with their ids, names, and descriptions (if any).
  • /who: Shows the name and description (if any) of the current voice.
  • /help: Shows a help message with all the available commands.
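As an illustration of the /voice validation described above, here is a hypothetical helper (handle_voice_command is an invented name, not code from this PR) that checks the integer id against the voice list before switching:

```python
from typing import List, Optional, Tuple

# Tuples of (synthesizer voice id, nickname, description), as in the PR.
Voice = Tuple[str, str, Optional[str]]

DEFAULT_VOICES: List[Voice] = [
    ("d2bd7ccb-1b65-4005-9578-32c4e02d8ddf", "Coqui Default", None)
]


def handle_voice_command(args: List[str], voices: List[Voice]) -> Tuple[bool, str]:
    """Validate '/voice <voice_id>' arguments and report the selected voice."""
    if len(args) != 1 or not args[0].isdigit():
        return False, "Usage: /voice <voice_id>, where <voice_id> is an integer."
    idx = int(args[0])
    if idx >= len(voices):
        return False, f"No voice with id {idx}. Use /list to see available voices."
    _, name, _ = voices[idx]
    return True, f"Switched to voice {idx} ({name}). Conversation reset."
```

In the actual bot these checks would run inside the python-telegram-bot command handler, with the reply text sent back to the chat.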

TODO:

  • Get feedback on code cleanliness and readability.
  • Test the changes to CoquiSynthesizer more thoroughly and handle possible errors or edge cases.
  • Make the bot work on Replit including use of ReplitDB for conversation/voice persistence between instances.
  • Evaluate if the InMemoryDB wrapper is the best way to handle non-existent users or if there is a better alternative.
  • Add voice cloning from audio clip feature using Coqui TTS's clone endpoint.
  • Implement Coqui improvements/fix in streaming synthesizer or create separate issue.
  • Investigate issue on Coqui's side relating to dropped sentences and/or switch to their new (but slower) model.

@zaptrem zaptrem requested a review from ajar98 May 24, 2023 07:00
@zaptrem zaptrem removed the request for review from ajar98 May 24, 2023 07:06
@Kian1354 (Collaborator) left a comment

overall looks really good! left minor nits/ideas

@@ -0,0 +1,34 @@
# client_backend
Collaborator:

nit, there are some references to client backend/vocode react sdk

Contributor Author:

fixed


WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r requirements.txt
Collaborator:

let's use poetry for everything

Contributor Author:

done

from vocode.turn_based.synthesizer.stream_elements_synthesizer import (
StreamElementsSynthesizer,
)
from vocode.turn_based.synthesizer.eleven_labs_synthesizer import ElevenLabsSynthesizer
Collaborator:

nit but i think you can actually import all of these together since we have an init file now

Contributor Author:

done

SYNTH = CoquiSynthesizer()

# Array of tuples (synthesizer's voice id, nickname, description if text to voice)
DEFAULT_VOICES = [("d2bd7ccb-1b65-4005-9578-32c4e02d8ddf", "Coqui Default", None)]
Collaborator:

what's the first value here? wondering if it should be hard coded

Contributor Author:

done



class VocodeBotResponder:
def __init__(self, transcriber, system_prompt, synthesizer, db=None):
Collaborator:

nit: can we type these (and the rest of the pr)? should run mypy

Contributor Author:

done

self, update: Update, context: ContextTypes.DEFAULT_TYPE
):
chat_id = update.effective_chat.id
_, name, description = self.get_current_voice(chat_id)
Collaborator:

based on above, should just be name

- Use /help to see this help message again.
"""
if type(self.synthesizer) is CoquiSynthesizer:
help_text += "\n- Use /create <voice_description> to create a new Coqui TTS voice from a text prompt and switch to it."
Collaborator:

nit +=

can use f""


COQUI_BASE_URL = "https://app.coqui.ai/api/v2/"
DEFAULT_SPEAKER_ID = "d2bd7ccb-1b65-4005-9578-32c4e02d8ddf"
MAX_TEXT_LENGTH = 250 # The maximum length of text that can be synthesized at once
Collaborator:

let's include this as a parameter? and default to 250

response = requests.post(url, headers=headers, json=body)
assert response.ok, response.text
sample = response.json()
response = requests.get(sample["audio_url"])
return AudioSegment.from_wav(io.BytesIO(response.content)) # type: ignore

def split_text(self, text: str) -> List[str]:
Collaborator:

idea: this implementation (via GPT-4) may be easier to read:

def split_text(self, text: str) -> List[str]:
    sentence_enders = re.compile("[.!?]")
    sentences = sentence_enders.split(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        # Join with a space so consecutive sentences don't run together.
        proposed_chunk = f"{current_chunk} {sentence}".strip()
        if len(proposed_chunk) > 250:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + "."
        else:
            current_chunk = proposed_chunk + "."

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

Contributor Author:

done

CoquiTTSSynthesizer: "speaker",
RimeSynthesizer: "speaker",
}
assert set(voice_attr_of.keys()) == set(
Collaborator:

we can remove these asserts

Contributor Author:

If I don't have these I get type errors.

api_key: Optional[str] = None,
):
self.voice_id = voice_id or DEFAULT_SPEAKER_ID
self.voice_prompt = voice_prompt
self.xtts = xtts
Collaborator:

nit: maybe something more descriptive like enable_xtts

Contributor Author:

done

@zaptrem zaptrem closed this May 30, 2023
@zaptrem zaptrem reopened this May 30, 2023

zaptrem commented May 30, 2023

@Kian1354 replaced by #172 and #173

@zaptrem zaptrem closed this May 30, 2023