This repository contains a fully featured sample script that demonstrates how to connect to the Resemble.AI Live STS server using Socket.IO for real-time speech-to-speech conversion.
- Clone the repository:
  git clone https://github.com/resemble-ai/resemble-live-sts-socket.git
  cd resemble-live-sts-socket
- Install the required dependencies:
  conda create -n socket_demo python=3.11.4
  conda activate socket_demo
  pip install -r requirements.txt
Running the script:
python main.py --url <server_url> --voice <voice>
# Or if you are using authentication
python main.py --url <server_url> --voice <voice> --auth <username:password>
If you do not want to enter your microphone and speaker IDs each time, use the following command with the two IDs you usually choose:
python main.py --url <server_url> --voice <voice> \
--input_device <microphone id> \
--output_device <speaker id>
usage: main.py [-h] --url URL [--auth AUTH] [--debug] [--num_chunks NUM_CHUNKS] [--wave_file_path WAVE_FILE_PATH]
               [--voice VOICE] [--vad VAD] [--gpu GPU] [--extra_convert_size EXTRA_CONVERT_SIZE] [--pitch PITCH]
               [--crossfade_offset_rate CROSSFADE_OFFSET_RATE] [--crossfade_end_rate CROSSFADE_END_RATE]
               [--crossfade_overlap_size CROSSFADE_OVERLAP_SIZE] [--input_device INPUT_DEVICE] [--output_device OUTPUT_DEVICE]

Resemble.AI LiveVC socket sample script. Press Ctrl+C to stop.

options:
  -h, --help            show this help message and exit
  --url URL             URL of the server (required)
  --auth AUTH           ngrok `username:password` for authentication.
  --debug               Enable debug mode for logging.

client parameters:
  --num_chunks NUM_CHUNKS
                        Number of 2880-frame chunks to send to the server (default: 8).
  --wave_file_path WAVE_FILE_PATH
                        Path to save the WAV file (default: output.wav).

voice parameters:
  --voice VOICE         Name of the voice to use for synthesis.
  --vad VAD             VAD level (0: off, 1: low, 2: medium, 3: high) (default: 1).
  --gpu GPU             CUDA device ID (default: 0).
  --extra_convert_size EXTRA_CONVERT_SIZE
                        Amount of context for the server to use (4096, 8192, 16384,
                        32768, 65536, 131072) (default: 4096).
  --pitch PITCH         Pitch factor (default: 0).
  --crossfade_offset_rate CROSSFADE_OFFSET_RATE
                        Crossfade offset rate (0.0 - 1.0) (default: 0.1)
  --crossfade_end_rate CROSSFADE_END_RATE
                        Crossfade end rate (0.0 - 1.0) (default: 0.9).
  --crossfade_overlap_size CROSSFADE_OVERLAP_SIZE
                        Crossfade overlap size (default: 2048).

audio device selection. If not specified the user will be provided with a list of devices to choose from.:
  --input_device INPUT_DEVICE, -i INPUT_DEVICE
                        Index of the input audio device.
  --output_device OUTPUT_DEVICE, -o OUTPUT_DEVICE
                        Index of the output audio device.
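The `--auth` option takes ngrok credentials in `username:password` form. Credentials of that shape are commonly sent as an HTTP Basic `Authorization` header, so the sketch below shows how such a header could be built. Whether main.py constructs the header this way is an assumption, and `basic_auth_header` is a hypothetical helper, not a function from this repository.

```python
import base64


def basic_auth_header(credentials: str) -> dict[str, str]:
    """Turn 'username:password' into an HTTP Basic Authorization header.

    Hypothetical helper for illustration; not part of main.py.
    """
    username, sep, password = credentials.partition(":")
    if not sep or not username:
        raise ValueError("expected credentials in the form username:password")
    # Basic auth is the base64 encoding of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


headers = basic_auth_header("user:secret")
print(headers)
```

A client would then pass these headers along when it opens the connection.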
This implementation uses Socket.IO to connect to the server and achieves real-time voice conversion on an M2 MacBook Air client machine. A lower-level WebSocket implementation is also possible.
- request_conversion:
  - Type: Emit
  - Description: Sends audio data to the server.
  - Data type: AudioData
  - Triggers: on_response
- request_conversion_debug:
  - Type: Emit
  - Description: Identical to request_conversion, but asks the server to return the unconverted audio. This is good for testing the effects of server latency on local audio stitching on clean audio.
  - Data type: AudioData
  - Triggers: on_response
- update_model_settings:
  - Type: Emit
  - Description: Sends updated settings to the server.
  - Data type: VoiceSettings
  - Triggers: on_message
- get_settings:
  - Type: Emit
  - Description: Requests the current settings dict from the server.
  - Triggers: on_message
- get_voices:
  - Type: Emit
  - Description: Requests a list of available voices from the server.
  - Triggers: on_message
- get_gpus:
  - Type: Emit
  - Description: Requests a list of available GPUs from the server.
  - Triggers: on_message
- on_connect:
  - Type: Response
  - Description: Callback for when a connection is established to the server.
- on_disconnect:
  - Type: Response
  - Description: Callback for when the connection to the server is disconnected.
- on_response:
  - Type: Response
  - Description: Called when the client receives the audio response from the server.
  - Data type: MessageResponse
- on_message:
  - Type: Response
  - Description: Called when the client receives a message from the server.
  - Data type: MessageResponse
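The events above pair each emit with the callback that receives its reply, and both reply callbacks receive a MessageResponse. The sketch below restates that pairing as a lookup table and shows one way a callback might summarize a reply for logging. `TRIGGERS` and `summarize_response` are hypothetical names, and treating any status other than 200 as an error is an assumption about the server's behavior.

```python
from http import HTTPStatus
from typing import TypedDict, Union


class MessageResponse(TypedDict):
    status: HTTPStatus
    message: Union[str, dict]
    endpoint: str


# Emit event -> callback that receives the reply, as listed above.
TRIGGERS = {
    "request_conversion": "on_response",
    "request_conversion_debug": "on_response",
    "update_model_settings": "on_message",
    "get_settings": "on_message",
    "get_voices": "on_message",
    "get_gpus": "on_message",
}


def summarize_response(response: MessageResponse) -> str:
    """Return a one-line log string for a MessageResponse-shaped dict."""
    # HTTPStatus(...) also accepts a plain int, in case the status
    # arrives as a bare number rather than an enum member.
    status = HTTPStatus(response["status"])
    # Assumption: anything other than 200 OK is reported as an error.
    outcome = "ok" if status == HTTPStatus.OK else "error"
    return (
        f"[{outcome}] {response['endpoint']} -> "
        f"{status.value} {status.phrase}: {response['message']}"
    )


print(summarize_response({"status": 200, "message": "connected", "endpoint": "demo"}))
```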
All data sent to and from the server will be in the following data types.
from http import HTTPStatus
from typing import Literal, TypedDict


class MessageResponse(TypedDict):
    status: HTTPStatus
    message: str | dict
    endpoint: str


class AudioData(TypedDict):
    timestamp: int  # milliseconds
    audio_data: bytes  # packed little-endian short integers


class VoiceSettings(TypedDict):
    voice: str  # The name of the voice being used
    crossFadeOffsetRate: float  # 0.0 - 1.0
    crossFadeEndRate: float  # 0.0 - 1.0
    crossFadeOverlapSize: int  # 2048
    extraConvertSize: Literal[4096, 8192, 16384, 32768, 65536, 131072]
    gpu: int  # CUDA device ID
    pitch: float
    vad: Literal[0, 1, 2, 3]  # 0: off, 1: low, 2: medium, 3: high
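As a rough illustration of these types, the sketch below packs 16-bit samples into an AudioData payload and builds a VoiceSettings dict from the defaults listed in the main.py help text. The helper names (`make_audio_data`, `unpack_audio_data`, `default_voice_settings`) are hypothetical, and using wall-clock time for the timestamp is an assumption; the types only say the value is in milliseconds.

```python
import struct
import time
from typing import Literal, Optional, TypedDict


class AudioData(TypedDict):
    timestamp: int  # milliseconds
    audio_data: bytes  # packed little-endian short integers


class VoiceSettings(TypedDict):
    voice: str
    crossFadeOffsetRate: float
    crossFadeEndRate: float
    crossFadeOverlapSize: int
    extraConvertSize: Literal[4096, 8192, 16384, 32768, 65536, 131072]
    gpu: int
    pitch: float
    vad: Literal[0, 1, 2, 3]


def make_audio_data(samples: list[int], timestamp_ms: Optional[int] = None) -> AudioData:
    """Pack int16 samples as little-endian shorts ('<h')."""
    if timestamp_ms is None:
        # Assumption: the timestamp is wall-clock time in milliseconds.
        timestamp_ms = time.time_ns() // 1_000_000
    # struct.pack raises struct.error if a sample is outside the int16 range.
    payload = struct.pack(f"<{len(samples)}h", *samples)
    return {"timestamp": timestamp_ms, "audio_data": payload}


def unpack_audio_data(data: AudioData) -> list[int]:
    """Inverse of make_audio_data: bytes back to a list of int16 samples."""
    raw = data["audio_data"]
    count = len(raw) // 2  # two bytes per int16 sample
    return list(struct.unpack(f"<{count}h", raw))


def default_voice_settings(voice: str) -> VoiceSettings:
    """VoiceSettings filled with the defaults from the main.py help text."""
    return {
        "voice": voice,
        "crossFadeOffsetRate": 0.1,
        "crossFadeEndRate": 0.9,
        "crossFadeOverlapSize": 2048,
        "extraConvertSize": 4096,
        "gpu": 0,
        "pitch": 0.0,
        "vad": 1,
    }


packet = make_audio_data([0, 1, -1, 32767, -32768], timestamp_ms=1234)
print(packet["audio_data"].hex())
```

Note that the command-line flags use snake_case (for example `--crossfade_offset_rate`) while the settings keys use camelCase (`crossFadeOffsetRate`).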
These constants are configured to match exactly what the server expects and must not be changed.
- SAMPLERATE = 48000: Sample rate for audio processing.
- CHUNK_SIZE_MULT = 2880: Chunk size multiplier, the number of frames in one chunk (--num_chunks sets how many such chunks are sent).
- AUDIO_FORMAT = 'int16': Audio format.
- ENDPOINT = '/synthesize': Socket endpoint.
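To make these constants concrete, the sketch below works out how much audio a given `--num_chunks` value corresponds to. It assumes a single (mono) channel, which this document does not state, and `buffer_stats` is a hypothetical helper rather than part of the script.

```python
SAMPLERATE = 48000      # frames per second
CHUNK_SIZE_MULT = 2880  # frames per chunk
BYTES_PER_SAMPLE = 2    # 'int16' is two bytes per sample


def buffer_stats(num_chunks: int = 8, channels: int = 1) -> dict:
    """Size of a buffer of `num_chunks` chunks (mono is an assumption)."""
    frames = num_chunks * CHUNK_SIZE_MULT
    return {
        "frames": frames,
        "duration_ms": frames * 1000 / SAMPLERATE,
        "bytes": frames * channels * BYTES_PER_SAMPLE,
    }


print(buffer_stats())
# {'frames': 23040, 'duration_ms': 480.0, 'bytes': 46080}
```

With the default of 8 chunks, each chunk covers 60 ms, so a full buffer holds 480 ms of audio. Lowering `--num_chunks` shrinks that buffer proportionally.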
For more information, please refer to the code comments and docstrings within the scripts.
This project is licensed under MIT. See the LICENSE file for details.