openapi | description | mode |
---|---|---|
post /v1/text-to-speech/{voice_id}/with-timestamps |
Converts text into audio together with timestamps on when which word was spoken. |
wide |
You can generate audio together with information on when which character was spoken using the following script:
import requests
import json
import base64
VOICE_ID = "21m00Tcm4TlvDq8ikWAM" # Rachel
YOUR_XI_API_KEY = "ENTER_YOUR_API_KEY_HERE"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/with-timestamps"
headers = {
"Content-Type": "application/json",
"xi-api-key": YOUR_XI_API_KEY
}
data = {
"text": (
"Born and raised in the charming south, "
"I can add a touch of sweet southern hospitality "
"to your audiobooks and podcasts"
),
"model_id": "eleven_multilingual_v2",
"voice_settings": {
"stability": 0.5,
"similarity_boost": 0.75
}
}
response = requests.post(
url,
json=data,
headers=headers,
)
if response.status_code != 200:
print(f"Error encountered, status: {response.status_code}, "
f"content: {response.text}")
quit()
# convert the response which contains bytes into a JSON string from utf-8 encoding
json_string = response.content.decode("utf-8")
# parse the JSON string and load the data as a dictionary
response_dict = json.loads(json_string)
# the "audio_base64" entry in the dictionary contains the audio as a base64 encoded string,
# we need to decode it into bytes in order to save the audio as a file
audio_bytes = base64.b64decode(response_dict["audio_base64"])
with open('output.mp3', 'wb') as f:
f.write(audio_bytes)
# the 'alignment' entry contains the mapping between input characters and their timestamps
print(response_dict['alignment'])
This prints out a dictionary like:
{
'characters': ['B', 'o', 'r', 'n', ' ', 'a', 'n', 'd', ' ', 'r', 'a', 'i', 's', 'e', 'd', ' ', 'i', 'n', ' ', 't', 'h', 'e', ' ', 'c', 'h', 'a', 'r', 'm', 'i', 'n', 'g', ' ', 's', 'o', 'u', 't', 'h', ',', ' ', 'I', ' ', 'c', 'a', 'n', ' ', 'a', 'd', 'd', ' ', 'a', ' ', 't', 'o', 'u', 'c', 'h', ' ', 'o', 'f', ' ', 's', 'w', 'e', 'e', 't', ' ', 's', 'o', 'u', 't', 'h', 'e', 'r', 'n', ' ', 'h', 'o', 's', 'p', 'i', 't', 'a', 'l', 'i', 't', 'y', ' ', 't', 'o', ' ', 'y', 'o', 'u', 'r', ' ', 'a', 'u', 'd', 'i', 'o', 'b', 'o', 'o', 'k', 's', ' ', 'a', 'n', 'd', ' ', 'p', 'o', 'd', 'c', 'a', 's', 't', 's'],
'character_start_times_seconds': [0.0, 0.186, 0.279, 0.348, 0.406, 0.441, 0.476, 0.499, 0.522, 0.58, 0.65, 0.72, 0.778, 0.824, 0.882, 0.906, 0.952, 0.975, 1.01, 1.045, 1.068, 1.091, 1.115, 1.149, 1.196, 1.254, 1.3, 1.358, 1.416, 1.474, 1.498, 1.521, 1.602, 1.66, 1.811, 1.869, 1.927, 1.974, 2.009, 2.043, 2.067, 2.136, 2.183, 2.218, 2.252, 2.287, 2.322, 2.357, 2.392, 2.426, 2.45, 2.508, 2.531, 2.589, 2.635, 2.682, 2.717, 2.763, 2.786, 2.81, 2.879, 2.937, 3.007, 3.065, 3.123, 3.17, 3.239, 3.286, 3.367, 3.402, 3.437, 3.46, 3.483, 3.529, 3.564, 3.599, 3.634, 3.68, 3.75, 3.82, 3.889, 3.971, 4.087, 4.168, 4.214, 4.272, 4.331, 4.389, 4.412, 4.447, 4.528, 4.551, 4.574, 4.609, 4.644, 4.702, 4.748, 4.807, 4.865, 4.923, 5.016, 5.074, 5.12, 5.155, 5.201, 5.248, 5.283, 5.306, 5.329, 5.352, 5.41, 5.457, 5.573, 5.654, 5.735, 5.886, 5.944, 6.06],
'character_end_times_seconds': [0.186, 0.279, 0.348, 0.406, 0.441, 0.476, 0.499, 0.522, 0.58, 0.65, 0.72, 0.778, 0.824, 0.882, 0.906, 0.952, 0.975, 1.01, 1.045, 1.068, 1.091, 1.115, 1.149, 1.196, 1.254, 1.3, 1.358, 1.416, 1.474, 1.498, 1.521, 1.602, 1.66, 1.811, 1.869, 1.927, 1.974, 2.009, 2.043, 2.067, 2.136, 2.183, 2.218, 2.252, 2.287, 2.322, 2.357, 2.392, 2.426, 2.45, 2.508, 2.531, 2.589, 2.635, 2.682, 2.717, 2.763, 2.786, 2.81, 2.879, 2.937, 3.007, 3.065, 3.123, 3.17, 3.239, 3.286, 3.367, 3.402, 3.437, 3.46, 3.483, 3.529, 3.564, 3.599, 3.634, 3.68, 3.75, 3.82, 3.889, 3.971, 4.087, 4.168, 4.214, 4.272, 4.331, 4.389, 4.412, 4.447, 4.528, 4.551, 4.574, 4.609, 4.644, 4.702, 4.748, 4.807, 4.865, 4.923, 5.016, 5.074, 5.12, 5.155, 5.201, 5.248, 5.283, 5.306, 5.329, 5.352, 5.41, 5.457, 5.573, 5.654, 5.735, 5.886, 5.944, 6.06, 6.548]
}
As you can see this dictionary contains three lists of the same size. For example response_dict['alignment']['characters'][3] contains the fourth character in the text you provided 'n', response_dict['alignment']['character_start_times_seconds'][3] and response_dict['alignment']['character_end_times_seconds'][3] contain its start (0.348 seconds) and end (0.406 seconds) timestamps.