Question about tokenizer #106
> Is there any way to adjust tokenizer parameters for how the tokenizer divides the sentences?

Yes, please look into the code that performs the sentence splitting. Most of the parameters of this method that customize the sentence splitting are also present in the play and play_async methods of RealtimeTTS. For example, there is the context_size parameter of the play method, which is used to establish context for sentence boundary detection by the tokenizers. It sets the number of characters that are additionally presented to the tokenizer after a sentence boundary such as a punctuation mark. A larger context improves the accuracy of detecting sentence boundaries; a smaller context makes it faster.

> May I ask how sentence-splitting is done when the program is configured to feed generator iterators?

When you call the play or play_async method, RealtimeTTS starts to consume the generator(s) and retrieves text chunks. It presents the accumulated chunks to the tokenizer, which then tries to detect a full sentence in them.

> Will the tokenizer be able to split the two sentences?

Yes, it should; at least that's its job :)

> If so, is there any way to adjust the tokenizer for its way to divide the sentences?

Not beyond these parameters.

> Simply put, I would want to know if there is a way to make stream output faster when feeding the engine with generator iterators.

The fast_sentence_fragment parameter "overrides" the tokenizers by also searching for delimiters like "," or "-", which do not mark a full sentence but a "somewhat synthesizable fragment". So if you use it, you can speed up TTS generation for the first retrieved sentence, which can be quite crucial for generating a really fast answer. It gives up a bit of synthesis quality, of course. If you don't mind, you can also fine-tune all the other parameters to "fast" settings, like setting context_size to 2 instead of the default 12, or minimum_sentence_length to 3 or 4 instead of 10.
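A minimal sketch of those "fast" settings applied together; SystemEngine here is just a stand-in for whichever engine you use, and the chunks mimic an LLM stream (the parameter names are the ones discussed above):

from RealtimeTTS import TextToAudioStream, SystemEngine

def chunk_generator():
    # Mimics an LLM yielding small text chunks
    yield "Hello, can you tell me how the sentence splitting works? "
    yield "I want to know the performance difference between nltk and stanza."

stream = TextToAudioStream(SystemEngine())
stream.feed(chunk_generator())
stream.play(
    fast_sentence_fragment=True,  # also synthesize "somewhat synthesizable fragments"
    context_size=2,               # default 12; fewer lookahead characters, faster decisions
    minimum_sentence_length=3     # default 10; accept shorter sentences
)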
I did not find the constructor containing these keywords, so I had to modify the original (package) file. Is there anything I'm missing? Thanks!
Also, is it possible to use another tokenizer (other than nltk or stanza)?
> I did not find the constructor containing these keywords, so I had to modify the original (package) file. Is there anything I'm missing?

The parameters are part of the play and play_async methods.

> Also, I find that the second sentence takes significantly more time to generate than the first (the two sentences are similar in length), even though I changed the parameters in the package file to very aggressive settings. I observe (or presume) that the tokenizer only continues feeding into the stream after ALL of the rest of the chunk is fed. Any idea on what may cause this?

Can you provide example code to reproduce this?

> Also, is it possible to use another tokenizer (other than nltk or stanza)?

Currently not; you'd need to change the code of the stream2sentence library to do that.
Yes, here's the code to reproduce the issue. The example is in Chinese, but I think the issue is still pretty obvious to observe. I am using the Azure engine with voice="zh-CN-XiaoshuangNeural" and stream.language = "zh-CN". The '/' between characters of the data represents a new line. As you can see, each time (mostly) one or two Chinese characters are fed from the generator into the stream.
This is why I observe (or presume) that the tokenizer only continues feeding into the stream after ALL of the rest of the chunk is fed. Thanks again for your help :)
If you use stream.feed(response).play(), then RealtimeTTS will use the standard tokenizer, which is nltk. For Chinese you want stanza:
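A sketch of that, using the tokenizer and language parameters of play (the same ones that appear in the full example further down):

stream.feed(response)
stream.play(
    tokenizer="stanza",  # stanza handles Chinese sentence boundaries better than nltk
    language="zh"        # language hint for the tokenizer
)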
Then what I think is going on goes back to the data. Your first line looks like this: 胡/爷/爷,我/来/给/您/讲/一下/下/周/每/天/的/安/排。
Made some tests; finally I can reproduce it. This does not behave like it should. Maybe something is wrong in stream2sentence; please give me some time to look into that.
I think I have a bugfix for stream2sentence now, and hopefully with that one the problem should be gone. Can you pip install stream2sentence==0.2.4 and tell me if it's better with this one?
Sure, it is better now. Since the requirements in RealtimeTTS aren't updated yet, I manually pulled RealtimeTTS from GitHub, changed the entry to stream2sentence==0.2.4 in its requirements.txt, and built the Python package from it.
It is much better. It doesn't seem to be working for tokenizer=nltk, though.
If I want to switch to another tokenizer (other than nltk or stanza), which files would need to be modified?
> I find the performance cost extremely high (it almost drained my M3 Max CPU); is there any way to mitigate this? If I want to switch to another tokenizer (other than nltk or stanza), which files would need to be modified?
So you can implement your own sentence splitting algorithm or hook in another tokenizer here. Example:

if __name__ == '__main__':
    import re

    def tokenize_sentences(text):
        """
        Splits the input text into sentences using simple heuristics.

        Args:
            text (str): The input text to be split into sentences.

        Returns:
            list: A list of sentences.
        """
        # Define sentence-ending punctuation
        sentence_ends = r'[.!?。\n]'

        # Define abbreviations and other exceptions
        abbreviations = r'\b(Mr|Mrs|Dr|Ms|Sr|Jr|etc|e\.g|i\.e|vs|U\.S\.A|D\.C)\.'

        # Split the text into potential sentences, keeping the delimiters
        potential_sentences = re.split(f'({sentence_ends}(?:\\s|$))', text)

        # Combine the split parts back into sentences
        sentences = []
        current_sentence = ''
        for i, part in enumerate(potential_sentences):
            current_sentence += part
            # Check if this part ends with sentence-ending punctuation
            if re.search(sentence_ends + r'(?:\s|$)', part):
                # Check if the period is part of an abbreviation
                if not re.search(abbreviations + r'$', current_sentence.strip()):
                    # Check if the next part starts with a lowercase letter
                    if i + 1 < len(potential_sentences) and re.match(r'^\s*[a-z]', potential_sentences[i + 1]):
                        continue
                    sentences.append(current_sentence.strip())
                    current_sentence = ''

        # Add any remaining text as the last sentence
        if current_sentence:
            sentences.append(current_sentence.strip())

        return sentences

    # Example usage
    text = "Hello, world! This is a test. Mr. Smith went to Washington D.C. this morning. Is this working?"
    result = tokenize_sentences(text)
    print(result)

    import os
    import time
    from RealtimeTTS import TextToAudioStream, AzureEngine

    engine = AzureEngine(os.environ.get("AZURE_SPEECH_KEY"),
                         os.environ.get("AZURE_SPEECH_REGION"),
                         voice="zh-CN-XiaoshuangNeural")  # alternative: voice="zh-CN-XiaoxiaoNeural"

    stream = TextToAudioStream(engine)

    def line_generator(data):
        # Split the input data by '/' (each piece stands for one streamed chunk)
        lines = data.split('/')
        for line in lines:
            line = line + "。"
            print(f"GEN: {line}")
            time.sleep(0.01)
            yield line

    # The input data
    data = """
胡/爷/爷,我/来/给/您/讲/一下/下/周/每/天/的/安/排。
周/一/:/9:00-10:00:晨/练/太/极/拳/,/地点/:/活/动/室/
10:30-11:30:园/艺/活/动/菠菜/种/植/,/地点/:/花/园/
14:00-15:00:手/工/制/作/睡/眠/香/囊/,/地点/:/手/工/室/
15:30-16:30:观/看/老/电/影/,/地点/:/影/音/室/
"""

    response = line_generator(data)
    stream.feed(response)
    stream.play(
        minimum_sentence_length=1,
        minimum_first_fragment_length=1,
        before_sentence_synthesized=lambda sentence: print("Synthesizing: " + sentence),
        tokenizer="None",
        tokenize_sentences=tokenize_sentences,
        language="zh",
        context_size=2)
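The tokenize_sentences callback shown above is the general hook for any third-party splitter. For instance (pysbd is an external library, not something RealtimeTTS ships with; this is just a sketch), you could delegate to pysbd the same way:

import pysbd

seg = pysbd.Segmenter(language="en", clean=False)

def pysbd_tokenize(text):
    # Let pysbd decide the sentence boundaries
    return seg.segment(text)

stream.play(
    tokenizer="None",                   # bypass the built-in nltk/stanza tokenizers
    tokenize_sentences=pysbd_tokenize   # hook in the custom splitter
)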
Is there any way to adjust tokenizer parameters for how the tokenizer divides the sentences? May I ask how sentence-splitting is done when the program is configured to feed generator iterators?
For example, if I feed this string (or whatever you call it) word by word (punctuation counts as one word):
"Hello, can you tell me how the sentence splitting works? I want to know the performance difference between nltk and stanza."
Will the tokenizer be able to split the two sentences? If so, is there any way to adjust the tokenizer for its way to divide the sentences?
Or perhaps it just waits until the entire string is received and only then calls the TTS engine.
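Concretely, the word-by-word feeding described above might look like this (a sketch; SystemEngine is just a stand-in for whatever engine is used):

import time
from RealtimeTTS import TextToAudioStream, SystemEngine

text = ("Hello, can you tell me how the sentence splitting works? "
        "I want to know the performance difference between nltk and stanza.")

def word_generator():
    # Yield one word at a time; punctuation stays attached to its word
    for word in text.split(" "):
        time.sleep(0.05)  # simulate words arriving one by one
        yield word + " "

stream = TextToAudioStream(SystemEngine())
stream.feed(word_generator())
# before_sentence_synthesized shows where the tokenizer decided to split
stream.play(before_sentence_synthesized=lambda s: print("SENTENCE:", s))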
I am using a language which is not English, but I noticed little difference between the tokenizer set to "nltk" or "stanza".
Simply put, I would want to know if there is a way to make the stream output faster when feeding the engine with generator iterators.
I'd really appreciate it if you could help!
Thanks