How to fix the voice across generations ? #554

Open
5 tasks done
MankaranSingh opened this issue Sep 15, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@MankaranSingh

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English; otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

When I generate speech from the web UI, it samples a random voice each time. How can I keep the generated voice fixed? I can help with a PR.

2. Additional context or comments

No response

3. Can you help us with this feature?

  • I am interested in contributing to this feature.

MankaranSingh added the enhancement (New feature or request) label on Sep 15, 2024

@leng-yue
Member

leng-yue commented Sep 16, 2024

You can add a reference audio to pin the timbre.

@czkoko

czkoko commented Sep 16, 2024

@leng-yue Reference audio pins the timbre well, but the speed and pauses still seem random, and lowering the temperature does not fix this.
I hope punctuation can be used to control the pause length between words and sentences. Sometimes the pauses between sentences are extremely short and unnatural.

@leng-yue
Member

leng-yue commented Sep 16, 2024

Did you include proper punctuation in your reference text?

@czkoko

czkoko commented Sep 16, 2024

> Did you include proper punctuation in your reference text?

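Yes. The reference audio is a high-quality, natural-sounding voice synthesized with Microsoft Speech, and the reference text also uses reasonable punctuation.
With the same input text, the same default parameters, and the same reference audio, repeated generations have the same timbre, but their speed, prosody, and sentence pause lengths differ.
For example, the following samples:
据澎湃新闻消息,上海受台风影响迎来强风雨天气,当地两大外卖平台及生鲜电商对此表示,已经着手采取各项极端天气应对措施,并为骑手配备雨衣、防水套等装备。
(English: According to The Paper, Shanghai is seeing strong wind and rain because of the typhoon. The two major local food delivery platforms and fresh-grocery e-commerce companies said they have begun taking extreme-weather response measures and have equipped riders with raincoats, waterproof covers, and other gear.)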
result.zip

@leng-yue
Member

Since it's an autoregressive model, different speed and prosody across generations is expected behavior. Does this cause any issues on your side?
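
To illustrate why an autoregressive sampler varies between runs, here is a minimal toy sketch. It is not Fish Speech code: `toy_logits` and `sample_sequence` are hypothetical names, the four-token "model" is made up, and the thread does not say whether the web UI exposes a seed. It only shows that each token is drawn from a distribution conditioned on the tokens before it, so two runs can diverge unless the random state is the same.

```python
import math
import random


def toy_logits(prefix):
    """Hypothetical stand-in for a model: scores for 4 tokens given the prefix."""
    logits = [1.0, 0.5, 0.2, 0.1]
    if prefix:
        # Make the next-token scores depend on the previously sampled token.
        logits[(prefix[-1] + 1) % 4] += 0.3
    return logits


def sample_sequence(length, temperature, seed):
    """Sample tokens one at a time, each conditioned on the tokens so far."""
    rng = random.Random(seed)  # private RNG so the seed fully determines the run
    tokens = []
    for _ in range(length):
        logits = toy_logits(tokens)
        peak = max(logits)
        # Temperature-scaled softmax weights (unnormalized; choices() normalizes).
        weights = [math.exp((x - peak) / temperature) for x in logits]
        tokens.append(rng.choices(range(len(weights)), weights=weights)[0])
    return tokens


run_a = sample_sequence(20, temperature=0.7, seed=1)
run_b = sample_sequence(20, temperature=0.7, seed=2)
run_c = sample_sequence(20, temperature=0.7, seed=1)

print(run_a == run_c)  # same seed, same inputs: identical sequence
print(run_a)
print(run_b)  # a different seed is free to produce a different sequence
```

In this toy setting, reusing the seed reproduces the sequence exactly, while a lower temperature only sharpens the distribution and still leaves sampling random.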

3 participants