Update docs (#638)
* Linux pyaudio dependencies

* revert generate.py

* Better bug report & feat request

* Auto-select torchaudio backend

* safety

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: manual seed for restore

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Gradio > 5

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix docs and code

* Update help docs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
AnyaCoder and pre-commit-ci[bot] authored Oct 25, 2024
1 parent e37a445 commit f8a57fb
Showing 12 changed files with 57 additions and 126 deletions.
2 changes: 1 addition & 1 deletion docs/en/finetune.md
@@ -109,7 +109,7 @@ python fish_speech/train.py --config-name text2semantic_finetune \
!!! note
For Windows users, you can use `trainer.strategy.process_group_backend=gloo` to avoid `nccl` issues.
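    As a sketch, the full invocation might look like the following; the `project` value and the LoRA override are assumed placeholders from the finetune guide, not part of this commit:

    ```bash
    # Sketch only: finetune command with the gloo override for Windows.
    python fish_speech/train.py --config-name text2semantic_finetune \
        project=my_speaker \
        +lora@model.model.lora_config=r_8_alpha_16 \
        trainer.strategy.process_group_backend=gloo
    ```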

-After training is complete, you can refer to the [inference](inference.md) section, and use `--speaker SPK1` to generate speech.
+After training is complete, you can refer to the [inference](inference.md) section to generate speech.

!!! info
By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
7 changes: 6 additions & 1 deletion docs/en/inference.md
@@ -74,7 +74,7 @@ python -m tools.api \
--decoder-config-name firefly_gan_vq
```

-If you want to speed up inference, you can add the --compile parameter.
+> If you want to speed up inference, you can add the `--compile` parameter.
After that, you can view and test the API at http://127.0.0.1:8080/.

@@ -107,6 +107,10 @@ The above command synthesizes the desired `MP3` format audio based on the inform
You can also use `--reference_id` (only one can be used) instead of `--reference-audio` and `--reference_text`, provided that you create a `references/<your reference_id>` folder in the project root directory, which contains any audio and annotation text.
The currently supported reference audio has a maximum total duration of 90 seconds.
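
For example, a minimal sketch assuming a hypothetical folder `references/my_voice` containing `sample.wav` and `sample.lab` (these names are placeholders, not part of this commit):

```bash
# Hypothetical layout:
#   references/my_voice/sample.wav   <- reference audio
#   references/my_voice/sample.lab   <- its transcription
python -m tools.post_api \
    --text "Text to be synthesized" \
    --reference_id "my_voice" \
    --format wav \
    --streaming True
```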


!!! info
To learn more about available parameters, you can use the command `python -m tools.post_api -h`

## GUI Inference
[Download client](https://github.com/AnyaCoder/fish-speech-gui/releases)

@@ -120,6 +124,7 @@ python -m tools.webui \
--decoder-checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```
> If you want to speed up inference, you can add the `--compile` parameter.
!!! note
You can save the label file and reference audio file in advance to the `references` folder in the main directory (which you need to create yourself), so that you can directly call them in the WebUI.
2 changes: 1 addition & 1 deletion docs/ja/finetune.md
@@ -109,7 +109,7 @@ python fish_speech/train.py --config-name text2semantic_finetune \
!!! note
Windowsユーザーの場合、`trainer.strategy.process_group_backend=gloo` を使用して `nccl` の問題を回避できます。

-トレーニングが完了したら、[推論](inference.md)セクションを参照し、`--speaker SPK1` を使用して音声を生成します
+トレーニングが完了したら、[推論](inference.md)セクションを参照し、音声を生成します

!!! info
デフォルトでは、モデルは話者の発話パターンのみを学習し、音色は学習しません。音色の安定性を確保するためにプロンプトを使用する必要があります。
54 changes: 4 additions & 50 deletions docs/ja/inference.md
@@ -74,7 +74,7 @@ python -m tools.api \
--decoder-config-name firefly_gan_vq
```

-推論を高速化したい場合は、--compile パラメータを追加できます。
+> 推論を高速化したい場合は、`--compile` パラメータを追加できます。
その後、`http://127.0.0.1:8080/`で API を表示およびテストできます。

@@ -90,55 +90,8 @@ python -m tools.post_api \

上記のコマンドは、参照音声の情報に基づいて必要な音声を合成し、ストリーミング方式で返すことを示しています。

`{SPEAKER}``{EMOTION}`に基づいて参照音声をランダムに選択する必要がある場合は、以下の手順に従って設定します:

### 1. プロジェクトのルートディレクトリに`ref_data`フォルダを作成します。

### 2. `ref_data`フォルダ内に次のような構造のディレクトリを作成します。

```
.
├── SPEAKER1
│ ├──EMOTION1
│ │ ├── 21.15-26.44.lab
│ │ ├── 21.15-26.44.wav
│ │ ├── 27.51-29.98.lab
│ │ ├── 27.51-29.98.wav
│ │ ├── 30.1-32.71.lab
│ │ └── 30.1-32.71.flac
│ └──EMOTION2
│ ├── 30.1-32.71.lab
│ └── 30.1-32.71.mp3
└── SPEAKER2
└─── EMOTION3
├── 30.1-32.71.lab
└── 30.1-32.71.mp3
```

つまり、まず`ref_data``{SPEAKER}`フォルダを配置し、各スピーカーの下に`{EMOTION}`フォルダを配置し、各感情フォルダの下に任意の数の音声-テキストペアを配置します

### 3. 仮想環境で以下のコマンドを入力します.

```bash
python tools/gen_ref.py

```

参照ディレクトリを生成します。

### 4. API を呼び出します。

```bash
python -m tools.post_api \
--text "入力するテキスト" \
--speaker "${SPEAKER1}" \
--emotion "${EMOTION1}" \
--streaming True

```

上記の例はテスト目的のみです。
!!! info
使用可能なパラメータの詳細については、コマンド` python -m tools.post_api -h `を使用してください

## WebUI 推論

@@ -150,6 +103,7 @@ python -m tools.webui \
--decoder-checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```
> 推論を高速化したい場合は、`--compile` パラメータを追加できます。
!!! note
ラベルファイルと参照音声ファイルをメインディレクトリの `references` フォルダ(自分で作成する必要があります)に事前に保存しておくことで、WebUI で直接呼び出すことができます。
2 changes: 1 addition & 1 deletion docs/pt/finetune.md
@@ -109,7 +109,7 @@ python fish_speech/train.py --config-name text2semantic_finetune \
!!! note
Para usuários do Windows, é recomendado usar `trainer.strategy.process_group_backend=gloo` para evitar problemas com `nccl`.

-Após concluir o treinamento, consulte a seção [inferência](inference.md), e use `--speaker SPK1` para gerar fala.
+Após concluir o treinamento, consulte a seção [inferência](inference.md).

!!! info
Por padrão, o modelo aprenderá apenas os padrões de fala do orador e não o timbre. Ainda pode ser preciso usar prompts para garantir a estabilidade do timbre.
50 changes: 4 additions & 46 deletions docs/pt/inference.md
@@ -74,7 +74,7 @@ python -m tools.api \
--decoder-config-name firefly_gan_vq
```

-Para acelerar a inferência, adicione o parâmetro `--compile`.
+> Para acelerar a inferência, adicione o parâmetro `--compile`.
Depois disso, é possível visualizar e testar a API em http://127.0.0.1:8080/.

@@ -90,51 +90,8 @@ python -m tools.post_api \

O comando acima indica a síntese do áudio desejada de acordo com as informações do áudio de referência e a retorna em modo de streaming.

Caso selecione, de forma aleatória, o áudio de referência com base em `{SPEAKER}` e `{EMOTION}`, o configure de acordo com as seguintes etapas:

### 1. Crie uma pasta `ref_data` no diretório raiz do projeto.

### 2. Crie uma estrutura de diretórios semelhante à seguinte dentro da pasta `ref_data`.

```
.
├── SPEAKER1
│ ├──EMOTION1
│ │ ├── 21.15-26.44.lab
│ │ ├── 21.15-26.44.wav
│ │ ├── 27.51-29.98.lab
│ │ ├── 27.51-29.98.wav
│ │ ├── 30.1-32.71.lab
│ │ └── 30.1-32.71.flac
│ └──EMOTION2
│ ├── 30.1-32.71.lab
│ └── 30.1-32.71.mp3
└── SPEAKER2
└─── EMOTION3
├── 30.1-32.71.lab
└── 30.1-32.71.mp3
```

Ou seja, primeiro coloque as pastas `{SPEAKER}` em `ref_data`, depois coloque as pastas `{EMOTION}` em cada pasta de orador (speaker) e coloque qualquer número de `pares áudio-texto` em cada pasta de emoção.

### 3. Digite o seguinte comando no ambiente virtual

```bash
python tools/gen_ref.py

```

### 4. Chame a API.

```bash
python -m tools.post_api \
--text "Texto a ser inserido" \
--speaker "${SPEAKER1}" \
--emotion "${EMOTION1}" \
--streaming True
```

O exemplo acima é apenas para fins de teste.
!!! info
Para aprender mais sobre parâmetros disponíveis, você pode usar o comando `python -m tools.post_api -h`

## Inferência por WebUI

@@ -146,6 +103,7 @@ python -m tools.webui \
--decoder-checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```
> Para acelerar a inferência, adicione o parâmetro `--compile`.
!!! note
Você pode salvar antecipadamente o arquivo de rótulos e o arquivo de áudio de referência na pasta `references` do diretório principal (que você precisa criar), para que possa chamá-los diretamente na WebUI.
2 changes: 1 addition & 1 deletion docs/zh/finetune.md
@@ -119,7 +119,7 @@ python fish_speech/train.py --config-name text2semantic_finetune \
!!! note
对于 Windows 用户, 你可以使用 `trainer.strategy.process_group_backend=gloo` 来避免 `nccl` 的问题.

-训练结束后, 你可以参考 [推理](inference.md) 部分, 并携带 `--speaker SPK1` 参数来测试你的模型.
+训练结束后, 你可以参考 [推理](inference.md) 部分来测试你的模型.

!!! info
默认配置下, 基本只会学到说话人的发音方式, 而不包含音色, 你依然需要使用 prompt 来保证音色的稳定性.
9 changes: 6 additions & 3 deletions docs/zh/inference.md
@@ -79,7 +79,7 @@ python -m tools.api \
--decoder-checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```
-如果你想要加速推理,可以加上`--compile`参数。
+> 如果你想要加速推理,可以加上`--compile`参数。
推荐中国大陆用户运行以下命令来启动 HTTP 服务:
```bash
@@ -100,8 +100,7 @@ python -m tools.post_api \

上面的命令表示按照参考音频的信息,合成所需的音频并流式返回.

-下面的示例展示了, 可以一次使用**多个** `参考音频路径``参考音频的文本内容`。在命令里用空格隔开即可。
-
+下面的示例展示了, 可以一次使用**多个** `参考音频路径``参考音频的文本内容`。在命令里用空格隔开即可。
```bash
python -m tools.post_api \
--text "要输入的文本" \
@@ -117,6 +116,9 @@ python -m tools.post_api \
还可以用`--reference_id`(仅能用一个)来代替`--reference_audio``--reference_text`, 前提是在项目根目录下创建`references/<your reference_id>`文件夹,
里面放上任意对音频与标注文本。 目前支持的参考音频最多加起来总时长90s。

!!! info
要了解有关可用参数的更多信息,可以使用命令`python -m tools.post_api -h`

## GUI 推理
[下载客户端](https://github.com/AnyaCoder/fish-speech-gui/releases)

@@ -130,6 +132,7 @@ python -m tools.webui \
--decoder-checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```
> 如果你想要加速推理,可以加上`--compile`参数。
!!! note
你可以提前将label文件和参考音频文件保存到主目录下的 `references` 文件夹(需要自行创建),这样你可以直接在WebUI中调用它们。
5 changes: 2 additions & 3 deletions tools/api.py
@@ -47,9 +47,8 @@
 from tools.vqgan.inference import load_model as load_decoder_model
 
 backends = torchaudio.list_audio_backends()
-if "sox" in backends:
-    backend = "sox"
-elif "ffmpeg" in backends:
+
+if "ffmpeg" in backends:
     backend = "ffmpeg"
 else:
     backend = "soundfile"
1 change: 0 additions & 1 deletion tools/commons.py
@@ -30,7 +30,6 @@ class ServeTTSRequest(BaseModel):
     latency: Literal["normal", "balanced"] = "normal"
     # not usually used below
     streaming: bool = False
-    emotion: Optional[str] = None
     max_new_tokens: int = 1024
     top_p: Annotated[float, Field(ge=0.1, le=1.0, strict=True)] = 0.7
     repetition_penalty: Annotated[float, Field(ge=0.9, le=2.0, strict=True)] = 1.2
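
A hedged sketch of building the request model after this change; the `text` field is an assumption (it is not shown in this hunk), and the other values mirror the defaults above:

```python
# Sketch only: ServeTTSRequest no longer carries an `emotion` field.
from tools.commons import ServeTTSRequest

request = ServeTTSRequest(
    text="Hello, world.",  # assumed required field, not visible in this hunk
    latency="normal",
    streaming=False,
    max_new_tokens=1024,
    top_p=0.7,
    repetition_penalty=1.2,
)
```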
44 changes: 29 additions & 15 deletions tools/post_api.py
@@ -15,7 +15,8 @@
 def parse_args():
 
     parser = argparse.ArgumentParser(
-        description="Send a WAV file and text to a server and receive synthesized audio."
+        description="Send a WAV file and text to a server and receive synthesized audio.",
+        formatter_class=argparse.RawTextHelpFormatter,
     )
 
     parser.add_argument(
@@ -33,15 +34,15 @@ def parse_args():
         "-id",
         type=str,
         default=None,
-        help="ID of the reference model to be used for the speech",
+        help="ID of the reference model to be used for the speech\n(Local: name of folder containing audios and files)",
     )
     parser.add_argument(
         "--reference_audio",
         "-ra",
         type=str,
         nargs="+",
         default=None,
-        help="Path to the WAV file",
+        help="Path to the audio file",
     )
     parser.add_argument(
         "--reference_text",
@@ -68,17 +69,25 @@ def parse_args():
     parser.add_argument(
         "--format", type=str, choices=["wav", "mp3", "flac"], default="wav"
     )
-    parser.add_argument("--mp3_bitrate", type=int, default=64)
+    parser.add_argument(
+        "--mp3_bitrate", type=int, choices=[64, 128, 192], default=64, help="kHz"
+    )
     parser.add_argument("--opus_bitrate", type=int, default=-1000)
-    parser.add_argument("--latency", type=str, default="normal", help="延迟选项")
+    parser.add_argument(
+        "--latency",
+        type=str,
+        default="normal",
+        choices=["normal", "balanced"],
+        help="Used in api.fish.audio/v1/tts",
+    )
     parser.add_argument(
         "--max_new_tokens",
         type=int,
-        default=1024,
-        help="Maximum new tokens to generate",
+        default=0,
+        help="Maximum new tokens to generate. \n0 means no limit.",
     )
     parser.add_argument(
-        "--chunk_length", type=int, default=100, help="Chunk length for synthesis"
+        "--chunk_length", type=int, default=200, help="Chunk length for synthesis"
     )
     parser.add_argument(
         "--top_p", type=float, default=0.7, help="Top-p sampling for synthesis"
@@ -92,10 +101,7 @@ def parse_args():
     parser.add_argument(
         "--temperature", type=float, default=0.7, help="Temperature for sampling"
     )
-    parser.add_argument(
-        "--speaker", type=str, default=None, help="Speaker ID for voice synthesis"
-    )
-    parser.add_argument("--emotion", type=str, default=None, help="Speaker's Emotion")
+
     parser.add_argument(
         "--streaming", type=bool, default=False, help="Enable streaming response"
     )
@@ -107,7 +113,17 @@ def parse_args():
         "--use_memory_cache",
         type=str,
         default="never",
-        help="Cache encoded references codes in memory",
+        choices=["on-demand", "never"],
+        help="Cache encoded references codes in memory.\n"
+        "If `on-demand`, the server will use cached encodings\n "
+        "instead of encoding reference audio again.",
     )
+    parser.add_argument(
+        "--seed",
+        type=int,
+        default=None,
+        help="`None` means randomized inference, otherwise deterministic.\n"
+        "It can't be used for fixing a timbre.",
+    )
     parser.add_argument(
         "--seed",
@@ -157,8 +173,6 @@ def parse_args():
         "top_p": args.top_p,
         "repetition_penalty": args.repetition_penalty,
         "temperature": args.temperature,
-        "speaker": args.speaker,
-        "emotion": args.emotion,
         "streaming": args.streaming,
         "use_memory_cache": args.use_memory_cache,
         "seed": args.seed,
5 changes: 2 additions & 3 deletions tools/vqgan/extract_vq.py
@@ -25,9 +25,8 @@
 # It's mainly used to generate the training data for the VQ model.
 
 backends = torchaudio.list_audio_backends()
-if "sox" in backends:
-    backend = "sox"
-elif "ffmpeg" in backends:
+
+if "ffmpeg" in backends:
     backend = "ffmpeg"
 else:
     backend = "soundfile"
