
[Ready] [Recipes] add aishell2 #465

Merged: 14 commits, Jul 14, 2022
19 changes: 19 additions & 0 deletions egs/aishell2/ASR/README.md

# Introduction

This recipe contains various ASR models trained with Aishell2.

[./RESULTS.md](./RESULTS.md) contains the latest results.

# Transducers

There are several folders in this directory whose names contain `transducer`.
The following table lists the differences among them.

| | Encoder | Decoder | Comment |
|---------------------------------------|---------------------|--------------------|-----------------------------|
| `pruned_transducer_stateless5` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless5 in librispeech recipe |

The decoder in `pruned_transducer_stateless5` is modified from the paper
[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
We place an additional Conv1d layer right after the input embedding layer.
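Such a stateless decoder can be sketched as follows. This is a minimal illustration of the idea (embedding followed by a depthwise Conv1d over the last `context_size` symbols), not icefall's actual implementation; all names here are hypothetical.

```python
import torch
import torch.nn as nn


class StatelessDecoder(nn.Module):
    """Embedding + Conv1d over the last `context_size` predicted
    symbols; no recurrent state is kept."""

    def __init__(self, vocab_size: int, embed_dim: int, context_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(
            embed_dim,
            embed_dim,
            kernel_size=context_size,
            groups=embed_dim,  # depthwise: one filter per channel
        )
        self.context_size = context_size

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, seq_len) of symbol IDs
        embed = self.embedding(y).permute(0, 2, 1)  # (B, D, T)
        # left-pad so each output frame only sees past symbols
        embed = nn.functional.pad(embed, (self.context_size - 1, 0))
        out = self.conv(embed).permute(0, 2, 1)  # (B, T, D)
        return torch.relu(out)


decoder = StatelessDecoder(vocab_size=500, embed_dim=512, context_size=2)
out = decoder(torch.randint(0, 500, (4, 10)))
print(out.shape)  # torch.Size([4, 10, 512])
```

Because the prediction network depends only on a fixed window of previous symbols, decoding needs no RNN state, which is what makes beam-search variants cheap.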
89 changes: 89 additions & 0 deletions egs/aishell2/ASR/RESULTS.md
## Results

### Aishell2 char-based training results (Pruned Transducer 5)

#### 2022-07-11

Using the code from this PR: https://github.com/k2-fsa/icefall/pull/465.

When trained with a context size of 1, the WERs are:

| | dev-ios | test-ios | comment |
|------------------------------------|-------|----------|----------------------------------|
| greedy search | 5.57 | 5.89 | --epoch 25, --avg 5, --max-duration 600 |
| modified beam search (beam size 4) | 5.32 | 5.56 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search (set as default) | 5.5 | 5.78 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search nbest | 5.46 | 5.74 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search oracle | 1.92 | 2.2 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search nbest LG | 5.59 | 5.93 | --epoch 25, --avg 5, --max-duration 600 |
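
Greedy search here emits at most one symbol per frame (`--max-sym-per-frame 1` below). A toy sketch of that loop, with stand-in encoder output and dummy decoder/joiner callables (hypothetical, not icefall's API):

```python
import torch


def greedy_search(encoder_out, decoder, joiner, blank_id=0, context_size=1):
    """Transducer greedy decoding, at most one non-blank symbol per frame."""
    hyp = [blank_id] * context_size  # prime the decoder context with blanks
    for t in range(encoder_out.size(0)):
        context = torch.tensor([hyp[-context_size:]])
        dec_out = decoder(context)
        logits = joiner(encoder_out[t : t + 1], dec_out)
        sym = int(logits.argmax())
        if sym != blank_id:  # blank means "advance to the next frame"
            hyp.append(sym)
    return hyp[context_size:]


# toy stand-ins: 4 frames, vocab of 3 symbols (0 = blank)
torch.manual_seed(0)
enc = torch.randn(4, 8)
dec = lambda ctx: torch.randn(1, 8)            # pretend decoder output
joiner = lambda e, d: (e + d).sum(dim=0)[:3]   # pretend 3-way logits
print(greedy_search(enc, dec, joiner))
```

The beam-search variants in the table explore multiple hypotheses per frame instead of committing to the argmax, which is where their small WER gains over greedy search come from.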

The training command to reproduce the above results is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./pruned_transducer_stateless5/train.py \
--world-size 4 \
--lang-dir data/lang_char \
--num-epochs 40 \
--start-epoch 1 \
--exp-dir /result \
--max-duration 300 \
--use-fp16 0 \
--num-encoder-layers 24 \
--dim-feedforward 1536 \
--nhead 8 \
--encoder-dim 384 \
--decoder-dim 512 \
--joiner-dim 512 \
--context-size 1
```

The decoding command is:
```bash
for method in greedy_search modified_beam_search \
fast_beam_search fast_beam_search_nbest \
fast_beam_search_nbest_oracle fast_beam_search_nbest_LG; do
./pruned_transducer_stateless5/decode.py \
--epoch 25 \
--avg 5 \
--exp-dir ./pruned_transducer_stateless5/exp \
--max-duration 600 \
--decoding-method $method \
--max-sym-per-frame 1 \
--num-encoder-layers 24 \
--dim-feedforward 1536 \
--nhead 8 \
--encoder-dim 384 \
--decoder-dim 512 \
--joiner-dim 512 \
--context-size 1 \
--beam 20.0 \
--max-contexts 8 \
--max-states 64 \
--num-paths 200 \
--nbest-scale 0.5 \
--use-averaged-model True
done
```
The tensorboard training log can be found at
https://tensorboard.dev/experiment/RXyX4QjQQVKjBS2eQ2Qajg/#scalars

A pre-trained model and decoding logs can be found at <https://huggingface.co/yuekai/icefall-asr-aishell2-pruned-transducer-stateless5-B-2022-07-12>

When trained with a context size of 2, the WERs are:

| | dev-ios | test-ios | comment |
|------------------------------------|-------|----------|----------------------------------|
| greedy search | 5.47 | 5.81 | --epoch 25, --avg 5, --max-duration 600 |
| modified beam search (beam size 4) | 5.38 | 5.61 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search (set as default) | 5.36 | 5.61 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search nbest | 5.37 | 5.6 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search oracle | 2.04 | 2.2 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search nbest LG | 5.59 | 5.82 | --epoch 25, --avg 5, --max-duration 600 |

The tensorboard training log can be found at
https://tensorboard.dev/experiment/5AxJ8LHoSre8kDAuLp4L7Q/#scalars

A pre-trained model and decoding logs can be found at <https://huggingface.co/yuekai/icefall-asr-aishell2-pruned-transducer-stateless5-A-2022-07-12>
114 changes: 114 additions & 0 deletions egs/aishell2/ASR/local/compute_fbank_aishell2.py
#!/usr/bin/env python3
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This file computes fbank features of the aishell2 dataset.
It looks for manifests in the directory data/manifests.

The generated fbank features are saved in data/fbank.
"""

import argparse
import logging
import os
from pathlib import Path

import torch
from lhotse import CutSet, Fbank, FbankConfig, LilcomChunkyWriter
from lhotse.recipes.utils import read_manifests_if_cached

from icefall.utils import get_executor

# Torch's multithreaded behavior needs to be disabled, or
# it wastes a lot of CPU and slows things down.
# Do this outside of main() in case it needs to take effect
# even when we are not invoking the main (e.g. when spawning subprocesses).
torch.set_num_threads(1)
torch.set_num_interop_threads(1)


def compute_fbank_aishell2(num_mel_bins: int = 80):
    src_dir = Path("data/manifests")
    output_dir = Path("data/fbank")
    num_jobs = min(15, os.cpu_count())

    dataset_parts = (
        "train",
        "dev",
        "test",
    )
    prefix = "aishell2"
    suffix = "jsonl.gz"
    manifests = read_manifests_if_cached(
        dataset_parts=dataset_parts,
        output_dir=src_dir,
        prefix=prefix,
        suffix=suffix,
    )
    assert manifests is not None

    extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))

    with get_executor() as ex:  # Initialize the executor only once.
        for partition, m in manifests.items():
            if (output_dir / f"{prefix}_cuts_{partition}.{suffix}").is_file():
                logging.info(f"{partition} already exists - skipping.")
                continue
            logging.info(f"Processing {partition}")
            cut_set = CutSet.from_manifests(
                recordings=m["recordings"],
                supervisions=m["supervisions"],
            )
            if "train" in partition:
                cut_set = (
                    cut_set
                    + cut_set.perturb_speed(0.9)
                    + cut_set.perturb_speed(1.1)
                )
            cut_set = cut_set.compute_and_store_features(
                extractor=extractor,
                storage_path=f"{output_dir}/{prefix}_feats_{partition}",
                # when an executor is specified, make more partitions
                num_jobs=num_jobs if ex is None else 80,
                executor=ex,
                storage_type=LilcomChunkyWriter,
            )
            cut_set.to_file(output_dir / f"{prefix}_cuts_{partition}.{suffix}")


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--num-mel-bins",
        type=int,
        default=80,
        help="""The number of mel bins for Fbank""",
    )

    return parser.parse_args()


if __name__ == "__main__":
    formatter = (
        "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
    )

    logging.basicConfig(format=formatter, level=logging.INFO)

    args = get_args()
    compute_fbank_aishell2(num_mel_bins=args.num_mel_bins)
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/compute_fbank_musan.py
96 changes: 96 additions & 0 deletions egs/aishell2/ASR/local/display_manifest_statistics.py
#!/usr/bin/env python3
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This file displays duration statistics of utterances in a manifest.
You can use the displayed values to choose the minimum/maximum durations
for removing short and long utterances during training.

See the function `remove_short_and_long_utt()` in transducer_stateless/train.py
for usage.
"""


from lhotse import load_manifest_lazy


def main():
    paths = [
        "./data/fbank/aishell2_cuts_train.jsonl.gz",
        "./data/fbank/aishell2_cuts_dev.jsonl.gz",
        "./data/fbank/aishell2_cuts_test.jsonl.gz",
    ]

    for path in paths:
        print(f"Starting to display the statistics for {path}")
        cuts = load_manifest_lazy(path)
        cuts.describe()


if __name__ == "__main__":
    main()

"""
Starting to display the statistics for ./data/fbank/aishell2_cuts_train.jsonl.gz
Cuts count: 3026106
Total duration (hours): 3021.2
Speech duration (hours): 3021.2 (100.0%)
***
Duration statistics (seconds):
mean 3.6
std 1.5
min 0.3
25% 2.4
50% 3.3
75% 4.4
99% 8.2
99.5% 8.9
99.9% 10.6
max 21.5
Starting to display the statistics for ./data/fbank/aishell2_cuts_dev.jsonl.gz
Cuts count: 2500
Total duration (hours): 2.0
Speech duration (hours): 2.0 (100.0%)
***
Duration statistics (seconds):
mean 2.9
std 1.0
min 1.1
25% 2.2
50% 2.7
75% 3.4
99% 6.3
99.5% 6.7
99.9% 7.8
max 9.4
Starting to display the statistics for ./data/fbank/aishell2_cuts_test.jsonl.gz
Cuts count: 5000
Total duration (hours): 4.0
Speech duration (hours): 4.0 (100.0%)
***
Duration statistics (seconds):
mean 2.9
std 1.0
min 1.1
25% 2.2
50% 2.7
75% 3.3
99% 6.2
99.5% 6.6
99.9% 7.7
max 8.5
"""
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/prepare_char.py
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/prepare_lang.py
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/prepare_words.py
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/text2segments.py
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/text2token.py