
[Ready] [Recipes] add aishell2 #465

Merged: 14 commits, Jul 14, 2022
19 changes: 19 additions & 0 deletions egs/aishell2/ASR/README.md

# Introduction

This recipe contains various ASR models trained with Aishell2.

[./RESULTS.md](./RESULTS.md) contains the latest results.

# Transducers

There are several folders in this directory whose names contain `transducer`.
The following table lists the differences among them.

| | Encoder | Decoder | Comment |
|---------------------------------------|---------------------|--------------------|-----------------------------|
| `pruned_transducer_stateless5` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless5 in librispeech recipe |

The decoder in `pruned_transducer_stateless5` is modified from the paper
[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
We place an additional Conv1d layer right after the input embedding layer.
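Such a stateless decoder can be sketched as follows. This is a minimal illustration of the idea (embedding followed by a depthwise Conv1d over the last `context_size` symbols), not icefall's actual implementation; all names here are hypothetical.

```python
import torch
import torch.nn as nn


class StatelessDecoder(nn.Module):
    """Embedding + Conv1d over the last `context_size` predicted
    symbols; no recurrent state is kept."""

    def __init__(self, vocab_size: int, embed_dim: int, context_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(
            embed_dim,
            embed_dim,
            kernel_size=context_size,
            groups=embed_dim,  # depthwise: one filter per channel
        )
        self.context_size = context_size

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, seq_len) of symbol IDs
        embed = self.embedding(y).permute(0, 2, 1)  # (B, D, T)
        # left-pad so each output frame only sees past symbols
        embed = nn.functional.pad(embed, (self.context_size - 1, 0))
        out = self.conv(embed).permute(0, 2, 1)  # (B, T, D)
        return torch.relu(out)


decoder = StatelessDecoder(vocab_size=500, embed_dim=512, context_size=2)
out = decoder(torch.randint(0, 500, (4, 10)))
print(out.shape)  # torch.Size([4, 10, 512])
```

Because the prediction network depends only on a fixed window of previous symbols, decoding needs no RNN state, which is what makes beam-search variants cheap.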
89 changes: 89 additions & 0 deletions egs/aishell2/ASR/RESULTS.md
## Results

### Aishell2 char-based training results (Pruned Transducer 5)

#### 2022-07-11

Using the code from this PR: https://github.com/k2-fsa/icefall/pull/465.

When trained with a context size of 1, the WERs are:

| | dev-ios | test-ios | comment |
|------------------------------------|-------|----------|----------------------------------|
| greedy search | 5.57 | 5.89 | --epoch 25, --avg 5, --max-duration 600 |
| modified beam search (beam size 4) | 5.32 | 5.56 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search (set as default) | 5.5 | 5.78 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search nbest | 5.46 | 5.74 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search oracle | 1.92 | 2.2 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search nbest LG | 5.59 | 5.93 | --epoch 25, --avg 5, --max-duration 600 |
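
Greedy search here emits at most one symbol per frame (`--max-sym-per-frame 1` below). A toy sketch of that loop, with stand-in encoder output and dummy decoder/joiner callables (hypothetical, not icefall's API):

```python
import torch


def greedy_search(encoder_out, decoder, joiner, blank_id=0, context_size=1):
    """Transducer greedy decoding, at most one non-blank symbol per frame."""
    hyp = [blank_id] * context_size  # prime the decoder context with blanks
    for t in range(encoder_out.size(0)):
        context = torch.tensor([hyp[-context_size:]])
        dec_out = decoder(context)
        logits = joiner(encoder_out[t : t + 1], dec_out)
        sym = int(logits.argmax())
        if sym != blank_id:  # blank means "advance to the next frame"
            hyp.append(sym)
    return hyp[context_size:]


# toy stand-ins: 4 frames, vocab of 3 symbols (0 = blank)
torch.manual_seed(0)
enc = torch.randn(4, 8)
dec = lambda ctx: torch.randn(1, 8)            # pretend decoder output
joiner = lambda e, d: (e + d).sum(dim=0)[:3]   # pretend 3-way logits
print(greedy_search(enc, dec, joiner))
```

The beam-search variants in the table explore multiple hypotheses per frame instead of committing to the argmax, which is where their small WER gains over greedy search come from.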

The training command to reproduce the above results is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./pruned_transducer_stateless5/train.py \
--world-size 4 \
--lang-dir data/lang_char \
--num-epochs 40 \
--start-epoch 1 \
--exp-dir /result \
--max-duration 300 \
--use-fp16 0 \
--num-encoder-layers 24 \
--dim-feedforward 1536 \
--nhead 8 \
--encoder-dim 384 \
--decoder-dim 512 \
--joiner-dim 512 \
--context-size 1
```

The decoding command is:
```bash
for method in greedy_search modified_beam_search \
fast_beam_search fast_beam_search_nbest \
fast_beam_search_nbest_oracle fast_beam_search_nbest_LG; do
./pruned_transducer_stateless5/decode.py \
--epoch 25 \
--avg 5 \
--exp-dir ./pruned_transducer_stateless5/exp \
--max-duration 600 \
--decoding-method $method \
--max-sym-per-frame 1 \
--num-encoder-layers 24 \
--dim-feedforward 1536 \
--nhead 8 \
--encoder-dim 384 \
--decoder-dim 512 \
--joiner-dim 512 \
--context-size 1 \
--beam 20.0 \
--max-contexts 8 \
--max-states 64 \
--num-paths 200 \
--nbest-scale 0.5 \
--use-averaged-model True
done
```
The tensorboard training log can be found at
https://tensorboard.dev/experiment/RXyX4QjQQVKjBS2eQ2Qajg/#scalars

A pre-trained model and decoding logs can be found at <https://huggingface.co/yuekai/icefall-asr-aishell2-pruned-transducer-stateless5-B-2022-07-12>

When trained with a context size of 2, the WERs are:

| | dev-ios | test-ios | comment |
|------------------------------------|-------|----------|----------------------------------|
| greedy search | 5.47 | 5.81 | --epoch 25, --avg 5, --max-duration 600 |
| modified beam search (beam size 4) | 5.38 | 5.61 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search (set as default) | 5.36 | 5.61 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search nbest | 5.37 | 5.6 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search oracle | 2.04 | 2.2 | --epoch 25, --avg 5, --max-duration 600 |
| fast beam search nbest LG | 5.59 | 5.82 | --epoch 25, --avg 5, --max-duration 600 |

The tensorboard training log can be found at
https://tensorboard.dev/experiment/5AxJ8LHoSre8kDAuLp4L7Q/#scalars

A pre-trained model and decoding logs can be found at <https://huggingface.co/yuekai/icefall-asr-aishell2-pruned-transducer-stateless5-A-2022-07-12>
114 changes: 114 additions & 0 deletions egs/aishell2/ASR/local/compute_fbank_aishell2.py
#!/usr/bin/env python3
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This file computes fbank features of the aishell2 dataset.
It looks for manifests in the directory data/manifests.

The generated fbank features are saved in data/fbank.
"""

import argparse
import logging
import os
from pathlib import Path

import torch
from lhotse import CutSet, Fbank, FbankConfig, LilcomChunkyWriter
from lhotse.recipes.utils import read_manifests_if_cached

from icefall.utils import get_executor

# Torch's multithreaded behavior needs to be disabled, or
# it wastes a lot of CPU and slows things down.
# Do this outside of main() in case it needs to take effect
# even when we are not invoking the main (e.g. when spawning subprocesses).
torch.set_num_threads(1)
torch.set_num_interop_threads(1)


def compute_fbank_aishell2(num_mel_bins: int = 80):
    src_dir = Path("data/manifests")
    output_dir = Path("data/fbank")
    num_jobs = min(15, os.cpu_count())

    dataset_parts = (
        "train",
        "dev",
        "test",
    )
    prefix = "aishell2"
    suffix = "jsonl.gz"
    manifests = read_manifests_if_cached(
        dataset_parts=dataset_parts,
        output_dir=src_dir,
        prefix=prefix,
        suffix=suffix,
    )
    assert manifests is not None

    extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))

    with get_executor() as ex:  # Initialize the executor only once.
        for partition, m in manifests.items():
            if (output_dir / f"{prefix}_cuts_{partition}.{suffix}").is_file():
                logging.info(f"{partition} already exists - skipping.")
                continue
            logging.info(f"Processing {partition}")
            cut_set = CutSet.from_manifests(
                recordings=m["recordings"],
                supervisions=m["supervisions"],
            )
            if "train" in partition:
                cut_set = (
                    cut_set
                    + cut_set.perturb_speed(0.9)
                    + cut_set.perturb_speed(1.1)
                )
            cut_set = cut_set.compute_and_store_features(
                extractor=extractor,
                storage_path=f"{output_dir}/{prefix}_feats_{partition}",
                # when an executor is specified, make more partitions
                num_jobs=num_jobs if ex is None else 80,
                executor=ex,
                storage_type=LilcomChunkyWriter,
            )
            cut_set.to_file(output_dir / f"{prefix}_cuts_{partition}.{suffix}")


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--num-mel-bins",
        type=int,
        default=80,
        help="""The number of mel bins for Fbank""",
    )

    return parser.parse_args()


if __name__ == "__main__":
    formatter = (
        "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
    )

    logging.basicConfig(format=formatter, level=logging.INFO)

    args = get_args()
    compute_fbank_aishell2(num_mel_bins=args.num_mel_bins)
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/compute_fbank_musan.py
96 changes: 96 additions & 0 deletions egs/aishell2/ASR/local/display_manifest_statistics.py
#!/usr/bin/env python3
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This file displays duration statistics of utterances in a manifest.
You can use the displayed values to choose the minimum/maximum durations
for removing short and long utterances during training.

See the function `remove_short_and_long_utt()` in transducer_stateless/train.py
for usage.
"""


from lhotse import load_manifest_lazy


def main():
    paths = [
        "./data/fbank/aishell2_cuts_train.jsonl.gz",
        "./data/fbank/aishell2_cuts_dev.jsonl.gz",
        "./data/fbank/aishell2_cuts_test.jsonl.gz",
    ]

    for path in paths:
        print(f"Starting to display the statistics for {path}")
        cuts = load_manifest_lazy(path)
        cuts.describe()


if __name__ == "__main__":
    main()

"""
Starting to display the statistics for ./data/fbank/aishell2_cuts_train.jsonl.gz
Cuts count: 3026106
Total duration (hours): 3021.2
Speech duration (hours): 3021.2 (100.0%)
***
Duration statistics (seconds):
mean 3.6
std 1.5
min 0.3
25% 2.4
50% 3.3
75% 4.4
99% 8.2
99.5% 8.9
99.9% 10.6
max 21.5
Starting to display the statistics for ./data/fbank/aishell2_cuts_dev.jsonl.gz
Cuts count: 2500
Total duration (hours): 2.0
Speech duration (hours): 2.0 (100.0%)
***
Duration statistics (seconds):
mean 2.9
std 1.0
min 1.1
25% 2.2
50% 2.7
75% 3.4
99% 6.3
99.5% 6.7
99.9% 7.8
max 9.4
Starting to display the statistics for ./data/fbank/aishell2_cuts_test.jsonl.gz
Cuts count: 5000
Total duration (hours): 4.0
Speech duration (hours): 4.0 (100.0%)
***
Duration statistics (seconds):
mean 2.9
std 1.0
min 1.1
25% 2.2
50% 2.7
75% 3.3
99% 6.2
99.5% 6.6
99.9% 7.7
max 8.5
"""
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/prepare_char.py
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/prepare_lang.py
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/prepare_words.py
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/text2segments.py
1 change: 1 addition & 0 deletions egs/aishell2/ASR/local/text2token.py