Instruction

Instructions on the data format and configuration rules.

Data Preparation

APS has its own module to perform feature extraction (both ASR and enhancement/separation tasks are supported, see aps.transform), so data preparation does not take much time.
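
To make this concrete, the sketch below shows the kind of on-the-fly log-fbank extraction such a transform module performs inside the training pipeline. It uses torchaudio purely for illustration and is not the APS API; the wav path is hypothetical:

    import torchaudio

    # On-the-fly feature extraction: only the raw waveform lives on disk and
    # log-fbank features are computed inside the training pipeline (mono wav).
    wav, sr = torchaudio.load("/path/to/BAC009S0002W0126.wav")  # hypothetical
    feats = torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=80, sample_frequency=sr)  # (num_frames, 80)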

Acoustic Model

  1. kaldi feature dataloader. Since many people have used Kaldi features before, APS still provides a dataloader for Kaldi-style features. It requires feats.scp, utt2num_frames and text (or another name, depending on the tokenization method).

    feats.scp and utt2num_frames are generated by the Kaldi toolkit, e.g., steps/make_{mfcc,fbank}.sh, while text is tokenized from the transcriptions (sequences of characters, phonemes, words, word pieces, etc.), following the Kaldi format:

    BAC009S0002W0126    仅 一 个 多 月 的 时 间 里
    BAC009S0002W0127    除 了 北 京 上 海 广 州 深 圳 四 个 一 线 城 市 和 三 亚 之 外
    BAC009S0002W0128    四 十 六 个 限 购 城 市 当 中
    BAC009S0002W0129    四 十 一 个 已 正 式 取 消 或 变 相 放 松 了 限 购
    BAC009S0002W0130    财 政 金 融 政 策 紧 随 其 后 而 来
    BAC009S0002W0131    显 示 出 了 极 强 的 威 力
    

    for Chinese characters, and

    1230-139216-0040 ▁GIVE ▁POST ▁OFFICE ▁PEOPLE ▁ORDER S ▁NOT ▁TO ▁LET ▁THIS ▁OUT
    1230-139225-0004 ▁IT ▁CONVEY S ▁NOTHING ▁TO ▁ME
    1230-139225-0007 ▁MISTER ▁PORT LE THORPE ▁LOOK ED ▁AND ▁WAS ▁STARTLE D ▁OUT ▁OF ▁HIS ▁PE E V ISH NESS
    1230-139225-0015 ▁ARE ▁ONE ▁AND ▁THE ▁SAME ▁MAN ▁OR ▁I ▁SHOULD ▁SAY
    1230-139225-0021 ▁SO ▁NOW ▁COME ▁TO ▁YOUR ▁BED S
    

    for English word pieces.

    utt2num_frames is used to sort the utterances so that each mini-batch minimizes the padding size. APS adopts the author's kaldi-python-io package as the backend for Kaldi I/O.

  2. raw waveform dataloader (recommended by the author). It requires wav.scp, utt2dur and text.

    wav.scp follows the definition in Kaldi, e.g.,

    BAC009S0002W0126    /scratch/jwu/aishell_v1/train/BAC009S0002W0126.wav
    BAC009S0002W0127    /scratch/jwu/aishell_v1/train/BAC009S0002W0127.wav
    BAC009S0002W0128    /scratch/jwu/aishell_v1/train/BAC009S0002W0128.wav
    BAC009S0002W0129    /scratch/jwu/aishell_v1/train/BAC009S0002W0129.wav
    BAC009S0002W0130    /scratch/jwu/aishell_v1/train/BAC009S0002W0130.wav
    BAC009S0002W0131    /scratch/jwu/aishell_v1/train/BAC009S0002W0131.wav
    

    and utt2dur gives the duration (in seconds) of each utterance, e.g.:

    BAC009S0002W0126    2.5760
    BAC009S0002W0127    5.8900
    BAC009S0002W0128    3.3930
    BAC009S0002W0129    6.1360
    BAC009S0002W0130    4.1300
    BAC009S0002W0131    4.4060
    

    We provide the script scripts/get_wav_dur.sh to generate the duration file. Note that APS also supports "pipe" and "archive" style input, so wav.scp files like:

    FAEM0_SI762 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SI762.WAV -t wav - |
    FAEM0_SX132 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX132.WAV -t wav - |
    FAEM0_SX222 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX222.WAV -t wav - |
    FAEM0_SX312 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX312.WAV -t wav - |
    FAEM0_SX402 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX402.WAV -t wav - |
    

    and

    100-121669-0004 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:1072576
    100-121669-0005 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:1556796
    100-121669-0006 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2025816
    100-121669-0007 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2306036
    100-121669-0008 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2473296
    100-121669-0009 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2591436
    

    are both OK. The script scripts/archive_wav.sh can be used to archive small audio files if needed. The text file is the same as the one used by the am@kaldi dataloader. Like utt2num_frames, utt2dur is used to sort the utterances so that each mini-batch minimizes the padding size (see the sketch below).
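
To make the padding argument concrete, here is a minimal sketch (illustration only, not the APS dataloader) that sorts utterances by duration and packs neighbors into mini-batches, so each batch pads to similar lengths:

    # Illustration only, not the APS dataloader: sorting by duration groups
    # utterances of similar length, which minimizes per-batch padding.
    def make_batches(utt2dur_path, batch_size=8):
        with open(utt2dur_path) as fd:
            utt2dur = {k: float(d) for k, d in (line.split() for line in fd)}
        # longest first, so neighboring utterances have similar durations
        keys = sorted(utt2dur, key=utt2dur.get, reverse=True)
        return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

    batches = make_batches("data/aishell_v1/train/utt2dur")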

The dictionary required by AM training is a text file where each line contains a <model-unit> <int-id> pair. The following example is from the LibriSpeech dataset, which uses 6K word pieces as the model units:

<unk> 0
<sos> 1
<eos> 2
▁THE 3
S 4
▁AND 5
ED 6
▁OF 7
▁TO 8
...
...
CORPOREAL 5998
▁CHOCOLATE 5999

utils/tokenizer.py can be used to tokenize the transcriptions and generate the dictionary.
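
For illustration (a sketch, not utils/tokenizer.py itself), mapping a tokenized transcription to integer ids with such a dictionary looks like:

    # Sketch, not utils/tokenizer.py: map model units to integer ids using
    # the <model-unit> <int-id> dictionary, falling back to <unk>.
    def load_vocab(path):
        with open(path) as fd:
            return {unit: int(idx) for unit, idx in (line.split() for line in fd)}

    vocab = load_vocab("dict")  # hypothetical dictionary path
    tokens = "▁IT ▁CONVEY S ▁NOTHING ▁TO ▁ME".split()
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]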

Enhancement/Separation Model

  1. chunk dataloader. It only requires several wave scripts, e.g., mix.scp and spk{1,2}.scp for separation tasks. Each one follows Kaldi's style. The dataloader splits the utterances into fixed-length audio chunks on-the-fly.

  2. online dataloader. It simulates and splits the audio on-the-fly and only requires one configuration script for the data simulation. Each line follows the pattern <key> <command-options>. See wav_simulate.py for details of <command-options>:

    usage: wav_simulate.py [-h] [--dump-ref-dir DUMP_REF_DIR] --src-spk SRC_SPK
                          [--src-rir SRC_RIR] [--src-sdr SRC_SDR] [--src-begin SRC_BEGIN]
                          [--point-noise POINT_NOISE] [--point-noise-rir POINT_NOISE_RIR]
                          [--point-noise-snr POINT_NOISE_SNR]
                          [--point-noise-begin POINT_NOISE_BEGIN]
                          [--point-noise-offset POINT_NOISE_OFFSET]
                          [--point-noise-repeat POINT_NOISE_REPEAT]
                          [--isotropic-noise ISOTROPIC_NOISE]
                          [--isotropic-noise-snr ISOTROPIC_NOISE_SNR]
                          [--isotropic-noise-offset ISOTROPIC_NOISE_OFFSET]
                          [--dump-channel DUMP_CHANNEL] [--norm-factor NORM_FACTOR]
                          [--sr SR]
                          mix
    

    For example:

    two_speaker_mix01  --src-spk /path/to/1462-170142-0008.wav,/path/to/1462-170142-0006.wav --src-begin 0,0 --src-sdr 3
    two_speaker_mix02  --src-spk /path/to/1462-170142-0002.wav,/path/to/1462-170142-0003.wav --src-begin 4000,0 --src-sdr 1
    two_speaker_mix03  --src-spk /path/to/1462-170142-0009.wav,/path/to/1462-170142-0008.wav --src-begin 0,900 --src-sdr -1
    two_speaker_mix04  --src-spk /path/to/1462-170142-0002.wav,/path/to/1462-170142-0001.wav --src-begin 0,1800 --src-sdr 0
    

    For more details on <command-options>, see setk's data simulation.
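
    The core of such a simulation is scaling one source against another to hit a target SDR before summing. Below is a minimal two-speaker sketch, an illustration only (not wav_simulate.py itself), which assumes single-channel sources and uses numpy and soundfile:

    import numpy as np
    import soundfile as sf

    # Illustration only, not wav_simulate.py: scale spk2 so that spk1 is
    # `sdr` dB louder, then sum, mirroring --src-sdr in the examples above.
    def mix_at_sdr(spk1, spk2, sdr):
        p1, p2 = np.mean(spk1 ** 2), np.mean(spk2 ** 2)
        scale = np.sqrt(p1 / (p2 * 10 ** (sdr / 10)))
        return spk1 + scale * spk2

    s1, sr = sf.read("/path/to/1462-170142-0008.wav")
    s2, _ = sf.read("/path/to/1462-170142-0006.wav")
    n = min(len(s1), len(s2))
    sf.write("two_speaker_mix01.wav", mix_at_sdr(s1[:n], s2[:n], sdr=3), sr)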

Experimental Configurations

Almost all the hyper-parameters are configured in the yaml files; you can check the examples in conf. The following configuration keys are allowed:

  • nnet and nnet_conf: Name of the network and its parameters. The supported networks are registered using the decorator class ApsRegisters, e.g., @ApsRegisters.asr.register("att") (refer to the code in aps/asr and aps/sse; a sketch of this registry pattern follows this list). An example for training an attention-based acoustic model:

    nnet: "att" # network's registered name
    nnet_conf:  # parameters of the network
      input_size: 80
      enc_type: "concat"
      enc_proj: 1024
      enc_kwargs:
        conv2d:
          out_features: -1
          channel: 32
          num_layers: 2
          stride: 2
          padding: 1
          kernel_size: 3
        pytorch_rnn:
          hidden: 1024
          dropout: 0.3
          num_layers: 4
          bidirectional: true
      dec_dim: 1024
      dec_kwargs:
        dec_rnn: "lstm"
        rnn_layers: 2
        rnn_hidden: 1024
        rnn_dropout: 0.2
        emb_dropout: 0.2
        dropout: 0.2
        input_feeding: true
        vocab_embeded: true
      att_type: "ctx"
      att_kwargs:
        att_dim: 1024

    and a DCCRN (denoising) network:

    nnet: dccrn
    nnet_conf:
      cplx: true
      K: "3,3;3,3;3,3;3,3;3,3;3,3;3,3"
      S: "2,1;2,1;2,1;2,1;2,1;2,1;2,1"
      P: "1,1,1,1,1,0,0"
      O: "0,0,0,0,0,0,1"
      C: "16,32,64,64,128,128,256"
      num_spks: 1
      rnn_resize: 512
      non_linear: tanh
      connection: cat
  • task and task_conf: Name of the task and its parameters. The supported tasks are decorated with @ApsRegisters.task.register("...") (see aps/task). An example for permutation invariant training with the SiSNR (scale-invariant SNR) objective function:

    task: "sisnr"
    task_conf:
      num_spks: 2
      permute: true
      zero_mean: false

    and joint CTC and cross-entropy training for an acoustic model:

    task: ctc_xent
    task_conf:
      # CTC weight
      ctc_weight: 0.2
      # label smoothing factor
      lsm_factor: 0.1
      # label smoothing method
      lsm_method: "uniform"
  • data_conf: Parameters for the dataloader. An example of the raw waveform dataloader for AM training:

    data_conf:
      fmt: "am@raw"
      loader:
        # adaptive or constraint
        batch_mode: adaptive
        max_token_num: 400
        max_dur: 30 # (s)
        min_dur: 0.4 # (s)
        # in the adaptive mode, the batch size is halved when
        #   1) #token_num > adapt_token_num
        #   2) #utt_dur   > adapt_dur
        adapt_token_num: 150
        adapt_dur: 10 # (s)
        # in the constraint mode, the batch size is the maximum
        # number of utterances satisfying #utt_dur <= batch_size
      train:
        wav_scp: "data/aishell_v1/train/wav.scp"
        utt2dur: "data/aishell_v1/train/utt2dur"
        text: "data/aishell_v1/train/text"
      valid:
        wav_scp: "data/aishell_v1/dev/wav.scp"
        utt2dur: "data/aishell_v1/dev/utt2dur"
        text: "data/aishell_v1/dev/text"

    To use Kaldi features, we have:

    data_conf:
      fmt: "am@kaldi"
      loader:
        # adaptive or constraint
        batch_mode: adaptive
        max_token_num: 400
        max_dur: 3000 #num_frames
        min_dur: 100 #num_frames
        adapt_token_num: 150
        adapt_dur: 1000 #num_frames
      train:
        feats_scp: "data/aishell_v1/train/feats.scp"
        utt2num_frames: "data/aishell_v1/train/utt2num_frames"
        text: "data/aishell_v1/train/text"
      valid:
        feats_scp: "data/aishell_v1/dev/feats.scp"
        utt2num_frames: "data/aishell_v1/dev/utt2num_frames"
        text: "data/aishell_v1/dev/text"

    And the chunk dataloader for separation model training:

    data_conf:
      fmt: "ss@chunk"
      loader:
        # chunk size (in samples)
        chunk_size: 32000
        # sample rate
        sr: 8000
      train:
        mix_scp: "data/wsj0_2mix/tr/mix.scp"
        ref_scp: "data/wsj0_2mix/tr/spk1.scp,data/wsj0_2mix/tr/spk2.scp"
      valid:
        mix_scp: "data/wsj0_2mix/cv/mix.scp"
        ref_scp: "data/wsj0_2mix/cv/spk1.scp,data/wsj0_2mix/cv/spk2.scp"
  • trainer_conf: Parameters for the Trainer class, including the optimizer, the learning rate scheduler and the schedule sampling scheduler (if needed). For example:

    trainer_conf:
      # optimizer and parameters
      optimizer: "adam"
      optimizer_kwargs:
          lr: 1.0e-3
          weight_decay: 1.0e-5
      lr_scheduler: "reduce_lr"
      # run lr_scheduler every epoch or step
      lr_scheduler_period: "epoch" # or "step"
      lr_scheduler_kwargs:
          min_lr: 1.0e-8
          patience: 1
          factor: 0.5
      # schedule sampling, for AM training only
      ss_scheduler: "linear"
      ss_scheduler_kwargs:
          ssr: 0.2
          epoch_beg: 10
          epoch_end: 26
          update_interval: 4
      # gradient clipping (norm)
      clip_gradient: 5
      # for early stop detection
      no_impr: 6
      no_impr_thres: 0.01
      # report metrics on validation epoch
      report_metrics: ["loss", "accu", "@ctc"]
      stop_criterion: "accu"
      average_checkpoint: false
  • enh_transform: Feature configuration for enhancement/separation tasks. Refer to aps/transform/enh.py for all the supported parameters. An example that uses log spectrogram features concatenated with cos-IPDs:

    enh_transform:
      feats: spectrogram-log-ipd
      frame_len: 512
      frame_hop: 256
      window: sqrthann
      center: true
      # librosa or kaldi
      mode: librosa
      ipd_index: 0,1;0,2;0,3;0,4
      cos_ipd: true
  • asr_transform: Feature configuration for ASR tasks. Refer to aps/transform/asr.py for all the supported parameters. For example, to adopt log-fbank features with spectral augmentation (SpecAugment) and utterance-level mean-variance normalization, we have:

    asr_transform:
      feats: perturb-fbank-log-cmvn-aug
      frame_len: 400
      frame_hop: 160
      window: hamm
      center: false
      pre_emphasis: 0.97
      mode: librosa
      round_pow_of_two: true
      sr: 16000
      num_mels: 80
      norm_mean: true
      norm_var: true
      speed_perturb: 0.9,1.0,1.1
      aug_prob: 1
      aug_time_args: [100, 1]
      aug_freq_args: [27, 1]
      aug_mask_zero: true

    Note that the above configuration is used with the raw waveform dataloader. For Kaldi-format features, the configuration is simpler (it only needs to apply CMVN and SpecAugment):

    asr_transform:
      feats: cmvn-aug
      sr: 16000
      norm_mean: true
      norm_var: true
      gcmvn: data/aishell_v1/train/cmvn.ark
      aug_prob: 1
      aug_time_args: [100, 1]
      aug_freq_args: [27, 1]
      aug_mask_zero: true
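
As noted above, here is a minimal sketch of the registry pattern behind ApsRegisters (a simplified re-implementation for illustration, not the APS source): a decorator records each network class under a string key, and the trainer instantiates the class named by the nnet key with nnet_conf as its parameters.

    import yaml

    # Simplified illustration of the registry pattern, not the APS source.
    _NNET_REGISTRY = {}

    def register(name):
        def wrapper(cls):
            _NNET_REGISTRY[name] = cls  # record the class under its name
            return cls
        return wrapper

    @register("att")
    class AttAsr:  # stand-in for the real attention-based AM
        def __init__(self, **kwargs):
            self.kwargs = kwargs

    # hypothetical configuration path: "nnet" selects the class and
    # "nnet_conf" supplies its constructor arguments
    with open("conf/aishell_v1/1a.yaml") as fd:
        conf = yaml.safe_load(fd)
    nnet = _NNET_REGISTRY[conf["nnet"]](**conf["nnet_conf"])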

The other options, e.g., the batch size and the number of epochs, are passed as script arguments. See aps/scripts/train.sh and aps/scripts/distributed_train.sh.