Instruction

Instructions on the data format and configuration rules.

Data Preparation

APS has its own module to perform feature extraction (both ASR and enhancement/separation tasks are supported, see aps.transform), so data preparation does not take much time.
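
To make this concrete, the sketch below shows the kind of on-the-fly log-fbank extraction such a transform module performs inside the training pipeline. It uses torchaudio purely for illustration and is not the APS API; the wav path is hypothetical:

    import torchaudio

    # On-the-fly feature extraction: only the raw waveform lives on disk and
    # log-fbank features are computed inside the training pipeline (mono wav).
    wav, sr = torchaudio.load("/path/to/BAC009S0002W0126.wav")  # hypothetical
    feats = torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=80, sample_frequency=sr)  # (num_frames, 80)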

Acoustic Model

  1. kaldi feature dataloader. Since many people have used Kaldi features before, APS still provides a dataloader for Kaldi-style features. It requires feats.scp, utt2num_frames and text (or another name, depending on the tokenization method).

    feats.scp and utt2num_frames are generated by the Kaldi toolkit, e.g., steps/make_{mfcc,fbank}.sh, while text is tokenized from the transcriptions (sequences of characters, phonemes, words, word pieces, etc.), following the Kaldi format:

    BAC009S0002W0126    仅 一 个 多 月 的 时 间 里
    BAC009S0002W0127    除 了 北 京 上 海 广 州 深 圳 四 个 一 线 城 市 和 三 亚 之 外
    BAC009S0002W0128    四 十 六 个 限 购 城 市 当 中
    BAC009S0002W0129    四 十 一 个 已 正 式 取 消 或 变 相 放 松 了 限 购
    BAC009S0002W0130    财 政 金 融 政 策 紧 随 其 后 而 来
    BAC009S0002W0131    显 示 出 了 极 强 的 威 力
    

    for Chinese characters, and

    1230-139216-0040 ▁GIVE ▁POST ▁OFFICE ▁PEOPLE ▁ORDER S ▁NOT ▁TO ▁LET ▁THIS ▁OUT
    1230-139225-0004 ▁IT ▁CONVEY S ▁NOTHING ▁TO ▁ME
    1230-139225-0007 ▁MISTER ▁PORT LE THORPE ▁LOOK ED ▁AND ▁WAS ▁STARTLE D ▁OUT ▁OF ▁HIS ▁PE E V ISH NESS
    1230-139225-0015 ▁ARE ▁ONE ▁AND ▁THE ▁SAME ▁MAN ▁OR ▁I ▁SHOULD ▁SAY
    1230-139225-0021 ▁SO ▁NOW ▁COME ▁TO ▁YOUR ▁BED S
    

    for English word pieces.

    utt2num_frames is used to sort the utterances so that each mini-batch minimizes the padding size. APS adopts the author's kaldi-python-io package as the backend for Kaldi I/O.

  2. raw waveform dataloader (recommended by the author). It requires wav.scp, utt2dur and text.

    wav.scp follows the definition in Kaldi, e.g.,

    BAC009S0002W0126    /scratch/jwu/aishell_v1/train/BAC009S0002W0126.wav
    BAC009S0002W0127    /scratch/jwu/aishell_v1/train/BAC009S0002W0127.wav
    BAC009S0002W0128    /scratch/jwu/aishell_v1/train/BAC009S0002W0128.wav
    BAC009S0002W0129    /scratch/jwu/aishell_v1/train/BAC009S0002W0129.wav
    BAC009S0002W0130    /scratch/jwu/aishell_v1/train/BAC009S0002W0130.wav
    BAC009S0002W0131    /scratch/jwu/aishell_v1/train/BAC009S0002W0131.wav
    

    and utt2dur gives the duration (in seconds) of each utterance, e.g.:

    BAC009S0002W0126    2.5760
    BAC009S0002W0127    5.8900
    BAC009S0002W0128    3.3930
    BAC009S0002W0129    6.1360
    BAC009S0002W0130    4.1300
    BAC009S0002W0131    4.4060
    

    We provide the script scripts/get_wav_dur.sh to generate the duration file. Note that APS also supports "pipe" and "archive" style input, so wav.scp files like:

    FAEM0_SI762 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SI762.WAV -t wav - |
    FAEM0_SX132 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX132.WAV -t wav - |
    FAEM0_SX222 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX222.WAV -t wav - |
    FAEM0_SX312 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX312.WAV -t wav - |
    FAEM0_SX402 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX402.WAV -t wav - |
    

    and

    100-121669-0004 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:1072576
    100-121669-0005 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:1556796
    100-121669-0006 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2025816
    100-121669-0007 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2306036
    100-121669-0008 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2473296
    100-121669-0009 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2591436
    

    are both OK. The script scripts/archive_wav.sh can be used to archive small audio files if needed. The text file is the same as the one used by the am@kaldi dataloader. Like utt2num_frames, utt2dur is used to sort the utterances so that each mini-batch minimizes the padding size (see the sketch below).
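
To make the padding argument concrete, here is a minimal sketch (illustration only, not the APS dataloader) that sorts utterances by duration and packs neighbors into mini-batches, so each batch pads to similar lengths:

    # Illustration only, not the APS dataloader: sorting by duration groups
    # utterances of similar length, which minimizes per-batch padding.
    def make_batches(utt2dur_path, batch_size=8):
        with open(utt2dur_path) as fd:
            utt2dur = {k: float(d) for k, d in (line.split() for line in fd)}
        # longest first, so neighboring utterances have similar durations
        keys = sorted(utt2dur, key=utt2dur.get, reverse=True)
        return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

    batches = make_batches("data/aishell_v1/train/utt2dur")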

The dictionary required by AM training is a text file where each line contains a <model-unit> <int-id> pair. The following example is from the LibriSpeech dataset, which uses 6K word pieces as the model units:

<unk> 0
<sos> 1
<eos> 2
▁THE 3
S 4
▁AND 5
ED 6
▁OF 7
▁TO 8
...
...
CORPOREAL 5998
▁CHOCOLATE 5999

utils/tokenizer.py can be used to tokenize the transcriptions and generate the dictionary.
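
For illustration (a sketch, not utils/tokenizer.py itself), mapping a tokenized transcription to integer ids with such a dictionary looks like:

    # Sketch, not utils/tokenizer.py: map model units to integer ids using
    # the <model-unit> <int-id> dictionary, falling back to <unk>.
    def load_vocab(path):
        with open(path) as fd:
            return {unit: int(idx) for unit, idx in (line.split() for line in fd)}

    vocab = load_vocab("dict")  # hypothetical dictionary path
    tokens = "▁IT ▁CONVEY S ▁NOTHING ▁TO ▁ME".split()
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]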

Enhancement/Separation Model

  1. chunk dataloader. It only requires several wave scripts, e.g., mix.scp and spk{1,2}.scp for separation tasks. Each one follows Kaldi's style. The dataloader splits the utterances into fixed-length audio chunks on-the-fly.

  2. online dataloader. It simulates and splits the audio on-the-fly and only requires one configuration script for the data simulation. Each line follows the pattern <key> <command-options>. See wav_simulate.py for details of <command-options>:

    usage: wav_simulate.py [-h] [--dump-ref-dir DUMP_REF_DIR] --src-spk SRC_SPK
                          [--src-rir SRC_RIR] [--src-sdr SRC_SDR] [--src-begin SRC_BEGIN]
                          [--point-noise POINT_NOISE] [--point-noise-rir POINT_NOISE_RIR]
                          [--point-noise-snr POINT_NOISE_SNR]
                          [--point-noise-begin POINT_NOISE_BEGIN]
                          [--point-noise-offset POINT_NOISE_OFFSET]
                          [--point-noise-repeat POINT_NOISE_REPEAT]
                          [--isotropic-noise ISOTROPIC_NOISE]
                          [--isotropic-noise-snr ISOTROPIC_NOISE_SNR]
                          [--isotropic-noise-offset ISOTROPIC_NOISE_OFFSET]
                          [--dump-channel DUMP_CHANNEL] [--norm-factor NORM_FACTOR]
                          [--sr SR]
                          mix
    

    For example:

    two_speaker_mix01  --src-spk /path/to/1462-170142-0008.wav,/path/to/1462-170142-0006.wav --src-begin 0,0 --src-sdr 3
    two_speaker_mix02  --src-spk /path/to/1462-170142-0002.wav,/path/to/1462-170142-0003.wav --src-begin 4000,0 --src-sdr 1
    two_speaker_mix03  --src-spk /path/to/1462-170142-0009.wav,/path/to/1462-170142-0008.wav --src-begin 0,900 --src-sdr -1
    two_speaker_mix04  --src-spk /path/to/1462-170142-0002.wav,/path/to/1462-170142-0001.wav --src-begin 0,1800 --src-sdr 0
    

    For more details on <command-options>, see setk's data simulation.
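
    The core of such a simulation is scaling one source against another to hit a target SDR before summing. Below is a minimal two-speaker sketch, an illustration only (not wav_simulate.py itself), which assumes single-channel sources and uses numpy and soundfile:

    import numpy as np
    import soundfile as sf

    # Illustration only, not wav_simulate.py: scale spk2 so that spk1 is
    # `sdr` dB louder, then sum, mirroring --src-sdr in the examples above.
    def mix_at_sdr(spk1, spk2, sdr):
        p1, p2 = np.mean(spk1 ** 2), np.mean(spk2 ** 2)
        scale = np.sqrt(p1 / (p2 * 10 ** (sdr / 10)))
        return spk1 + scale * spk2

    s1, sr = sf.read("/path/to/1462-170142-0008.wav")
    s2, _ = sf.read("/path/to/1462-170142-0006.wav")
    n = min(len(s1), len(s2))
    sf.write("two_speaker_mix01.wav", mix_at_sdr(s1[:n], s2[:n], sdr=3), sr)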

Experimental Configurations

Almost all the hyper-parameters are configured in the yaml files; you can check the examples in conf. The following configuration keys are allowed:

  • nnet and nnet_conf: Name of the network and its parameters. The supported networks are registered using the decorator class ApsRegisters, e.g., @ApsRegisters.asr.register("att") (refer to the code in aps/asr and aps/sse; a sketch of this registry pattern follows this list). An example for training an attention-based acoustic model:

    nnet: "att" # network's registered name
    nnet_conf:  # parameters of the network
      input_size: 80
      enc_type: "concat"
      enc_proj: 1024
      enc_kwargs:
        conv2d:
          out_features: -1
          channel: 32
          num_layers: 2
          stride: 2
          padding: 1
          kernel_size: 3
        pytorch_rnn:
          hidden: 1024
          dropout: 0.3
          num_layers: 4
          bidirectional: true
      dec_dim: 1024
      dec_kwargs:
        dec_rnn: "lstm"
        rnn_layers: 2
        rnn_hidden: 1024
        rnn_dropout: 0.2
        emb_dropout: 0.2
        dropout: 0.2
        input_feeding: true
        vocab_embeded: true
      att_type: "ctx"
      att_kwargs:
        att_dim: 1024

    and a DCCRN (denoising) network:

    nnet: dccrn
    nnet_conf:
      cplx: true
      K: "3,3;3,3;3,3;3,3;3,3;3,3;3,3"
      S: "2,1;2,1;2,1;2,1;2,1;2,1;2,1"
      P: "1,1,1,1,1,0,0"
      O: "0,0,0,0,0,0,1"
      C: "16,32,64,64,128,128,256"
      num_spks: 1
      rnn_resize: 512
      non_linear: tanh
      connection: cat
  • task and task_conf: Name of the task and its parameters. The supported tasks are decorated with @ApsRegisters.task.register("...") (see aps/task). An example for permutation invariant training with the SiSNR (scale-invariant SNR) objective function:

    task: "sisnr"
    task_conf:
      num_spks: 2
      permute: true
      zero_mean: false

    and joint CTC and cross-entropy training for an acoustic model:

    task: ctc_xent
    task_conf:
      # CTC weight
      ctc_weight: 0.2
      # label smoothing factor
      lsm_factor: 0.1
      # label smoothing method
      lsm_method: "uniform"
  • data_conf: Parameters for the dataloader. An example of the raw waveform dataloader for AM training:

    data_conf:
      fmt: "am@raw"
      loader:
        # adaptive or constraint
        batch_mode: adaptive
        max_token_num: 400
        max_dur: 30 # (s)
        min_dur: 0.4 # (s)
        # in the adaptive mode, the batch size is halved when
        #   1) #token_num > adapt_token_num
        #   2) #utt_dur   > adapt_dur
        adapt_token_num: 150
        adapt_dur: 10 # (s)
        # in the constraint mode, the batch size is the maximum
        # number of utterances satisfying #utt_dur <= batch_size
      train:
        wav_scp: "data/aishell_v1/train/wav.scp"
        utt2dur: "data/aishell_v1/train/utt2dur"
        text: "data/aishell_v1/train/text"
      valid:
        wav_scp: "data/aishell_v1/dev/wav.scp"
        utt2dur: "data/aishell_v1/dev/utt2dur"
        text: "data/aishell_v1/dev/text"

    To use Kaldi features, we have:

    data_conf:
      fmt: "am@kaldi"
      loader:
        # adaptive or constraint
        batch_mode: adaptive
        max_token_num: 400
        max_dur: 3000 #num_frames
        min_dur: 100 #num_frames
        adapt_token_num: 150
        adapt_dur: 1000 #num_frames
      train:
        feats_scp: "data/aishell_v1/train/feats.scp"
        utt2num_frames: "data/aishell_v1/train/utt2num_frames"
        text: "data/aishell_v1/train/text"
      valid:
        feats_scp: "data/aishell_v1/dev/feats.scp"
        utt2num_frames: "data/aishell_v1/dev/utt2num_frames"
        text: "data/aishell_v1/dev/text"

    And the chunk dataloader for separation model training:

    data_conf:
      fmt: "ss@chunk"
      loader:
        # chunk size (in samples)
        chunk_size: 32000
        # sample rate
        sr: 8000
      train:
        mix_scp: "data/wsj0_2mix/tr/mix.scp"
        ref_scp: "data/wsj0_2mix/tr/spk1.scp,data/wsj0_2mix/tr/spk2.scp"
      valid:
        mix_scp: "data/wsj0_2mix/cv/mix.scp"
        ref_scp: "data/wsj0_2mix/cv/spk1.scp,data/wsj0_2mix/cv/spk2.scp"
  • trainer_conf: Parameters for the Trainer class, including the optimizer, the learning rate scheduler and the schedule sampling scheduler (if needed). For example:

    trainer_conf:
      # optimizer and parameters
      optimizer: "adam"
      optimizer_kwargs:
          lr: 1.0e-3
          weight_decay: 1.0e-5
      lr_scheduler: "reduce_lr"
      # run lr_scheduler every epoch or step
      lr_scheduler_period: "epoch" # or "step"
      lr_scheduler_kwargs:
          min_lr: 1.0e-8
          patience: 1
          factor: 0.5
      # schedule sampling, for AM training only
      ss_scheduler: "linear"
      ss_scheduler_kwargs:
          ssr: 0.2
          epoch_beg: 10
          epoch_end: 26
          update_interval: 4
      # gradient clipping (norm)
      clip_gradient: 5
      # for early stop detection
      no_impr: 6
      no_impr_thres: 0.01
      # report metrics on validation epoch
      report_metrics: ["loss", "accu", "@ctc"]
      stop_criterion: "accu"
      average_checkpoint: false
  • enh_transform: Feature configuration for enhancement/separation tasks. Refer to aps/transform/enh.py for all the supported parameters. An example that uses log spectrogram features concatenated with cos-IPDs:

    enh_transform:
      feats: spectrogram-log-ipd
      frame_len: 512
      frame_hop: 256
      window: sqrthann
      center: true
      # librosa or kaldi
      mode: librosa
      ipd_index: 0,1;0,2;0,3;0,4
      cos_ipd: true
  • asr_transform: Feature configuration for ASR tasks. Refer to aps/transform/asr.py for all the supported parameters. For example, to adopt log-fbank features with spectral augmentation (SpecAugment) and utterance-level mean-variance normalization, we have:

    asr_transform:
      feats: perturb-fbank-log-cmvn-aug
      frame_len: 400
      frame_hop: 160
      window: hamm
      center: false
      pre_emphasis: 0.97
      mode: librosa
      round_pow_of_two: true
      sr: 16000
      num_mels: 80
      norm_mean: true
      norm_var: true
      speed_perturb: 0.9,1.0,1.1
      aug_prob: 1
      aug_time_args: [100, 1]
      aug_freq_args: [27, 1]
      aug_mask_zero: true

    Note that the above configuration is used with the raw waveform dataloader. For Kaldi-format features, the configuration is simpler (it only needs to apply CMVN and SpecAugment):

    asr_transform:
      feats: cmvn-aug
      sr: 16000
      norm_mean: true
      norm_var: true
      gcmvn: data/aishell_v1/train/cmvn.ark
      aug_prob: 1
      aug_time_args: [100, 1]
      aug_freq_args: [27, 1]
      aug_mask_zero: true
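
As noted above, here is a minimal sketch of the registry pattern behind ApsRegisters (a simplified re-implementation for illustration, not the APS source): a decorator records each network class under a string key, and the trainer instantiates the class named by the nnet key with nnet_conf as its parameters.

    import yaml

    # Simplified illustration of the registry pattern, not the APS source.
    _NNET_REGISTRY = {}

    def register(name):
        def wrapper(cls):
            _NNET_REGISTRY[name] = cls  # record the class under its name
            return cls
        return wrapper

    @register("att")
    class AttAsr:  # stand-in for the real attention-based AM
        def __init__(self, **kwargs):
            self.kwargs = kwargs

    # hypothetical configuration path: "nnet" selects the class and
    # "nnet_conf" supplies its constructor arguments
    with open("conf/aishell_v1/1a.yaml") as fd:
        conf = yaml.safe_load(fd)
    nnet = _NNET_REGISTRY[conf["nnet"]](**conf["nnet_conf"])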

The other options, e.g., the batch size and the number of epochs, are passed as script arguments. See aps/scripts/train.sh and aps/scripts/distributed_train.sh.