Instructions on the data format & configuration rules.
APS has its own module for feature extraction (both ASR and enhancement/separation tasks are supported, see aps.transform), so data preparation does not take much time.
- kaldi feature dataloader: Considering that many people have used Kaldi features before, APS still provides a dataloader that supports kaldi-style features. It requires `feats.scp`, `utt2num_frames` and `text` (or another name, depending on the tokenization method). `feats.scp` and `utt2num_frames` are generated by the Kaldi toolkit, e.g., `steps/make_{mfcc,fbank}.sh`, while `text` is tokenized from the transcriptions (sequences of characters, phonemes, words, word pieces, etc.), following the Kaldi format:

  ```
  BAC009S0002W0126 仅 一 个 多 月 的 时 间 里
  BAC009S0002W0127 除 了 北 京 上 海 广 州 深 圳 四 个 一 线 城 市 和 三 亚 之 外
  BAC009S0002W0128 四 十 六 个 限 购 城 市 当 中
  BAC009S0002W0129 四 十 一 个 已 正 式 取 消 或 变 相 放 松 了 限 购
  BAC009S0002W0130 财 政 金 融 政 策 紧 随 其 后 而 来
  BAC009S0002W0131 显 示 出 了 极 强 的 威 力
  ```

  for Chinese characters and

  ```
  1230-139216-0040 ▁GIVE ▁POST ▁OFFICE ▁PEOPLE ▁ORDER S ▁NOT ▁TO ▁LET ▁THIS ▁OUT
  1230-139225-0004 ▁IT ▁CONVEY S ▁NOTHING ▁TO ▁ME
  1230-139225-0007 ▁MISTER ▁PORT LE THORPE ▁LOOK ED ▁AND ▁WAS ▁STARTLE D ▁OUT ▁OF ▁HIS ▁PE E V ISH NESS
  1230-139225-0015 ▁ARE ▁ONE ▁AND ▁THE ▁SAME ▁MAN ▁OR ▁I ▁SHOULD ▁SAY
  1230-139225-0021 ▁SO ▁NOW ▁COME ▁TO ▁YOUR ▁BED S
  ```

  for English word pieces. `utt2num_frames` is used to sort utterances when forming mini-batches so that the padding size is minimized. APS adopts the author's kaldi-python-io package as the backend for Kaldi IO.
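  For instance, reading the features back in Python is only a few lines with kaldi-python-io (a minimal sketch; the dataloader wraps this internally):

  ```python
  # Sketch: iterate over kaldi-style features indexed by feats.scp
  # using kaldi-python-io (the IO backend adopted by APS)
  from kaldi_python_io import ScriptReader

  reader = ScriptReader("data/aishell_v1/train/feats.scp")
  for key, feats in reader:
      # feats: numpy matrix of shape (num_frames, feature_dim)
      print(key, feats.shape)
  ```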
- raw waveform dataloader (recommended by the author): It requires `wav.scp`, `utt2dur` and `text`. `wav.scp` follows the definition in Kaldi, e.g.:

  ```
  BAC009S0002W0126 /scratch/jwu/aishell_v1/train/BAC009S0002W0126.wav
  BAC009S0002W0127 /scratch/jwu/aishell_v1/train/BAC009S0002W0127.wav
  BAC009S0002W0128 /scratch/jwu/aishell_v1/train/BAC009S0002W0128.wav
  BAC009S0002W0129 /scratch/jwu/aishell_v1/train/BAC009S0002W0129.wav
  BAC009S0002W0130 /scratch/jwu/aishell_v1/train/BAC009S0002W0130.wav
  BAC009S0002W0131 /scratch/jwu/aishell_v1/train/BAC009S0002W0131.wav
  ```

  and `utt2dur` prescribes the duration (in seconds) of each utterance, e.g.:

  ```
  BAC009S0002W0126 2.5760
  BAC009S0002W0127 5.8900
  BAC009S0002W0128 3.3930
  BAC009S0002W0129 6.1360
  BAC009S0002W0130 4.1300
  BAC009S0002W0131 4.4060
  ```

  We provide the script `scripts/get_wav_dur.sh` to generate the duration file. Note that APS also supports "pipe" or "archive" style input, so `wav.scp` entries like:

  ```
  FAEM0_SI762 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SI762.WAV -t wav - |
  FAEM0_SX132 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX132.WAV -t wav - |
  FAEM0_SX222 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX222.WAV -t wav - |
  FAEM0_SX312 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX312.WAV -t wav - |
  FAEM0_SX402 sox /scratch/jwu/TIMIT-LDC93S1/TIMIT/TRAIN/DR2/FAEM0/SX402.WAV -t wav - |
  ```

  and

  ```
  100-121669-0004 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:1072576
  100-121669-0005 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:1556796
  100-121669-0006 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2025816
  100-121669-0007 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2306036
  100-121669-0008 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2473296
  100-121669-0009 /scratch/jwu/LibriSpeech/train-clean-360/wav.1.ark:2591436
  ```

  are both OK. The script `scripts/archive_wav.sh` can be used to archive small audio files if needed. The `text` file is the same as the one used by the kaldi feature dataloader. `utt2dur` is also used to sort utterances when forming mini-batches so that the padding size is minimized.
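  If you prefer plain Python over the shell script, a duration file can also be produced with the soundfile package (an illustrative sketch, not the actual `scripts/get_wav_dur.sh`; it only handles plain path entries, not the pipe or archive styles):

  ```python
  # Sketch: derive utt2dur from wav.scp by reading the wave headers
  import soundfile as sf

  with open("wav.scp") as scp, open("utt2dur", "w") as dur:
      for line in scp:
          key, path = line.strip().split(maxsplit=1)
          info = sf.info(path)  # parses the header only, no decoding
          dur.write(f"{key} {info.duration:.4f}\n")
  ```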
The dictionary required by AM training is a text file where each line contains a `<model-unit> <int-id>` pair. The following example is from the LibriSpeech dataset, which uses 6K word pieces as the model units:
```
<unk> 0
<sos> 1
<eos> 2
▁THE 3
S 4
▁AND 5
ED 6
▁OF 7
▁TO 8
...
...
CORPOREAL 5998
▁CHOCOLATE 5999
```
`utils/tokenizer.py` can be used to tokenize the transcriptions and generate the dictionary.
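Loading such a dictionary to map tokens to integer ids is straightforward (a minimal sketch; the dictionary path below is a placeholder):

```python
# Sketch: read the <model-unit> <int-id> dictionary into a Python dict
def load_dict(path: str) -> dict:
    vocab = {}
    with open(path, encoding="utf-8") as fd:
        for line in fd:
            unit, idx = line.split()
            vocab[unit] = int(idx)
    return vocab

vocab = load_dict("dict")  # placeholder path
# map a tokenized transcription to ids, falling back to <unk>
ids = [vocab.get(tok, vocab["<unk>"]) for tok in "▁THE S ▁AND".split()]
```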
- chunk dataloader: It only requires several wave scripts, e.g., `mix.scp`, `spk{1,2}.scp` for separation tasks, each following Kaldi's style. The dataloader splits the utterances into fixed-length audio chunks on-the-fly, as sketched below.
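  Conceptually, the chunking works as follows (an illustrative sketch, not the APS internals):

  ```python
  # Sketch: draw a fixed-length chunk from an utterance at a random offset
  import numpy as np

  def random_chunk(wav: np.ndarray, chunk_size: int = 32000) -> np.ndarray:
      if wav.size < chunk_size:
          # pad short utterances up to the chunk size
          return np.pad(wav, (0, chunk_size - wav.size))
      beg = np.random.randint(0, wav.size - chunk_size + 1)
      return wav[beg:beg + chunk_size]
  ```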
- online dataloader: It simulates and splits the audio on-the-fly and only requires one configuration script for the data simulation. The format of each line follows the pattern `<key> <command-options>`. See `wav_simulate.py` for details of the `<command-options>`:

  ```
  usage: wav_simulate.py [-h] [--dump-ref-dir DUMP_REF_DIR] --src-spk SRC_SPK
                         [--src-rir SRC_RIR] [--src-sdr SRC_SDR]
                         [--src-begin SRC_BEGIN] [--point-noise POINT_NOISE]
                         [--point-noise-rir POINT_NOISE_RIR]
                         [--point-noise-snr POINT_NOISE_SNR]
                         [--point-noise-begin POINT_NOISE_BEGIN]
                         [--point-noise-offset POINT_NOISE_OFFSET]
                         [--point-noise-repeat POINT_NOISE_REPEAT]
                         [--isotropic-noise ISOTROPIC_NOISE]
                         [--isotropic-noise-snr ISOTROPIC_NOISE_SNR]
                         [--isotropic-noise-offset ISOTROPIC_NOISE_OFFSET]
                         [--dump-channel DUMP_CHANNEL]
                         [--norm-factor NORM_FACTOR] [--sr SR]
                         mix
  ```

  For example:

  ```
  two_speaker_mix01 --src-spk /path/to/1462-170142-0008.wav,/path/to/1462-170142-0006.wav --src-begin 0,0 --src-sdr 3
  two_speaker_mix02 --src-spk /path/to/1462-170142-0002.wav,/path/to/1462-170142-0003.wav --src-begin 4000,0 --src-sdr 1
  two_speaker_mix03 --src-spk /path/to/1462-170142-0009.wav,/path/to/1462-170142-0008.wav --src-begin 0,900 --src-sdr -1
  two_speaker_mix04 --src-spk /path/to/1462-170142-0002.wav,/path/to/1462-170142-0001.wav --src-begin 0,1800 --src-sdr 0
  ```

  For more details on `<command-options>`, see setk's data simulation.
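  At its core, an option like `--src-sdr` amounts to rescaling one source against the other before summation. A rough sketch of that logic (illustrative only, assuming equal-length single-channel sources; `wav_simulate.py` additionally handles RIRs, noises and begin offsets):

  ```python
  # Sketch: mix two sources at a target SDR (dB) by rescaling src2
  import numpy as np

  def mix_at_sdr(src1: np.ndarray, src2: np.ndarray, sdr: float) -> np.ndarray:
      p1, p2 = np.mean(src1 ** 2), np.mean(src2 ** 2)
      # choose scale so that 10 * log10(p1 / (scale^2 * p2)) == sdr
      scale = np.sqrt(p1 / (p2 * 10 ** (sdr / 10)))
      return src1 + scale * src2
  ```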
Almost all the hyper-parameters are configured in yaml files. You can check the examples in conf. The following configuration keys are allowed:
- `nnet` and `nnet_conf`: Name of the network and its parameters. Several supported networks have been registered using the decorator class `ApsRegisters`, e.g., `@ApsRegisters.asr.register("att")` (refer to the original code in aps/asr and aps/sse). An example to train an attention based acoustic model:

  ```yaml
  nnet: "att" # network's registered name
  nnet_conf: # parameters of the network
    input_size: 80
    enc_type: "concat"
    enc_proj: 1024
    enc_kwargs:
      conv2d:
        out_features: -1
        channel: 32
        num_layers: 2
        stride: 2
        padding: 1
        kernel_size: 3
      pytorch_rnn:
        hidden: 1024
        dropout: 0.3
        num_layers: 4
        bidirectional: true
    dec_dim: 1024
    dec_kwargs:
      dec_rnn: "lstm"
      rnn_layers: 2
      rnn_hidden: 1024
      rnn_dropout: 0.2
      emb_dropout: 0.2
      dropout: 0.2
      input_feeding: true
      vocab_embeded: true
    att_type: "ctx"
    att_kwargs:
      att_dim: 1024
  ```

  and a DCCRN (denoising) network:

  ```yaml
  nnet: dccrn
  nnet_conf:
    cplx: true
    K: "3,3;3,3;3,3;3,3;3,3;3,3;3,3"
    S: "2,1;2,1;2,1;2,1;2,1;2,1;2,1"
    P: "1,1,1,1,1,0,0"
    O: "0,0,0,0,0,0,1"
    C: "16,32,64,64,128,128,256"
    num_spks: 1
    rnn_resize: 512
    non_linear: tanh
    connection: cat
  ```
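  The registration mechanism itself boils down to a decorator that fills a name-to-class dictionary. A simplified sketch of the idea (the class and variable names here are illustrative, not the APS source):

  ```python
  # Sketch: a registry that maps the "nnet" string in the yaml to a class
  class Register(dict):
      def register(self, name: str):
          def wrapper(cls):
              self[name] = cls
              return cls
          return wrapper

  asr_nnets = Register()

  @asr_nnets.register("att")
  class AttASR:
      ...

  # at training time the "nnet" key selects the network class
  nnet_cls = asr_nnets["att"]
  ```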
- `task` and `task_conf`: Name of the task and its parameters. The supported tasks have been decorated using `@ApsRegisters.task.register("...")` (see aps/task). An example for permutation invariant training using the SiSNR (scale-invariant SNR) objective function:

  ```yaml
  task: "sisnr"
  task_conf:
    num_spks: 2
    permute: true
    zero_mean: false
  ```

  and joint CTC and cross-entropy training for an acoustic model:

  ```yaml
  task: ctc_xent
  task_conf:
    # CTC weight
    ctc_weight: 0.2
    # label smoothing factor
    lsm_factor: 0.1
    # label smoothing method
    lsm_method: "uniform"
  ```
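  For reference, the SiSNR metric named above can be written in a few lines (a numpy sketch of the standard definition; the actual APS loss lives in aps/task and is implemented in torch):

  ```python
  # Sketch: scale-invariant SNR between an estimate and a reference
  import numpy as np

  def sisnr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
      # project the estimate onto the reference (the "target" component)
      s_target = np.dot(est, ref) * ref / (np.dot(ref, ref) + eps)
      e_noise = est - s_target
      return 10 * np.log10(np.dot(s_target, s_target) /
                           (np.dot(e_noise, e_noise) + eps))
  ```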
- `data_conf`: Parameters for the dataloader. An example of the raw waveform dataloader for AM training:

  ```yaml
  data_conf:
    fmt: "am@raw"
    loader:
      # adaptive or constraint
      batch_mode: adaptive
      max_token_num: 400
      max_dur: 30 # (s)
      min_dur: 0.4 # (s)
      # for the adaptive mode, the batch number halves when
      # 1) #token_num > adapt_token_num
      # 2) #utt_dur > adapt_dur
      adapt_token_num: 150
      adapt_dur: 10 # (s)
      # for the constraint mode, the batch number is the
      # maximum number that satisfies #utt_dur <= batch_size
    train:
      wav_scp: "data/aishell_v1/train/wav.scp"
      utt2dur: "data/aishell_v1/train/utt2dur"
      text: "data/aishell_v1/train/text"
    valid:
      wav_scp: "data/aishell_v1/dev/wav.scp"
      utt2dur: "data/aishell_v1/dev/utt2dur"
      text: "data/aishell_v1/dev/text"
  ```

  To use kaldi features, we have:

  ```yaml
  data_conf:
    fmt: "am@kaldi"
    loader:
      # adaptive or constraint
      batch_mode: adaptive
      max_token_num: 400
      max_dur: 3000 # num_frames
      min_dur: 100 # num_frames
      adapt_token_num: 150
      adapt_dur: 1000 # num_frames
    train:
      feats_scp: "data/aishell_v1/train/feats.scp"
      utt2num_frames: "data/aishell_v1/train/utt2num_frames"
      text: "data/aishell_v1/train/text"
    valid:
      feats_scp: "data/aishell_v1/dev/feats.scp"
      utt2num_frames: "data/aishell_v1/dev/utt2num_frames"
      text: "data/aishell_v1/dev/text"
  ```

  And the chunk dataloader for separation model training:

  ```yaml
  data_conf:
    fmt: "ss@chunk"
    loader:
      # chunk size (in samples)
      chunk_size: 32000
      # sample rate
      sr: 8000
    train:
      mix_scp: "data/wsj0_2mix/tr/mix.scp"
      ref_scp: "data/wsj0_2mix/tr/spk1.scp,data/wsj0_2mix/tr/spk2.scp"
    valid:
      mix_scp: "data/wsj0_2mix/cv/mix.scp"
      ref_scp: "data/wsj0_2mix/cv/spk1.scp,data/wsj0_2mix/cv/spk2.scp"
  ```
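  One plausible reading of the adaptive rule described in the comments above, as a sketch (illustrative only; the real logic is in the APS dataloader and the base batch size comes from the training scripts):

  ```python
  # Sketch: halve the batch number when an utterance exceeds
  # adapt_dur or adapt_token_num
  def adaptive_batch_size(dur: float, token_num: int, base: int = 32,
                          adapt_dur: float = 10.0,
                          adapt_token_num: int = 150) -> int:
      batch = base
      if dur > adapt_dur:
          batch //= 2
      if token_num > adapt_token_num:
          batch //= 2
      return max(batch, 1)
  ```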
- `trainer_conf`: Parameters for the `Trainer` class, including the optimizer, learning rate scheduler and schedule sampling scheduler (if necessary). For example:

  ```yaml
  trainer_conf:
    # optimizer and parameters
    optimizer: "adam"
    optimizer_kwargs:
      lr: 1.0e-3
      weight_decay: 1.0e-5
    lr_scheduler: "reduce_lr"
    # run lr_scheduler every epoch or step
    lr_scheduler_period: "epoch" # or "step"
    lr_scheduler_kwargs:
      min_lr: 1.0e-8
      patience: 1
      factor: 0.5
    # schedule sampling, for AM training only
    ss_scheduler: "linear"
    ss_scheduler_kwargs:
      ssr: 0.2
      epoch_beg: 10
      epoch_end: 26
      update_interval: 4
    # gradient clipping (norm)
    clip_gradient: 5
    # for early stop detection
    no_impr: 6
    no_impr_thres: 0.01
    # report metrics on the validation epoch
    report_metrics: ["loss", "accu", "@ctc"]
    stop_criterion: "accu"
    average_checkpoint: false
  ```
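  The linear schedule sampling above ramps the sampling rate from 0 up to `ssr` between `epoch_beg` and `epoch_end`. A sketch of that curve (illustrative; the real scheduler is part of the Trainer):

  ```python
  # Sketch: linear schedule sampling rate as a function of the epoch
  def linear_ssr(epoch: int, ssr: float = 0.2,
                 epoch_beg: int = 10, epoch_end: int = 26) -> float:
      if epoch < epoch_beg:
          return 0.0
      if epoch >= epoch_end:
          return ssr
      return ssr * (epoch - epoch_beg) / (epoch_end - epoch_beg)
  ```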
- `enh_transform`: Feature configuration for enhancement/separation tasks. Refer to `aps/transform/enh.py` for all the supported parameters. An example that uses log spectrogram features concatenated with cos-IPDs:

  ```yaml
  enh_transform:
    feats: spectrogram-log-ipd
    frame_len: 512
    frame_hop: 256
    window: sqrthann
    center: true
    # librosa or kaldi
    mode: librosa
    ipd_index: 0,1;0,2;0,3;0,4
    cos_ipd: true
  ```
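  The cos-IPD feature is simply the cosine of the phase difference between the STFTs of the microphone pairs given by `ipd_index`. A numpy sketch (illustrative; the shape convention is an assumption and APS computes this inside `aps/transform`):

  ```python
  # Sketch: cos-IPD for channel pairs (0,1), (0,2), (0,3), (0,4)
  import numpy as np

  def cos_ipd(stft: np.ndarray, pairs=((0, 1), (0, 2), (0, 3), (0, 4))):
      # stft: complex array, shape (num_channels, num_bins, num_frames)
      phase = np.angle(stft)
      return np.stack([np.cos(phase[l] - phase[r]) for l, r in pairs])
  ```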
- `asr_transform`: Feature configuration for ASR tasks. Refer to `aps/transform/asr.py` for all the supported parameters. For example, if we want to adopt log-fbank features with spectral augmentation (SpecAugment) and utterance-level mean/variance normalization, we have:

  ```yaml
  asr_transform:
    feats: perturb-fbank-log-cmvn-aug
    frame_len: 400
    frame_hop: 160
    window: hamm
    center: false
    pre_emphasis: 0.97
    mode: librosa
    round_pow_of_two: true
    sr: 16000
    num_mels: 80
    norm_mean: true
    norm_var: true
    speed_perturb: 0.9,1.0,1.1
    aug_prob: 1
    aug_time_args: [100, 1]
    aug_freq_args: [27, 1]
    aug_mask_zero: true
  ```

  Note that the above configuration is used with the raw waveform dataloader. For kaldi format features, the configuration is simpler (it only needs to apply cmvn and SpecAugment):

  ```yaml
  asr_transform:
    feats: cmvn-aug
    sr: 16000
    norm_mean: true
    norm_var: true
    gcmvn: data/aishell_v1/train/cmvn.ark
    aug_prob: 1
    aug_time_args: [100, 1]
    aug_freq_args: [27, 1]
    aug_mask_zero: true
  ```
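  SpecAugment itself is a simple masking operation on the feature matrix. A rough sketch, assuming `aug_time_args`/`aug_freq_args` follow a `[max_width, num_masks]` convention (this convention is an assumption; check `aps/transform/asr.py`):

  ```python
  # Sketch: zero out random time/frequency stripes of a (frames, mels) matrix
  import numpy as np

  def mask_axis(feats: np.ndarray, axis: int,
                max_width: int, num_masks: int) -> np.ndarray:
      size = feats.shape[axis]
      for _ in range(num_masks):
          width = np.random.randint(1, max_width + 1)
          beg = np.random.randint(0, max(size - width, 1))
          index = [slice(None)] * feats.ndim
          index[axis] = slice(beg, beg + width)
          feats[tuple(index)] = 0  # aug_mask_zero: true
      return feats

  feats = np.random.rand(300, 80)      # fake log-fbank features
  feats = mask_axis(feats, 0, 100, 1)  # time masking
  feats = mask_axis(feats, 1, 27, 1)   # frequency masking
  ```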
The other options, e.g., batch size and number of epochs, are passed as script arguments. See aps/scripts/train.sh and aps/scripts/distributed_train.sh.