[feature] add support for gemini-dfresnet (#291)

* [feature] add support for gemini-dfresnet
* fix lint errors
* add warmup of 6 epochs to config
* add warmup of 6 epochs to config
* add the results of gemini-df-resnet
* update the link for gemini models
wenet-e2e · Apr 25, 2024 · a3a046e
1 parent 0593804
commit a3a046e
Showing 7 changed files with 371 additions and 13 deletions.
diff --git a/docs/pretrained.md b/docs/pretrained.md
@@ -39,14 +39,16 @@ in [the voxconverse recipe](https://github.com/wenet-e2e/wespeaker/tree/master/e
 
 ## Model List
 
-| Datasets                                      | Languages | Checkpoint (pt)                                                                                                                                                                                                                     | Runtime Model (onnx)                                                                                                                                                                                                                  |
-|-----------------------------------------------|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN        | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.zip)     | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx)     |
-| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN        | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.zip)                                                                                                                 | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.onnx)                                                                                                                  |
-| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN        | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.zip)                                                                                                                 | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.onnx)                                                                                                                  |
-| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN        | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.zip)                                                                                                                 | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.onnx)                                                                                                                  |
-| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN        | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.zip) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.zip)                 | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.onnx) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.onnx)                 |
-| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN        | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.zip) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.zip)     | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.onnx) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.onnx)     |
-| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN        | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.zip) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.zip) | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.onnx) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.onnx) |
-| [CNCeleb](../examples/cnceleb/v2/README.md)   | CN        | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.zip)         | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.onnx)         |
+| Datasets  | Languages     |  Checkpoint (pt) | Runtime Model (onnx)     |
+|---    |---    |---   |---   |
+| [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx)  |
+| [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.zip)| [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.onnx)  |
+| [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.zip)| [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.onnx)  |
+| [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.zip)| [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.onnx)  |
+| [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.zip) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.zip) | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.onnx) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.onnx)  |
+| [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.zip) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.zip) | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.onnx) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.onnx)  |
+| [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.zip) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.zip) | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.onnx) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.onnx)  |
+| [VoxCeleb](../examples/voxceleb/v2/README.md)   | EN    | [Gemini_DFResnet114_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_gemini_dfresnet114_LM.zip)| [Gemini_DFResnet114_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_gemini_dfresnet114_LM.onnx)  |
+| [CNCeleb](../examples/cnceleb/v2/README.md)   | CN    | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.zip)  | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.onnx) |
+
 
diff --git a/examples/cnceleb/v3_finetune/README.md b/examples/cnceleb/v3_finetune/README.md
@@ -1,11 +1,9 @@
 ## Fine-tuning Results Based on DINO
 
-* Setup: fbank80, num_frms200, epoch75 (pretrain), epoch50 (finetune), ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
-* [Pre-trained ECAPA-TDNN checkpoints](https://drive.google.com/drive/folders/1XDIUjnKPrvJE5auBWT5CcE4mqcglCwzq?usp=drive_link): teacher models extracted from `model_75.pt` (please refer to `wespeaker/ssl/bin/average_dino_model.py` for information on the extraction process)
+* Setup: fbank80, num_frms200, epoch50 (finetune), ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
 * test_trials: CNC-Eval-Avg.lst
 * These results are obtained by pretraining on different datasets and then finetuning with CNCeleb.
 
-
 | Model                             | Params  |  FLOPs  |    Pretraining Data    | LM  | AS-Norm   | EER (%)   | minDCF (p=0.01)  |
 | :------------------------------   | :-----: | :-----: | :--------------------: | :-: | :-------: | :-------: | :--------------: |
 | ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M  | 2.65 G  |        CNCeleb         | ×   | ×         | 8.217     | 0.439            |
@@ -20,3 +18,8 @@
 * 🔥 UPDATE 2024.03: We support finetuning DINO-based self-supervised models, which are pretrained on the WenetSpeech dataset. Papers related to the pretraining and the finetuning results:
     * [WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition](https://arxiv.org/pdf/2110.03370.pdf)
     * [Leveraging In-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition](https://arxiv.org/pdf/2309.11730.pdf)
+
+## Resources
+* [Pre-trained ECAPA-TDNN checkpoints](https://drive.google.com/drive/folders/1XDIUjnKPrvJE5auBWT5CcE4mqcglCwzq?usp=drive_link)
+* [Filtering metadata for WenetSpeech](https://drive.google.com/file/d/1UaGuyT1wcKc5g9vRdfIBvLoDRcuOxBlX/view?usp=drive_link)
+
diff --git a/examples/voxceleb/v2/README.md b/examples/voxceleb/v2/README.md
@@ -47,6 +47,10 @@
 |                      |       |       | √ | √ | 0.744 | 0.896 | 1.603 |
 | Res2Net34_Base       | 4.68M | 1.77G | × | × | 1.351 | 1.347 | 2.478 |
 |                      |       |       | × | √ | 1.234 | 1.232 | 2.162 |
+| Gemini_DFResNet114   | 6.53M | 5.42G | × | × | 0.787 | 0.963 | 1.760 |
+|                      |       |       | × | √ | 0.707 | 0.889 | 1.546 |
+|                      |       |       | √ | × | 0.771 | 0.906 | 1.599 |
+|                      |       |       | √ | √ | 0.638 | 0.839 | 1.427 |
 
 
 ## PLDA results

diff --git a/examples/voxceleb/v2/conf/gemini_dfresnet_adam.yaml b/examples/voxceleb/v2/conf/gemini_dfresnet_adam.yaml
@@ -0,0 +1,81 @@
+### train configuration
+
+exp_dir: exp/Gemini_DF_ResNet114-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165
+gpus: "[0,1]"
+num_avg: 2
+enable_amp: False # whether to enable automatic mixed precision training
+
+seed: 42
+num_epochs: 165
+save_epoch_interval: 5 # save model every 5 epochs
+log_batch_interval: 100 # log every 100 batches
+
+dataloader_args:
+  batch_size: 128
+  num_workers: 8
+  pin_memory: False
+  prefetch_factor: 8
+  drop_last: True
+
+dataset_args:
+  # the number of samples traversed within one epoch; if the value equals 0,
+  # the number of utterances in the dataset is used as sample_num_per_epoch.
+  sample_num_per_epoch: 0
+  shuffle: True
+  shuffle_args:
+    shuffle_size: 2500
+  filter: True
+  filter_args:
+    min_num_frames: 100
+    max_num_frames: 800
+  resample_rate: 16000
+  speed_perturb: True
+  num_frms: 200
+  aug_prob: 0.6 # prob to add reverb & noise aug per sample
+  fbank_args:
+    num_mel_bins: 80
+    frame_shift: 10
+    frame_length: 25
+    dither: 1.0
+  spec_aug: False
+  spec_aug_args:
+    num_t_mask: 1
+    num_f_mask: 1
+    max_t: 10
+    max_f: 8
+    prob: 0.6
+
+model: Gemini_DF_ResNet114 # Gemini_DF_ResNet60 Gemini_DF_ResNet114 Gemini_DF_ResNet183 Gemini_DF_ResNet237
+model_init: null
+model_args:
+  feat_dim: 80
+  embed_dim: 256
+  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
+  two_emb_layer: False
+projection_args:
+  project_type: "arc_margin" # add_margin, arc_margin, sphere, sphereface2, softmax, arc_margin_intertopk_subcenter
+  scale: 32.0
+  easy_margin: False
+
+margin_scheduler: MarginScheduler
+margin_update:
+  initial_margin: 0.2
+  final_margin: 0.2
+  increase_start_epoch: 20
+  fix_start_epoch: 40
+  update_margin: False
+  increase_type: "exp" # exp, linear
+
+loss: CrossEntropyLoss
+loss_args: {}
+
+optimizer: AdamW
+optimizer_args:
+  weight_decay: 0.05
+
+scheduler: ExponentialDecrease
+scheduler_args:
+  initial_lr: 0.000125
+  final_lr: 0.000001
+  warm_up_epoch: 6
+  warm_from_zero: False
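
The `scheduler_args` above describe a learning rate that decays exponentially from `initial_lr` to `final_lr` over `num_epochs`, with a warmup phase. The following is a minimal sketch of one common way to combine the two, not necessarily WeSpeaker's own `ExponentialDecrease` implementation. The function name `lr_at` is hypothetical, and the sketch assumes that `warm_from_zero: False` means no ramp is applied; any interaction with multi-GPU learning-rate scaling is not modelled.

```python
import math


def lr_at(epoch, num_epochs, initial_lr, final_lr, warm_up_epoch, warm_from_zero):
    """Learning rate after `epoch` epochs (fractional values allowed).

    Hypothetical helper for illustration only.
    """
    # Exponential interpolation: initial_lr at epoch 0, final_lr at num_epochs.
    frac = epoch / num_epochs
    lr = initial_lr * math.exp(frac * math.log(final_lr / initial_lr))
    # Assumed warmup behaviour: a linear ramp from zero during the first
    # `warm_up_epoch` epochs, applied only when warm_from_zero is True.
    if warm_from_zero and epoch < warm_up_epoch:
        lr *= epoch / warm_up_epoch
    return lr


if __name__ == "__main__":
    # Values taken from gemini_dfresnet_adam.yaml.
    for epoch in (0, 6, 80, 165):
        print(epoch, lr_at(epoch, 165, 1.25e-4, 1.0e-6, 6, False))
```

Under these assumptions the schedule starts at `initial_lr` and reaches `final_lr` exactly at the last epoch; with `warm_from_zero: True` (as in the large margin config) it starts at zero instead.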
diff --git a/examples/voxceleb/v2/conf/gemini_dfresnet_sgd_lm.yaml b/examples/voxceleb/v2/conf/gemini_dfresnet_sgd_lm.yaml
@@ -0,0 +1,91 @@
+### Large margin fine-tuning configuration
+#
+#   Large margin fine-tuning is often used in speaker verification
+#   challenge systems to further improve performance. In this
+#   fine-tuning stage, a larger margin and longer segments are
+#   used.
+
+exp_dir: exp/Gemini_DF_ResNet114-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165-LM
+gpus: "[0,1]"
+num_avg: 1
+enable_amp: False # whether to enable automatic mixed precision training
+do_lm: True
+
+seed: 42
+num_epochs: 5
+save_epoch_interval: 1 # save model per epoch
+log_batch_interval: 100 # log every 100 batches
+
+dataloader_args:
+  batch_size: 32
+  num_workers: 8
+  pin_memory: False
+  prefetch_factor: 8
+  drop_last: True
+
+dataset_args:
+  # the number of samples traversed within one epoch; if the value equals 0,
+  # the number of utterances in the dataset is used as sample_num_per_epoch.
+  sample_num_per_epoch: 0
+  shuffle: True
+  shuffle_args:
+    shuffle_size: 2500
+  filter: True
+  filter_args:
+    min_num_frames: 100
+    max_num_frames: 800
+  resample_rate: 16000
+  speed_perturb: True
+  num_frms: 600
+  aug_prob: 0.6 # prob to add reverb & noise aug per sample
+  fbank_args:
+    num_mel_bins: 80
+    frame_shift: 10
+    frame_length: 25
+    dither: 1.0
+  spec_aug: False
+  spec_aug_args:
+    num_t_mask: 1
+    num_f_mask: 1
+    max_t: 10
+    max_f: 8
+    prob: 0.6
+
+model: Gemini_DF_ResNet114 # Gemini_DF_ResNet60 Gemini_DF_ResNet114 Gemini_DF_ResNet183 Gemini_DF_ResNet237
+model_init: null
+model_args:
+  feat_dim: 80
+  embed_dim: 256
+  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
+  two_emb_layer: False
+projection_args:
+  project_type: "arc_margin" # add_margin, arc_margin, sphere, softmax, arc_margin_intertopk_subcenter
+  scale: 32.0
+  easy_margin: False
+
+margin_scheduler: MarginScheduler
+margin_update:
+  initial_margin: 0.5
+  final_margin: 0.5
+  increase_start_epoch: 1
+  fix_start_epoch: 1
+  update_margin: True
+  increase_type: "exp" # exp, linear
+
+loss: CrossEntropyLoss
+loss_args: {}
+
+optimizer: SGD
+optimizer_args:
+  momentum: 0.9
+  nesterov: True
+  weight_decay: 0.0001
+
+scheduler: ExponentialDecrease
+scheduler_args:
+  initial_lr: 1.0e-4
+  final_lr: 2.5e-5
+  warm_up_epoch: 1
+  warm_from_zero: True
+
+
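
The two configs differ in `num_frms` (200 for training, 600 for large margin fine-tuning), which is what "longer segments" refers to. As a rough worked example, the audio span covered by a chunk can be estimated from `frame_shift` and `frame_length`. The helper name `chunk_duration_ms` is hypothetical, and the formula assumes frames are extracted without edge padding.

```python
def chunk_duration_ms(num_frms, frame_shift_ms=10, frame_length_ms=25):
    """Audio span in milliseconds covered by `num_frms` feature frames.

    Assumes no edge padding: the first frame starts at 0 ms and each
    later frame starts `frame_shift_ms` after the previous one.
    """
    return (num_frms - 1) * frame_shift_ms + frame_length_ms


if __name__ == "__main__":
    print(chunk_duration_ms(200))  # training config: 2015 ms, about 2 s
    print(chunk_duration_ms(600))  # large margin config: 6015 ms, about 6 s
```

So under these assumptions the fine-tuning stage sees chunks roughly three times as long as the initial training stage.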