[feature] add support for gemini-dfresnet (#291)
* [feature] add support for gemini-dfresnet

* fix lint errors

* add warmup of 6 epochs to config

* add warmup of 6 epochs to config

* add the results of gemini-df-resnet

* update the link for gemini models
wsstriving authored Apr 25, 2024
1 parent 0593804 commit a3a046e
Showing 7 changed files with 371 additions and 13 deletions.
22 changes: 12 additions & 10 deletions docs/pretrained.md
@@ -39,14 +39,16 @@ in [the voxconverse recipe](https://github.com/wenet-e2e/wespeaker/tree/master/e

## Model List

| Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
|-----------------------------------------------|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.zip) | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.zip) | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.zip) | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.zip) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.zip) | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.onnx) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.zip) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.zip) | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.onnx) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.zip) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.zip) | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.onnx) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.onnx) |
| [CNCeleb](../examples/cnceleb/v2/README.md) | CN | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.onnx) |
| Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
|--- |--- |--- |--- |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet34_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.zip)| [ResNet152_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet152_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.zip)| [ResNet221_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet221_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.zip)| [ResNet293_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_resnet293_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.zip) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.zip) | [CAM++](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++.onnx) / [CAM++_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_CAM++_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.zip) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.zip) | [ECAPA512](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512.onnx) / [ECAPA512_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA512_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.zip) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.zip) | [ECAPA1024](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024.onnx) / [ECAPA1024_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_ECAPA1024_LM.onnx) |
| [VoxCeleb](../examples/voxceleb/v2/README.md) | EN | [Gemini_DFResnet114_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_gemini_dfresnet114_LM.zip)| [Gemini_DFResnet114_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/voxceleb/voxceleb_gemini_dfresnet114_LM.onnx) |
| [CNCeleb](../examples/cnceleb/v2/README.md) | CN | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.zip) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.zip) | [ResNet34](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34.onnx) / [ResNet34_LM](https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models/cnceleb/cnceleb_resnet34_LM.onnx) |

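The checkpoint and runtime links in the table share one naming pattern, so a small helper can build both URLs for a given entry. This is a minimal sketch: `model_urls` and `download` are hypothetical names, and the pattern is inferred from the table rows rather than taken from a documented WeSpeaker API.

```python
import urllib.request
from pathlib import Path

# Base URL copied from the links in the model table.
BASE = "https://wespeaker-1256283475.cos.ap-shanghai.myqcloud.com/models"


def model_urls(dataset: str, model: str) -> dict:
    """Build the checkpoint (.zip) and runtime (.onnx) URLs for one table entry.

    Hypothetical helper: the naming pattern is inferred from the table rows.
    """
    stem = f"{BASE}/{dataset}/{dataset}_{model}"
    return {"pt": f"{stem}.zip", "onnx": f"{stem}.onnx"}


def download(url: str, out_dir: str = "pretrained") -> Path:
    """Download one file into out_dir and return the local path."""
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(target))
    return target


urls = model_urls("voxceleb", "gemini_dfresnet114_LM")
print(urls["pt"])
print(urls["onnx"])
# download(urls["onnx"])  # uncomment to fetch the file (needs network access)
```

The download call is left commented out so the snippet runs without network access.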

9 changes: 6 additions & 3 deletions examples/cnceleb/v3_finetune/README.md
@@ -1,11 +1,9 @@
## Fine-tuning Results Based on DINO

* Setup: fbank80, num_frms200, epoch75 (pretrain), epoch50 (finetune), ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
* [Pre-trained ECAPA-TDNN checkpoints](https://drive.google.com/drive/folders/1XDIUjnKPrvJE5auBWT5CcE4mqcglCwzq?usp=drive_link): teacher models extracted from `model_75.pt` (please refer to `wespeaker/ssl/bin/average_dino_model.py` for information on the extraction process)
* Setup: fbank80, num_frms200, epoch50 (finetune), ArcMargin, aug_prob0.6, speed_perturb (no spec_aug)
* test_trials: CNC-Eval-Avg.lst
* These results are obtained by pretraining on different datasets and then finetuning with CNCeleb.


| Model | Params | FLOPs | Pretraining Data | LM | AS-Norm | EER (%) | minDCF (p=0.01) |
| :------------------------------ | :-----: | :-----: | :--------------------: | :-: | :-------: | :-------: | :--------------: |
| ECAPA_TDNN_GLOB_c1024-ASTP-emb192 | 14.65M | 2.65 G | CNCeleb | × | × | 8.217 | 0.439 |
@@ -20,3 +18,8 @@
* 🔥 UPDATE 2024.03: We support finetuning DINO-based self-supervised models, which are trained on the WenetSpeech dataset. Papers related to the finetuning results:
* [WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition](https://arxiv.org/pdf/2110.03370.pdf)
* [Leveraging In-the-wild Data for Effective Self-supervised Pretraining in Speaker Recognition](https://arxiv.org/pdf/2309.11730.pdf)

## Resources
* [Pre-trained ECAPA-TDNN checkpoints](https://drive.google.com/drive/folders/1XDIUjnKPrvJE5auBWT5CcE4mqcglCwzq?usp=drive_link)
* [The filtering metadata for wenetspeech](https://drive.google.com/file/d/1UaGuyT1wcKc5g9vRdfIBvLoDRcuOxBlX/view?usp=drive_link)

4 changes: 4 additions & 0 deletions examples/voxceleb/v2/README.md
@@ -47,6 +47,10 @@
| | | | √ | √ | 0.744 | 0.896 | 1.603 |
| Res2Net34_Base | 4.68M | 1.77G | × | × | 1.351 | 1.347 | 2.478 |
| | | | × | √ | 1.234 | 1.232 | 2.162 |
| Gemini_DFResNet114 | 6.53M | 5.42G | × | × | 0.787 | 0.963 | 1.760 |
| | | | × | √ | 0.707 | 0.889 | 1.546 |
| | | | √ | × | 0.771 | 0.906 | 1.599 |
| | | | √ | √ | 0.638 | 0.839 | 1.427 |

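To put the four Gemini_DFResNet114 rows in perspective, the relative error reduction between the first row and the last row can be computed directly. This is a small arithmetic sketch; the column labels `vox1-O`, `vox1-E`, and `vox1-H` are assumptions, since the table header lies outside this hunk.

```python
# Values copied from the first and last Gemini_DFResNet114 rows.
# The column labels are assumed; the header is not part of this hunk.
first_row = {"vox1-O": 0.787, "vox1-E": 0.963, "vox1-H": 1.760}
last_row = {"vox1-O": 0.638, "vox1-E": 0.839, "vox1-H": 1.427}


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before


reductions = {
    name: relative_reduction(first_row[name], last_row[name])
    for name in first_row
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative reduction")
```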

## PLDA results
81 changes: 81 additions & 0 deletions examples/voxceleb/v2/conf/gemini_dfresnet_adam.yaml
@@ -0,0 +1,81 @@
### train configuration

exp_dir: exp/Gemini_DF_ResNet114-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165
gpus: "[0,1]"
num_avg: 2
enable_amp: False # whether to enable automatic mixed precision training

seed: 42
num_epochs: 165
save_epoch_interval: 5 # save model every 5 epochs
log_batch_interval: 100 # log every 100 batches

dataloader_args:
  batch_size: 128
  num_workers: 8
  pin_memory: False
  prefetch_factor: 8
  drop_last: True

dataset_args:
  # the number of samples traversed within one epoch; if the value is 0,
  # the number of utterances in the dataset is used as sample_num_per_epoch.
  sample_num_per_epoch: 0
  shuffle: True
  shuffle_args:
    shuffle_size: 2500
  filter: True
  filter_args:
    min_num_frames: 100
    max_num_frames: 800
  resample_rate: 16000
  speed_perturb: True
  num_frms: 200
  aug_prob: 0.6 # prob to add reverb & noise aug per sample
  fbank_args:
    num_mel_bins: 80
    frame_shift: 10
    frame_length: 25
    dither: 1.0
  spec_aug: False
  spec_aug_args:
    num_t_mask: 1
    num_f_mask: 1
    max_t: 10
    max_f: 8
    prob: 0.6

model: Gemini_DF_ResNet114 # Gemini_DF_ResNet60, Gemini_DF_ResNet114, Gemini_DF_ResNet183, Gemini_DF_ResNet237
model_init: null
model_args:
  feat_dim: 80
  embed_dim: 256
  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
  two_emb_layer: False
projection_args:
  project_type: "arc_margin" # add_margin, arc_margin, sphere, sphereface2, softmax, arc_margin_intertopk_subcenter
  scale: 32.0
  easy_margin: False

margin_scheduler: MarginScheduler
margin_update:
  initial_margin: 0.2
  final_margin: 0.2
  increase_start_epoch: 20
  fix_start_epoch: 40
  update_margin: False
  increase_type: "exp" # exp, linear

loss: CrossEntropyLoss
loss_args: {}

optimizer: AdamW
optimizer_args:
  weight_decay: 0.05

scheduler: ExponentialDecrease
scheduler_args:
  initial_lr: 0.000125
  final_lr: 0.000001
  warm_up_epoch: 6
  warm_from_zero: False
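
The `scheduler_args` above describe a learning rate that decays exponentially from `initial_lr` to `final_lr` over training. The sketch below illustrates that shape under stated assumptions: `exponential_decrease` is a hypothetical function, not WeSpeaker's `ExponentialDecrease` class, and the real scheduler may differ (for example in how it handles multi-GPU scaling or uses `warm_up_epoch` when `warm_from_zero` is `False`). In this simplified version, warm-up is applied only when `warm_from_zero` is `True`.

```python
import math


def exponential_decrease(epoch, num_epochs, initial_lr, final_lr,
                         warm_up_epoch=0, warm_from_zero=False):
    """Simplified learning rate at a (possibly fractional) epoch.

    Assumption: plain exponential interpolation between initial_lr and
    final_lr, with an optional linear ramp from zero during warm-up.
    """
    progress = epoch / num_epochs
    lr = initial_lr * math.exp(progress * math.log(final_lr / initial_lr))
    if warm_from_zero and epoch < warm_up_epoch:
        # Linear ramp from 0 up to the decayed value at the end of warm-up.
        lr *= epoch / warm_up_epoch
    return lr


# Values copied from gemini_dfresnet_adam.yaml.
for epoch in (0, 6, 80, 165):
    lr = exponential_decrease(epoch, 165, 0.000125, 0.000001,
                              warm_up_epoch=6, warm_from_zero=False)
    print(f"epoch {epoch:3d}: lr = {lr:.3e}")
```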
91 changes: 91 additions & 0 deletions examples/voxceleb/v2/conf/gemini_dfresnet_sgd_lm.yaml
@@ -0,0 +1,91 @@
### Large margin fine-tuning configuration
#
# Large margin fine-tuning is often used in speaker
# verification challenge systems to further improve performance.
# In this fine-tuning stage, a larger margin and longer segments
# are used.

exp_dir: exp/Gemini_DF_ResNet114-TSTP-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165-LM
gpus: "[0,1]"
num_avg: 1
enable_amp: False # whether to enable automatic mixed precision training
do_lm: True

seed: 42
num_epochs: 5
save_epoch_interval: 1 # save model per epoch
log_batch_interval: 100 # log every 100 batches

dataloader_args:
  batch_size: 32
  num_workers: 8
  pin_memory: False
  prefetch_factor: 8
  drop_last: True

dataset_args:
  # the number of samples traversed within one epoch; if the value is 0,
  # the number of utterances in the dataset is used as sample_num_per_epoch.
  sample_num_per_epoch: 0
  shuffle: True
  shuffle_args:
    shuffle_size: 2500
  filter: True
  filter_args:
    min_num_frames: 100
    max_num_frames: 800
  resample_rate: 16000
  speed_perturb: True
  num_frms: 600
  aug_prob: 0.6 # prob to add reverb & noise aug per sample
  fbank_args:
    num_mel_bins: 80
    frame_shift: 10
    frame_length: 25
    dither: 1.0
  spec_aug: False
  spec_aug_args:
    num_t_mask: 1
    num_f_mask: 1
    max_t: 10
    max_f: 8
    prob: 0.6

model: Gemini_DF_ResNet114 # Gemini_DF_ResNet60, Gemini_DF_ResNet114, Gemini_DF_ResNet183, Gemini_DF_ResNet237
model_init: null
model_args:
  feat_dim: 80
  embed_dim: 256
  pooling_func: "TSTP" # TSTP, ASTP, MQMHASTP
  two_emb_layer: False
projection_args:
  project_type: "arc_margin" # add_margin, arc_margin, sphere, softmax, arc_margin_intertopk_subcenter
  scale: 32.0
  easy_margin: False

margin_scheduler: MarginScheduler
margin_update:
  initial_margin: 0.5
  final_margin: 0.5
  increase_start_epoch: 1
  fix_start_epoch: 1
  update_margin: True
  increase_type: "exp" # exp, linear

loss: CrossEntropyLoss
loss_args: {}

optimizer: SGD
optimizer_args:
  momentum: 0.9
  nesterov: True
  weight_decay: 0.0001

scheduler: ExponentialDecrease
scheduler_args:
  initial_lr: 1.0e-4
  final_lr: 2.5e-5
  warm_up_epoch: 1
  warm_from_zero: True
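
The large margin stage changes only a handful of settings relative to the base training config. The sketch below lists those differences and derives two quantities from them: the approximate crop length in seconds and the number of feature frames per batch. The dictionaries are hand-copied excerpts of the two YAML files in this commit; the flat `margin` key is a shorthand for `margin_update.final_margin`, and the helper names are hypothetical.

```python
# Hand-copied excerpts from gemini_dfresnet_adam.yaml and
# gemini_dfresnet_sgd_lm.yaml; "margin" stands for margin_update.final_margin.
BASE = {"num_epochs": 165, "batch_size": 128, "num_frms": 200,
        "margin": 0.2, "optimizer": "AdamW"}
LM = {"num_epochs": 5, "batch_size": 32, "num_frms": 600,
      "margin": 0.5, "optimizer": "SGD"}

FRAME_SHIFT_MS = 10  # fbank_args.frame_shift in both files


def crop_seconds(num_frms: int, frame_shift_ms: int = FRAME_SHIFT_MS) -> float:
    """Approximate crop duration, ignoring the overhang of the last window."""
    return num_frms * frame_shift_ms / 1000.0


def frames_per_batch(cfg: dict) -> int:
    """Number of feature frames in one batch."""
    return cfg["batch_size"] * cfg["num_frms"]


changed = {key: (BASE[key], LM[key]) for key in BASE if BASE[key] != LM[key]}

for key, (before, after) in changed.items():
    print(f"{key}: {before} -> {after}")
print("crop length (s):", crop_seconds(BASE["num_frms"]), "->", crop_seconds(LM["num_frms"]))
print("frames per batch:", frames_per_batch(BASE), "->", frames_per_batch(LM))
```

One plausible reading of these numbers is that the smaller batch size offsets the threefold longer crops, keeping the frames per batch in a similar range; the commit itself does not state the reason.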

