
AST: Audio Spectrogram Transformer - Transfer Learning for Accent Classification

Introduction

Illustration of AST.

This repository contains Ariadne Matos' fork of the official PyTorch implementation of the Audio Spectrogram Transformer (AST) proposed in the Interspeech 2021 paper AST: Audio Spectrogram Transformer (Yuan Gong, Yu-An Chung, James Glass).

AST is the first convolution-free, purely attention-based model for audio classification. It supports variable-length input and can be applied to various tasks.

Getting Started

Step 1. Clone or download this repository and set it as the working directory, create a virtual environment and install the dependencies.

cd ast/ 
python3 -m venv venvast
source venvast/bin/activate
pip install -r requirements.txt 

Step 2. Test the AST model.

ASTModel(label_dim=527,
         fstride=10, tstride=10,
         input_fdim=128, input_tdim=1024,
         imagenet_pretrain=True, audioset_pretrain=False,
         model_size='base384')

Parameters:
label_dim: The number of classes. (default: 527)
fstride: The stride of patch splitting on the frequency dimension. For 16*16 patches, fstride=16 means no overlap and fstride=10 means an overlap of 6 (used in the paper). (default: 10)
tstride: The stride of patch splitting on the time dimension. For 16*16 patches, tstride=16 means no overlap and tstride=10 means an overlap of 6 (used in the paper). (default: 10)
input_fdim: The number of frequency bins of the input spectrogram. (default: 128)
input_tdim: The number of time frames of the input spectrogram. (default: 1024, i.e., 10.24s)
imagenet_pretrain: If True, use the ImageNet pretrained model. (default: True; we recommend setting it to True for all tasks.)
audioset_pretrain: If True, use the full AudioSet and ImageNet pretrained model. Currently only the base384 model with fstride=tstride=10 is supported. (default: False; we recommend setting it to True for all tasks except AudioSet itself.)
model_size: The model size of AST, one of [tiny224, small224, base224, base384]. (default: base384)

Input: Tensor of shape [batch_size, temporal_frame_num, frequency_bin_num]. Note: the input spectrogram should be normalized with the dataset mean and std (see the normalization note in the downstream-task section below).
Output: Tensor of raw logits (i.e., without Sigmoid) in shape [batch_size, label_dim].
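As a quick sanity check on how fstride and tstride trade patch overlap for sequence length, the small sketch below reproduces the patch-grid arithmetic (a 16*16 patch slid with the given strides over the spectrogram). With the defaults it gives the 12 x 101 = 1212 patches reported in the paper; this is an illustrative calculation only, not a call into the library.

def num_patches(input_fdim=128, input_tdim=1024, fstride=10, tstride=10, patch=16):
    # number of 16*16 patches along each axis when sliding with the given strides
    f = (input_fdim - patch) // fstride + 1
    t = (input_tdim - patch) // tstride + 1
    return f, t, f * t

print(num_patches())                        # (12, 101, 1212) -> overlap of 6, as in the paper
print(num_patches(fstride=16, tstride=16))  # (8, 64, 512)    -> no overlap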

cd ast/src
python
import os 
import torch
from models import ASTModel 
# download pretrained model in this directory
os.environ['TORCH_HOME'] = '../pretrained_models'  
# assume each input spectrogram has 100 time frames
input_tdim = 100
# assume the task has 527 classes
label_dim = 527
# create a pseudo input: a batch of 10 spectrograms, each with 100 time frames and 128 frequency bins
test_input = torch.rand([10, input_tdim, 128]) 
# create an AST model
ast_mdl = ASTModel(label_dim=label_dim, input_tdim=input_tdim, imagenet_pretrain=True)
test_output = ast_mdl(test_input) 
# output should be in shape [10, 527], i.e., 10 samples, each with prediction of 527 classes. 
print(test_output.shape)  

Use Pretrained Model For Downstream Tasks

You can use the pretrained AST model for your own dataset. There are two ways to do so.

You can of course take only ast/src/models/ast_models.py, set audioset_pretrain=True, and use it with your own training pipeline; the only thing you need to take care of is the input normalization. We normalize our input to 0 mean and 0.5 std. To use the pretrained model, you should roughly normalize your input to this range. You can check ast/src/get_norm_stats.py to see how we compute the stats, or you can try using our AudioSet normalization input_spec = (input_spec + 4.26) / (4.57 * 2). Using your own training pipeline might be easier if you already have a good one. Please note that AST needs a smaller learning rate (we use a learning rate 10 times smaller than for the CNN model proposed in the PSLA paper) and converges faster, so please search for a suitable learning rate and learning rate scheduler for your task.
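As a rough illustration of this first option, the sketch below computes a 128-bin log-Mel filterbank with torchaudio, applies the AudioSet normalization above, and runs an AudioSet+ImageNet pretrained AST on it. The filterbank settings (128 Mel bins, 10ms frame shift, Hanning window) are assumed to match this repository's data loader; num_classes and wav_path are placeholders for your own accent-classification setup, and in a real pipeline you would pad or truncate every spectrogram to a fixed input_tdim.

import os
import torch
import torchaudio
from models import ASTModel  # ast/src/models/ast_models.py

os.environ['TORCH_HOME'] = '../pretrained_models'  # where the pretrained checkpoint is downloaded
num_classes = 5          # placeholder: number of accent classes in your dataset
wav_path = 'sample.wav'  # placeholder: a 16kHz mono wav file

# load audio and compute a 128-bin log-Mel filterbank with a 10ms hop
waveform, sr = torchaudio.load(wav_path)
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

# roughly normalize to 0 mean / 0.5 std using the AudioSet stats mentioned above
fbank = (fbank + 4.26) / (4.57 * 2)

# AudioSet+ImageNet pretrained AST with a new classification head for the accent labels
ast_mdl = ASTModel(label_dim=num_classes, input_tdim=fbank.shape[0],
                   imagenet_pretrain=True, audioset_pretrain=True,
                   model_size='base384')

logits = ast_mdl(fbank.unsqueeze(0))  # shape [1, num_classes], raw logits
print(logits.shape)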

If you want to use our training pipeline, you need to modify the items below for your new dataset.

  1. You need to create a data json file and a label index for your dataset; see ast/egs/audioset/data/ for an example.
  2. In /your_dataset/run.sh, you need to specify the data json file path. You need to set dataset_mean and dataset_std; if you don't know them, you can use our AudioSet stats (mean=-4.27, std=4.57). You need to set audio_length, which should be the number of frames (e.g., with a 10ms hop, 10-second audio = 1000 frames). You need to set the metric in [acc, mAP] and the loss in [CE, BCE]. You need to set the initial learning rate lr and the learning rate scheduler lrscheduler_{start,step,decay}. You also need to set the SpecAug parameters (freqm and timem; we recommend masking 48 frequency bins out of 128 and 20% of your time frames), the mixup rate (i.e., how many samples are mixup samples), batch size, etc. While it seems like a lot, it is easy if you start from one of our recipes: ast/egs/[audioset,esc50,speechcommands]/run.sh.

To summarize, to use our training pipeline, you need to create the data files and modify the shell script. You can refer to our ESC-50 and Speech Commands recipes.
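As a minimal sketch of the data files in item 1, the snippet below writes a label index csv (index, mid, display_name) and a data json whose "data" list holds one entry per utterance with "wav" and "labels" fields, mirroring the layout of the ast/egs/audioset/data/ example. The accent names, mid strings, and wav paths are placeholders for your own dataset.

import csv
import json

# placeholder accent labels and (wav path, accent) pairs for illustration
accents = ['recife', 'sao_paulo', 'salvador']
examples = [('/data/audio/spk1_utt1.wav', 'recife'),
            ('/data/audio/spk2_utt1.wav', 'sao_paulo')]

# label index csv: one row per class
with open('class_labels_indices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['index', 'mid', 'display_name'])
    for i, accent in enumerate(accents):
        writer.writerow([i, f'/m/accent{i:02d}', accent])

# data json: one entry per utterance, 'labels' holds the mid of its class
mid_of = {accent: f'/m/accent{i:02d}' for i, accent in enumerate(accents)}
data = {'data': [{'wav': wav, 'labels': mid_of[accent]} for wav, accent in examples]}
with open('train_data.json', 'w') as f:
    json.dump(data, f, indent=1)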

Also, please note that the pretrained model was trained on 16kHz audio, so if you want to use it, please prepare your data at 16kHz.
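If your recordings are not already at 16kHz, a simple way to convert them, assuming torchaudio is available (the file paths below are placeholders), is:

import torchaudio

waveform, sr = torchaudio.load('original_44k.wav')   # placeholder input path
if sr != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(waveform)
torchaudio.save('resampled_16k.wav', waveform, 16000)  # placeholder output path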

Citing

The first paper proposes the Audio Spectrogram Transformer while the second paper describes the training pipeline that we applied on AST to achieve the new state-of-the-art on AudioSet.

@inproceedings{gong21b_interspeech,
  author={Yuan Gong and Yu-An Chung and James Glass},
  title={{AST: Audio Spectrogram Transformer}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={571--575},
  doi={10.21437/Interspeech.2021-698}
}
@article{gong_psla,
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation},
  year={2021},
  doi={10.1109/TASLP.2021.3120633}
}

Contact

If you have a question, or just want to share how you have used this, send me an email at ariadnenascime6@gmail.com.
