- This repo is a legacy version from when the Mockingjay paper was first released.
- For our improved and actively maintained implementation of Mockingjay, please visit the S3PRL project.
This is an open source project for Mockingjay, an unsupervised algorithm for learning speech representations introduced and described in the paper "Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders".
Feel free to use or modify the code; any bug report or improvement suggestion will be appreciated. If you have any questions, please contact f07942089@ntu.edu.tw. If you find this project helpful for your research, please consider citing this paper. Thanks!
You can find pre-trained models here:
http://bit.ly/result_mockingjay
Their usage is explained below and further in Step 3 of the Instruction Section.
With this repo and the trained models, you can extract speech representations from your target dataset. To do so, feed-forward the trained model on the target dataset and retrieve the extracted features by running the following example Python code (example_extract.py):
import torch
from runner_mockingjay import get_mockingjay_model
example_path = 'result/result_mockingjay/mockingjay_libri_sd1337_LinearLarge/mockingjay-500000.ckpt'
mockingjay = get_mockingjay_model(from_path=example_path)
# A batch of spectrograms: (batch_size, seq_len, mel_dim)
spec = torch.zeros(3, 800, 160)
# reps.shape: (batch_size, num_hidden_layers, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=True)
# reps.shape: (batch_size, num_hidden_layers, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=False)
# reps.shape: (batch_size, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=True)
# reps.shape: (batch_size, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=False)
spec is the input spectrogram of the mockingjay model, where:
- spec needs to be a PyTorch tensor with shape (seq_len, mel_dim) or (batch_size, seq_len, mel_dim).
- mel_dim is the spectrogram feature dimension, which by default is mel_dim == 160; see utility/audio.py for more preprocessing details.
reps is a PyTorch tensor of various possible shapes, where:
- batch_size is the inference batch size.
- num_hidden_layers is the transformer encoder depth of the mockingjay model.
- seq_len is the maximum sequence length in the batch.
- downsample_rate is the number of consecutive feature vectors stacked together to reduce the length of input sequences.
- hidden_size is the dimensionality of the transformer encoder layers.
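The effect of downsample_rate on sequence length can be illustrated with a small, dependency-free sketch. The helper name stack_frames is hypothetical and not part of this repo, and dropping leftover frames that do not fill a complete group is an assumption rather than confirmed behavior:

```python
def stack_frames(frames, downsample_rate):
    """Concatenate every `downsample_rate` consecutive frames into one vector.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Assumption: trailing frames that do not fill a complete group are dropped.
    """
    if downsample_rate < 1:
        raise ValueError("downsample_rate must be a positive integer")
    usable = len(frames) - len(frames) % downsample_rate
    stacked = []
    for start in range(0, usable, downsample_rate):
        group = frames[start:start + downsample_rate]
        # Flatten the group of frames into a single, longer feature vector
        stacked.append([value for frame in group for value in frame])
    return stacked


# 7 frames of dimension 2, stacked in groups of 3
frames = [[float(t), float(t) + 0.5] for t in range(7)]
stacked = stack_frames(frames, downsample_rate=3)
print(len(stacked), len(stacked[0]))  # 2 6
```

The sequence gets shorter by the stacking factor while each step carries more features, which is why the shapes with tile=False contain seq_len // downsample_rate.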
The output shape of reps is determined by two arguments:
- all_layers is a boolean that controls whether to output all the Encoder layers; if False, only the hidden representations of the last Encoder layer are returned.
- tile is a boolean that controls whether to tile the representations to match the input seq_len of spec.
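The four cases in the example above can be summarized by a small shape calculator. This is only a sketch: expected_reps_shape is a hypothetical helper, and the values used for num_hidden_layers, hidden_size, and downsample_rate are placeholders, so read the real ones from your model's config.

```python
def expected_reps_shape(batch_size, seq_len, hidden_size, num_hidden_layers,
                        downsample_rate, all_layers, tile):
    """Return the shape of `reps` as documented in example_extract.py."""
    # tile=True restores the input length; tile=False keeps the reduced length
    out_len = seq_len if tile else seq_len // downsample_rate
    if all_layers:
        return (batch_size, num_hidden_layers, out_len, hidden_size)
    return (batch_size, out_len, hidden_size)


# Placeholder model settings, for illustration only
settings = dict(batch_size=3, seq_len=800, hidden_size=768,
                num_hidden_layers=12, downsample_rate=3)
for all_layers in (True, False):
    for tile in (True, False):
        shape = expected_reps_shape(all_layers=all_layers, tile=tile, **settings)
        print(f"all_layers={all_layers}, tile={tile} -> {shape}")
```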
As you can see, reps is essentially the Transformer Encoder hidden representations in the mockingjay model. You can think of Mockingjay as a speech version of BERT if you are familiar with it.
There are many ways to incorporate reps into your downstream task. One of the easiest ways is to take only the outputs of the last Encoder layer (i.e., all_layers=False) as the input features to your downstream model; feel free to explore other mechanisms.
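As one illustration of the simplest option, the last-layer representations can be averaged over time and passed to a linear layer for utterance-level classification. This is a minimal sketch, not the repo's downstream model: MeanPoolClassifier is a hypothetical name, the zero tensor stands in for real Mockingjay output, and 768 is assumed as the feature dimension.

```python
import torch
import torch.nn as nn


class MeanPoolClassifier(nn.Module):
    """Hypothetical downstream model: average over time, then classify."""

    def __init__(self, input_dim, class_num):
        super().__init__()
        self.linear = nn.Linear(input_dim, class_num)

    def forward(self, reps):
        # reps: (batch_size, seq_len, hidden_size)
        # Note: this plain mean also averages over any padded frames.
        pooled = reps.mean(dim=1)      # (batch_size, hidden_size)
        return self.linear(pooled)     # (batch_size, class_num)


# Stand-in for reps obtained with all_layers=False, tile=True
reps = torch.zeros(3, 800, 768)
classifier = MeanPoolClassifier(input_dim=768, class_num=2)
logits = classifier(reps)
print(tuple(logits.shape))
```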
With this repo and the trained models, you can fine-tune the pre-trained Mockingjay model on your own dataset and tasks. To do so, take a look at the following example python code (example_finetune.py):
import torch
from runner_mockingjay import get_mockingjay_model
from downstream.model import example_classifier
from downstream.solver import get_mockingjay_optimizer
# setup the mockingjay model
example_path = 'result/result_mockingjay/mockingjay_libri_sd1337_MelBase/mockingjay-500000.ckpt'
solver = get_mockingjay_model(from_path=example_path)
# setup your downstream class model
# features extracted from the MelBase model have dimension 768
classifier = example_classifier(input_dim=768, hidden_dim=128, class_num=2).cuda()
# construct the Mockingjay optimizer
params = list(solver.mockingjay.named_parameters()) + list(classifier.named_parameters())
optimizer = get_mockingjay_optimizer(params=params, lr=4e-3, warmup_proportion=0.7, training_steps=50000)
# forward
example_inputs = torch.zeros(3, 800, 160) # A batch of spectrograms: (batch_size, seq_len, mel_dim)
reps = solver.forward_fine_tune(spec=example_inputs) # returns: (batch_size, seq_len, hidden_size)
loss = classifier(reps, torch.LongTensor([0, 1, 0]).cuda())
# update
loss.backward()
optimizer.step()
# save
PATH_TO_SAVE_YOUR_MODEL = 'example.ckpt'
states = {'Classifier': classifier.state_dict(), 'Mockingjay': solver.mockingjay.state_dict()}
torch.save(states, PATH_TO_SAVE_YOUR_MODEL)
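The warmup_proportion and training_steps arguments suggest a BERT-style learning rate schedule. The sketch below assumes linear warmup to the peak rate followed by linear decay to zero; this is an assumption about the shape of the schedule, not a description of what get_mockingjay_optimizer actually does, and warmup_linear_lr is a hypothetical helper:

```python
def warmup_linear_lr(step, peak_lr, warmup_proportion, training_steps):
    """Learning rate at `step` under an assumed warmup-then-decay schedule."""
    if training_steps <= 0:
        raise ValueError("training_steps must be positive")
    if not 0.0 < warmup_proportion < 1.0:
        raise ValueError("warmup_proportion must be strictly between 0 and 1")
    # Fraction of training completed, clamped to [0, 1]
    progress = min(max(step / training_steps, 0.0), 1.0)
    if progress < warmup_proportion:
        # Linear ramp from 0 up to peak_lr
        return peak_lr * progress / warmup_proportion
    # Linear decay from peak_lr down to 0
    return peak_lr * (1.0 - progress) / (1.0 - warmup_proportion)


# Same values as in the fine-tuning example above
for step in (0, 17500, 35000, 42500, 50000):
    print(step, warmup_linear_lr(step, peak_lr=4e-3,
                                 warmup_proportion=0.7, training_steps=50000))
```

Under this assumption, warmup_proportion=0.7 means the learning rate keeps rising for the first 35000 of the 50000 steps and only decays afterwards.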
- Python 3
- PyTorch 1.3.0 or above
- Computing power (a high-end GPU) and memory (both RAM and GPU memory) are extremely important if you'd like to train your own model.
- Required packages and their use are listed below, and also in requirements.txt:
editdistance # error rate calculation
joblib # parallel feature extraction & decoding
librosa # feature extraction (for feature extraction only)
pydub # audio segmentation (for MOSEI dataset preprocessing only)
pandas # data management
tensorboardX # logger & monitor
torch # model & learning
tqdm # verbosity
yaml # config parser
matplotlib # visualization
ipdb # optional debugger
numpy # array computation
scipy # for feature extraction
The above packages can be installed by the command:
pip install -r requirements.txt
Below we list packages that need special attention; we recommend installing them manually:
apex # non-essential, faster optimization (only needed if enabled in config)
sentencepiece # sub-word unit encoding (for feature extraction only, see https://github.com/google/sentencepiece#build-and-install-sentencepiece for installation instructions)
Before you start, make sure all the required packages listed above are installed correctly.
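One way to check this is to test whether each package can be located by the import system. The sketch below uses only the standard library; the helper name missing_modules is hypothetical, and the import names are assumed from the package list above (they can differ from the names used with pip):

```python
import importlib.util

# Import names assumed from the requirements list; the optional ipdb is omitted
REQUIRED_MODULES = [
    "editdistance", "joblib", "librosa", "pydub", "pandas", "tensorboardX",
    "torch", "tqdm", "yaml", "matplotlib", "numpy", "scipy",
]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

This only confirms that the packages can be found, not that their versions are compatible.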
See the Preprocess wiki page for preprocessing instructions.
All the parameters related to training/decoding are stored in a yaml file. Hyperparameter tuning and large-scale experiments can be managed easily this way. See the config files for the exact format and examples.
Once the config file is ready, run the following command to train unsupervised end-to-end Mockingjay:
python3 runner_mockingjay.py --train
All settings will be parsed from the config file automatically to start training; the log file can be accessed through TensorBoard.
Once a Mockingjay model is trained, we can use the generated representations on downstream tasks. See the Experiment section for reproducing downstream task results mentioned in our paper, and see the Highlight section for incorporating the extracted representations with your own downstream task.
Pre-trained models and their configs can be downloaded from HERE.
To load from the default path, place the models under the directory given by --ckpdir=./result_mockingjay/ and specify the model file name manually with --ckpt=.
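Before running the commands that follow, it can help to confirm that the checkpoint is where the two flags point. The sketch below assumes that the model file is looked up at the --ckpt value relative to the --ckpdir directory, which is an assumption about the runner's behavior; resolve_checkpoint is a hypothetical helper and the file name is only an example:

```python
from pathlib import Path


def resolve_checkpoint(ckpdir, ckpt):
    """Join the two flag values and fail early if the file is missing."""
    path = Path(ckpdir) / ckpt
    if not path.is_file():
        raise FileNotFoundError(f"No checkpoint found at {path}")
    return path


try:
    print(resolve_checkpoint("./result_mockingjay/", "mockingjay-500000.ckpt"))
except FileNotFoundError as err:
    print(err)
```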
Run the following commands to visualize samples generated by the model:
# visualize hidden representations
python3 runner_mockingjay.py --plot
# visualize spectrogram
python3 runner_mockingjay.py --plot --with_head
Note that the arguments --ckpdir=XXX --ckpt=XXX need to be set correctly for the above commands to run properly.
# open TensorBoard to see log
tensorboard --logdir=log/log_mockingjay/mockingjay_libri_sd1337/
# or
python3 -m tensorboard.main --logdir=log/log_mockingjay/mockingjay_libri_sd1337/
See the instructions on the Downstream wiki page to reproduce our experiments.
See the instructions on the APC wiki page to reproduce our experiments.
- Montreal Forced Aligner, McAuliffe et al.
- CMU MultimodalSDK, Amir Zadeh.
- PyTorch Transformers, Hugging Face.
- Autoregressive Predictive Coding, Yu-An Chung.
- End-to-end ASR Pytorch, Alexander-H-Liu.
- Tacotron Preprocessing, Ryuichi Yamamoto (r9y9).
@article{Liu_2020,
title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},
ISBN={9781509066315},
url={http://dx.doi.org/10.1109/ICASSP40776.2020.9054458},
DOI={10.1109/icassp40776.2020.9054458},
journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
publisher={IEEE},
author={Liu, Andy T. and Yang, Shu-wen and Chi, Po-Han and Hsu, Po-chun and Lee, Hung-yi},
year={2020},
month={May}
}