- Text-to-Audio
- Automatic Speech Recognition(ASR)
- Speaker Verification
- Voice Conversion(VC)
- Speech Synthesis(TTS)
- Language Modelling
- Confidence Estimates
- Music Modelling
- Interesting papers

## Text-to-Audio

- AudioLM: a Language Modeling Approach to Audio Generation(2022), Zalán Borsos et al. [pdf]
- AudioLDM: Text-to-Audio Generation with Latent Diffusion Models(2023), Haohe Liu et al. [pdf]
- MusicLM: Generating Music From Text(2023), Andrea Agostinelli et al. [pdf]
- Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion(2023), Flavio Schneider et al. [pdf]
- Noise2Music: Text-conditioned Music Generation with Diffusion Models(2023), Qingqing Huang et al. [pdf]

## Automatic Speech Recognition(ASR)

- An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition(1982), S. E. Levinson et al. [pdf]
- A Maximum Likelihood Approach to Continuous Speech Recognition(1983), Lalit R. Bahl et al. [pdf]
- Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition(1986), Andrew K. Halberstadt. [pdf]
- Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition(1986), Lalit R. Bahl et al. [pdf]
- A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition(1989), Lawrence R. Rabiner. [pdf]
- Phoneme recognition using time-delay neural networks(1989), Alexander H. Waibel et al. [pdf]
- Speaker-independent phone recognition using hidden Markov models(1989), Kai-Fu Lee et al. [pdf]
- Hidden Markov Models for Speech Recognition(1991), B. H. Juang et al. [pdf]
- Review of TDNN (Time Delay Neural Network) Architectures for Speech Recognition(1991), Masahide Sugiyama et al. [pdf]
- Connectionist Speech Recognition: A Hybrid Approach(1994), Herve Bourlard et al. [pdf]
- A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)(1997), J. G. Fiscus. [pdf]
- Speech recognition with weighted finite-state transducers(2001), M. Mohri et al. [pdf]
- Framewise phoneme classification with bidirectional LSTM and other neural network architectures(2005), Alex Graves et al. [pdf]
- Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks(2006), Alex Graves et al. [pdf]
- The Kaldi speech recognition toolkit(2011), Daniel Povey et al. [pdf]
- Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition(2012), Ossama Abdel-Hamid et al. [pdf]
- Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition(2012), George E. Dahl et al. [pdf]
- Deep Neural Networks for Acoustic Modeling in Speech Recognition(2012), Geoffrey Hinton et al. [pdf]
- Sequence Transduction with Recurrent Neural Networks(2012), Alex Graves et al. [pdf]
- Deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
- Improving deep neural networks for LVCSR using rectified linear units and dropout(2013), George E. Dahl et al. [pdf]
- Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training(2013), Yajie Miao et al. [pdf]
- Improvements to deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
- Machine Learning Paradigms for Speech Recognition: An Overview(2013), Li Deng et al. [pdf]
- Recent advances in deep learning for speech research at Microsoft(2013), Li Deng et al. [pdf]
- Speech recognition with deep recurrent neural networks(2013), Alex Graves et al. [pdf]
- Convolutional deep maxout networks for phone recognition(2014), László Tóth et al. [pdf]
- Convolutional Neural Networks for Speech Recognition(2014), Ossama Abdel-Hamid et al. [pdf]
- Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition(2014), László Tóth. [pdf]
- Deep Speech: Scaling up end-to-end speech recognition(2014), Awni Y. Hannun et al. [pdf]
- End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results(2014), Jan Chorowski et al. [pdf]
- First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs(2014), Andrew L. Maas et al. [pdf]
- Long short-term memory recurrent neural network architectures for large scale acoustic modeling(2014), Hasim Sak et al. [pdf]
- Robust CNN-based speech recognition with Gabor filter kernels(2014), Shuo-Yiin Chang et al. [pdf]
- Stochastic pooling maxout networks for low-resource speech recognition(2014), Meng Cai et al. [pdf]
- Towards End-to-End Speech Recognition with Recurrent Neural Networks(2014), Alex Graves et al. [pdf]
- A neural transducer(2015), N. Jaitly et al. [pdf]
- Attention-Based Models for Speech Recognition(2015), Jan Chorowski et al. [pdf]
- Analysis of CNN-based speech recognition system using raw speech as input(2015), Dimitri Palaz et al. [pdf]
- Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks(2015), Tara N. Sainath et al. [pdf]
- Deep convolutional neural networks for acoustic modeling in low resource languages(2015), William Chan et al. [pdf]
- Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition(2015), Chao Weng et al. [pdf]
- EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding(2015), Y. Miao et al. [pdf]
- Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition(2015), Hasim Sak et al. [pdf]
- Lexicon-Free Conversational Speech Recognition with Neural Networks(2015), Andrew L. Maas et al. [pdf]
- Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification(2015), Kyuyeon Hwang et al. [pdf]
- Advances in All-Neural Speech Recognition(2016), Geoffrey Zweig et al. [pdf]
- Advances in Very Deep Convolutional Neural Networks for LVCSR(2016), Tom Sercu et al. [pdf]
- End-to-end attention-based large vocabulary speech recognition(2016), Dzmitry Bahdanau et al. [pdf]
- Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention(2016), Dong Yu et al. [pdf]
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin(2016), Dario Amodei et al. [pdf]
- End-to-end attention-based distant speech recognition with Highway LSTM(2016), Hassan Taherian. [pdf]
- Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning(2016), Suyoun Kim et al. [pdf]
- Listen, attend and spell: A neural network for large vocabulary conversational speech recognition(2016), William Chan et al. [pdf]
- Latent Sequence Decompositions(2016), William Chan et al. [pdf]
- Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks(2016), Tara N. Sainath et al. [pdf]
- Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2016), Suyoun Kim et al. [pdf]
- Segmental Recurrent Neural Networks for End-to-End Speech Recognition(2016), Liang Lu et al. [pdf]
- Towards better decoding and language model integration in sequence to sequence models(2016), Jan Chorowski et al. [pdf]
- Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition(2016), Yanmin Qian et al. [pdf]
- Very Deep Convolutional Networks for End-to-End Speech Recognition(2016), Yu Zhang et al. [pdf]
- Very deep multilingual convolutional neural networks for LVCSR(2016), Tom Sercu et al. [pdf]
- Wav2Letter: an End-to-End ConvNet-based Speech Recognition System(2016), Ronan Collobert et al. [pdf]
- Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech(2017), Michael Neumann et al. [pdf]
- An enhanced automatic speech recognition system for Arabic(2017), Mohamed Amine Menacer et al. [pdf]
- Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM(2017), Takaaki Hori et al. [pdf]
- A network of deep neural networks for distant speech recognition(2017), Mirco Ravanelli et al. [pdf]
- An online sequence-to-sequence model for noisy speech recognition(2017), Chung-Cheng Chiu et al. [pdf]
- An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems(2017), Hany Ahmed et al. [pdf]
- Attention-Based End-to-End Speech Recognition in Mandarin(2017), C. Shan et al. [pdf]
- Building DNN acoustic models for large vocabulary speech recognition(2017), Andrew L. Maas et al. [pdf]
- Direct Acoustics-to-Word Models for English Conversational Speech Recognition(2017), Kartik Audhkhasi et al. [pdf]
- Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments(2017), Zixing Zhang et al. [pdf]
- English Conversational Telephone Speech Recognition by Humans and Machines(2017), George Saon et al. [pdf]
- ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA(2017), Song Han et al. [pdf]
- Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition(2017), Chris Donahue et al. [pdf]
- Deep LSTM for Large Vocabulary Continuous Speech Recognition(2017), Xu Tian et al. [pdf]
- Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition(2017), Taesup Kim et al. [pdf]
- Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling(2017), Hairong Liu et al. [pdf]
- Improving the Performance of Online Neural Transducer Models(2017), Tara N. Sainath et al. [pdf]
- Learning Filterbanks from Raw Speech for Phone Recognition(2017), Neil Zeghidour et al. [pdf]
- Multichannel End-to-end Speech Recognition(2017), Tsubasa Ochiai et al. [pdf]
- Multi-task Learning with CTC and Segmental CRF for Speech Recognition(2017), Liang Lu et al. [pdf]
- Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition(2017), Tara N. Sainath et al. [pdf]
- Multilingual Speech Recognition With A Single End-To-End Model(2017), Shubham Toshniwal et al. [pdf]
- Optimizing expected word error rate via sampling for speech recognition(2017), Matt Shannon. [pdf]
- Residual Convolutional CTC Networks for Automatic Speech Recognition(2017), Yisen Wang et al. [pdf]
- Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition(2017), Jaeyoung Kim et al. [pdf]
- Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2017), Suyoun Kim et al. [pdf]
- Reducing Bias in Production Speech Models(2017), Eric Battenberg et al. [pdf]
- Robust Speech Recognition Using Generative Adversarial Networks(2017), Anuroop Sriram et al. [pdf]
- State-of-the-art Speech Recognition With Sequence-to-Sequence Models(2017), Chung-Cheng Chiu et al. [pdf]
- Towards Language-Universal End-to-End Speech Recognition(2017), Suyoun Kim et al. [pdf]
- Accelerating recurrent neural network language model based online speech recognition system(2018), K. Lee et al. [pdf]
- An improved hybrid CTC-Attention model for speech recognition(2018), Zhe Yuan et al. [pdf]
- Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units(2018), Zhangyu Xiao et al. [pdf]
- SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition(2019), Daniel S. Park et al. [pdf]
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations(2019), Alexei Baevski et al. [pdf]
- Effectiveness of self-supervised pre-training for speech recognition(2020), Alexei Baevski et al. [pdf]
- Improved Noisy Student Training for Automatic Speech Recognition(2020), Daniel S. Park et al. [pdf]
- ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context(2020), Wei Han et al. [pdf]
- Conformer: Convolution-augmented Transformer for Speech Recognition(2020), Anmol Gulati et al. [pdf]
- On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition(2020), Jinyu Li et al. [pdf]
- Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations(2021), Melikasadat Emami et al. [pdf]
- Efficient Training of Audio Transformers with Patchout(2021), Khaled Koutini et al. [pdf]
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition(2021), Linghui Meng et al. [pdf]
- Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition(2021), Timo Lohrenz et al. [pdf]
- SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification(2021), Helin Wang et al. [pdf]
- SpecMix: A Mixed Sample Data Augmentation method for Training with Time-Frequency Domain Features(2021), Gwantae Kim et al. [pdf]
- The History of Speech Recognition to the Year 2030(2021), Awni Hannun. [pdf]
- Voice Conversion Can Improve ASR in Very Low-Resource Settings(2021), Matthew Baas et al. [pdf]
- Why does CTC result in peaky behavior?(2021), Albert Zeyer et al. [pdf]
- E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR(2022), W. Ronny Huang et al. [pdf]
- Music Source Separation with Generative Flow(2022), Ge Zhu et al. [pdf]
- Improving Self-Supervised Speech Representations by Disentangling Speakers(2022), Kaizhi Qian et al. [pdf]
- Robust Speech Recognition via Large-Scale Weak Supervision(2022), Alec Radford et al. [pdf]
- On decoder-only architecture for speech-to-text and large language model integration(2023), Jian Wu et al. [pdf]

## Speaker Verification

- Speaker Verification Using Adapted Gaussian Mixture Models(2000), Douglas A. Reynolds et al. [pdf]
- A tutorial on text-independent speaker verification(2004), Frédéric Bimbot et al. [pdf]
- Deep neural networks for small footprint text-dependent speaker verification(2014), E. Variani et al. [pdf]
- Deep Speaker Vectors for Semi Text-independent Speaker Verification(2015), Lantian Li et al. [pdf]
- Deep Speaker: an End-to-End Neural Speaker Embedding System(2017), Chao Li et al. [pdf]
- Deep Speaker Feature Learning for Text-independent Speaker Verification(2017), Lantian Li et al. [pdf]
- Deep Speaker Verification: Do We Need End to End?(2017), Dong Wang et al. [pdf]
- Speaker Diarization with LSTM(2017), Quan Wang et al. [pdf]
- Text-Independent Speaker Verification Using 3D Convolutional Neural Networks(2017), Amirsina Torfi et al. [pdf]
- End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances(2017), Chunlei Zhang et al. [pdf]
- Deep Neural Network Embeddings for Text-Independent Speaker Verification(2017), David Snyder et al. [pdf]
- Deep Discriminative Embeddings for Duration Robust Speaker Verification(2018), Na Li et al. [pdf]
- Learning Discriminative Features for Speaker Identification and Verification(2018), Sarthak Yadav et al. [pdf]
- Large Margin Softmax Loss for Speaker Verification(2019), Yi Liu et al. [pdf]
- Unsupervised feature enhancement for speaker verification(2019), Phani Sankar Nidadavolu et al. [pdf]
- Feature enhancement with deep feature losses for speaker verification(2019), Saurabh Kataria et al. [pdf]
- Generalized End-to-End Loss for Speaker Verification(2019), Li Wan et al. [pdf]
- Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification(2019), Youngmoon Jung et al. [pdf]
- VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge(2019), Joon Son Chung et al. [pdf]
- BUT System Description to VoxCeleb Speaker Recognition Challenge 2019(2019), Hossein Zeinali et al. [pdf]
- The ID R&D System Description for Short-duration Speaker Verification Challenge 2021(2021), Alenin et al. [pdf]

## Voice Conversion(VC)

- Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks(2015), Lifa Sun et al. [pdf]
- Phonetic posteriorgrams for many-to-one voice conversion without parallel data training(2016), Lifa Sun et al. [pdf]
- StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks(2018), Hirokazu Kameoka et al. [pdf]
- AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss(2019), Kaizhi Qian et al. [pdf]
- StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion(2019), Takuhiro Kaneko et al. [pdf]
- Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion(2019), Andy T. Liu et al. [pdf]
- Attention-Based Speaker Embeddings for One-Shot Voice Conversion(2020), Tatsuma Ishihara et al. [pdf]
- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder(2020), Kaizhi Qian et al. [pdf]
- Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning(2020), Jing-Xuan Zhang et al. [pdf]
- An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation(2021), Xiangheng He et al. [pdf]
- crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder(2021), Kazuhiro Kobayashi et al. [pdf]
- CVC: Contrastive Learning for Non-parallel Voice Conversion(2021), Tingle Li et al. [pdf]
- NoiseVC: Towards High Quality Zero-Shot Voice Conversion(2021), Shijun Wang et al. [pdf]
- On Prosody Modeling for ASR+TTS based Voice Conversion(2021), Wen-Chin Huang et al. [pdf]
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion(2021), Yinghao Aaron Li et al. [pdf]
- Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning(2021), Shijun Wang et al. [pdf]

## Speech Synthesis(TTS)

- Signal estimation from modified short-time Fourier transform(1984), Daniel W. Griffin et al. [pdf]
- Text-to-speech synthesis(2009), Paul Taylor. [pdf]
- A fast Griffin-Lim algorithm(2013), Nathanael Perraudin et al. [pdf]
- TTS synthesis with bidirectional LSTM based recurrent neural networks(2014), Yuchen Fan et al. [pdf]
- First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention(2016), Wenfu Wang et al. [pdf]
- Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer(2016), Xavi Gonzalvo et al. [pdf]
- SampleRNN: An Unconditional End-to-End Neural Audio Generation Model(2016), Soroush Mehri et al. [pdf]
- WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]
- Char2Wav: End-to-end speech synthesis(2017), J. Sotelo et al. [pdf]
- Deep Voice: Real-time Neural Text-to-Speech(2017), Sercan O. Arik et al. [pdf]
- Deep Voice 2: Multi-Speaker Neural Text-to-Speech(2017), Sercan Arik et al. [pdf]
- Deep Voice 3: 2000-Speaker Neural Text-to-speech(2017), Wei Ping et al. [pdf]
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions(2017), Jonathan Shen et al. [pdf]
- Parallel WaveNet: Fast High-Fidelity Speech Synthesis(2017), Aaron van den Oord et al. [pdf]
- Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework(2017), S. Yang et al. [pdf]
- Tacotron: Towards End-to-End Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
- Uncovering Latent Style Factors for Expressive Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
- VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop(2017), Yaniv Taigman et al. [pdf]
- ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech(2018), Wei Ping et al. [pdf]
- Deep Feed-forward Sequential Memory Networks for Speech Synthesis(2018), Mengxiao Bi et al. [pdf]
- LPCNet: Improving Neural Speech Synthesis Through Linear Prediction(2018), Jean-Marc Valin et al. [pdf]
- Learning latent representations for style control and transfer in end-to-end speech synthesis(2018), Ya-Jie Zhang et al. [pdf]
- Neural Voice Cloning with a Few Samples(2018), Sercan O. Arık et al. [pdf]
- Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis(2018), Daisy Stanton et al. [pdf]
- Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis(2018), Y. Wang et al. [pdf]
- Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron(2018), RJ Skerry-Ryan et al. [pdf]
- DurIAN: Duration Informed Attention Network For Multimodal Synthesis(2019), Chengzhu Yu et al. [pdf]
- Fast spectrogram inversion using multi-head convolutional neural networks(2019), S. Ö. Arık et al. [pdf]
- FastSpeech: Fast, Robust and Controllable Text to Speech(2019), Yi Ren et al. [pdf]
- Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning(2019), Yu Zhang et al. [pdf]
- MelNet: A Generative Model for Audio in the Frequency Domain(2019), Sean Vasquez et al. [pdf]
- Multi-Speaker End-to-End Speech Synthesis(2019), Jihyun Park et al. [pdf]
- MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis(2019), Kundan Kumar et al. [pdf]
- Neural Speech Synthesis with Transformer Network(2019), Naihan Li et al. [pdf]
- Parallel Neural Text-to-Speech(2019), Kainan Peng et al. [pdf]
- Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis(2019), Bing Yang et al. [pdf]
- Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram(2019), Ryuichi Yamamoto et al. [pdf] Note: this came out around the same time as MelGAN, yet neither paper cites the other. Arguably the Gaussian noise input is unnecessary, since the mel spectrogram already carries most of the information.
- Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN(2019), David Alvarez et al. [pdf]
- Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS(2019), Mutian He et al. [pdf]
- Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models(2019), Wei Fang et al. [pdf]
- Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis(2019), Ye Jia et al. [pdf]
- WaveFlow: A Compact Flow-based Model for Raw Audio(2019), Wei Ping et al. [pdf]
- WaveGlow: A flow-based generative network for speech synthesis(2019), R. Prenger et al. [pdf]
- AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment(2020), Zhen Zeng et al. [pdf]
- BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization(2020), Henry B. Moss et al. [pdf]
- Bunched LPCNet: Vocoder for Low-cost Neural Text-To-Speech Systems(2020), Ravichander Vipperla et al. [pdf]
- CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech(2020), Sri Karlapati et al. [pdf]
- EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture(2020), Chenfeng Miao et al. [pdf]
- End-to-End Adversarial Text-to-Speech(2020), Jeff Donahue et al. [pdf]
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech(2020), Yi Ren et al. [pdf]
- Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis(2020), Rafael Valle et al. [pdf]
- Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow(2020), Chenfeng Miao et al. [pdf]
- Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis(2020), Guangzhi Sun et al. [pdf]
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior(2020), Guangzhi Sun et al. [pdf]
- Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search(2020), Jaehyeon Kim et al. [pdf]
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis(2020), Jungil Kong et al. [pdf]
- Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis(2020), Eric Battenberg et al. [pdf]
- MultiSpeech: Multi-Speaker Text to Speech with Transformer(2020), Mingjian Chen et al. [pdf]
- Parallel Tacotron: Non-Autoregressive and Controllable TTS(2020), Isaac Elias et al. [pdf]
- RobuTrans: A Robust Transformer-Based Text-to-Speech Model(2020), Naihan Li et al. [pdf]
- Text-Independent Speaker Verification with Dual Attention Network(2020), Jingyu Li et al. [pdf]
- WaveGrad: Estimating Gradients for Waveform Generation(2020), Nanxin Chen et al. [pdf]
- AdaSpeech: Adaptive Text to Speech for Custom Voice(2021), Mingjian Chen et al. [pdf]
- A Survey on Neural Speech Synthesis(2021), Xu Tan et al. [pdf]
- A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate(2021), Ahmed Mustafa et al. [pdf]
- Controllable cross-speaker emotion transfer for end-to-end speech synthesis(2021), Tao Li et al. [pdf]
- Cloning one's voice using very limited data in the wild(2021), Dongyang Dai et al. [pdf]
- Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech(2021), Jaehyeon Kim et al. [pdf]
- DiffWave: A Versatile Diffusion Model for Audio Synthesis(2021), Zhifeng Kong et al. [pdf]
- Diff-TTS: A Denoising Diffusion Model for Text-to-Speech(2021), Myeonghun Jeong et al. [pdf]
- DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021(2021), Yanqing Liu et al. [pdf]
- Fre-GAN: Adversarial Frequency-consistent Audio Synthesis(2021), Ji-Hoon Kim et al. [pdf]
- Full-band LPCNet: A real-time neural vocoder for 48 kHz audio with a CPU(2021), Keisuke Matsubara et al. [pdf]
- Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech(2021), Vadim Popov et al. [pdf]
- Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis(2021), Jian Cong et al. [pdf]
- High-fidelity and low-latency universal neural vocoder based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling(2021), Patrick Lumban Tobing et al. [pdf]
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis(2021), Chung-Ming Chien et al. [pdf]
- ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation(2021), Shoule Wu et al. [pdf]
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech(2021), Dan Lim et al. [pdf]
- Meta-Voice: Fast Few-shot Style Transfer for Expressive Voice Cloning Using Meta Learning(2021), Songxiang Liu et al. [pdf]
- Neural HMMs are all you need (for high-quality attention-free TTS)(2021), Shivam Mehta et al. [pdf]
- Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet(2021), Max Morrison et al. [pdf]
- One TTS Alignment To Rule Them All(2021), Rohan Badlani et al. [pdf]
- KaraTuner: Towards end to end natural pitch correction for singing voice in karaoke(2021), Xiaobin Zhuang et al. [pdf]
- PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS(2021), Ye Jia et al. [pdf]
- Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling(2021), Isaac Elias et al. [pdf]
- PortaSpeech: Portable and High-Quality Generative Text-to-Speech(2021), Yi Ren et al. [pdf]
- Transformer-based Acoustic Modeling for Streaming Speech Synthesis(2021), Chunyang Wu et al. [pdf]
- Triple M: A Practical Neural Text-to-speech System With Multi-guidance Attention And Multi-band Multi-time LPCNet(2021), Shilun Lin et al. [pdf]
- TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction(2021), Stanislav Beliaev et al. [pdf] Note: TalkNet 2 differs only slightly from the original TalkNet, so TalkNet is not listed separately.
- Towards Multi-Scale Style Control for Expressive Speech Synthesis(2021), Xiang Li et al. [pdf]
- Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN(2021), Reo Yoneyama et al. [pdf]
- YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone(2021), Edresson Casanova et al. [pdf]
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder(2022), Taejun Bak et al. [pdf]
- Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech(2022), Byoung Jin Choi et al. [pdf]
- Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge(2022), Sangjun Park et al. [pdf]
- Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation(2022), Ryo Terashima et al. [pdf]
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis(2022), Rongjie Huang et al. [pdf]
- Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU(2022), Ivan Vovk et al. [pdf]
- Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion(2022), Yi Lei et al. [pdf]
- HiFi++: a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement(2022), Pavel Andreev et al. [pdf]
- IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion(2022), Wendong Gan et al. [pdf]
- iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform(2022), Takuhiro Kaneko et al. [pdf]
- Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform(2022), Masaya Kawamura et al. [pdf]
- Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet(2022), Jean-Marc Valin et al. [pdf]
- NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis(2022), Hyeong-Seok Choi et al. [pdf]
- PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior(2022), Sang-gil Lee et al. [pdf]
- PromptTTS: Controllable Text-to-Speech with Text Descriptions(2022), Zhifang Guo et al. [pdf]
- SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech(2022), Hyunjae Cho et al. [pdf]
- STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency(2022), Zhong-Qiu Wang et al. [pdf]
- Simple and Effective Unsupervised Speech Synthesis(2022), Alexander H. Liu et al. [pdf]
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping(2022), Yuma Koizumi et al. [pdf]
- Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder(2022), Reo Yoneyama et al. [pdf]
- TriniTTS: Pitch-controllable End-to-end TTS without External Aligner(2022), Yoon-Cheol Ju et al. [pdf]
- Zero-Shot Cross-Lingual Transfer Using Multi-Stream Encoder and Efficient Speaker Representation(2022), Yibin Zheng et al. [pdf]
- InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt(2023), Dongchao Yang et al. [pdf]
- Matcha-TTS: A fast TTS architecture with conditional flow matching(2023), Shivam Mehta et al. [pdf]
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias(2023), Ziyue Jiang et al. [pdf]
- Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts(2023), Ziyue Jiang et al. [pdf]

## Language Modelling

- Class-Based n-gram Models of Natural Language(1992), Peter F. Brown et al. [pdf]
- An empirical study of smoothing techniques for language modeling(1996), Stanley F. Chen et al. [pdf]
- A Neural Probabilistic Language Model(2000), Yoshua Bengio et al. [pdf]
- A new statistical approach to Chinese Pinyin input(2000), Zheng Chen et al. [pdf]
- Discriminative n-gram language modeling(2007), Brian Roark et al. [pdf]
- Neural Network Language Model for Chinese Pinyin Input Method Engine(2015), S. Chen et al. [pdf]
- Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition(2016), Xie Chen et al. [pdf]
- Exploring the limits of language modeling(2016), R. Jozefowicz et al. [pdf]
- On the State of the Art of Evaluation in Neural Language Models(2016), G. Melis et al. [pdf]
- Pay Less Attention with Lightweight and Dynamic Convolutions(2019), Felix Wu et al. [pdf]

## Confidence Estimates

- Estimating Confidence using Word Lattices(1997), T. Kemp et al. [pdf]
- Large vocabulary decoding and confidence estimation using word posterior probabilities(2000), G. Evermann et al. [pdf]
- Combining Information Sources for Confidence Estimation with CRF Models(2011), M. S. Seigel et al. [pdf]
- Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks(2018), M. Á. Del-Agua et al. [pdf]
- Bi-Directional Lattice Recurrent Neural Networks for Confidence Estimation(2018), Q. Li et al. [pdf]
- Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks(2020), A. Kastanos et al. [pdf]
- Confidence Estimation for Attention-Based Sequence-to-Sequence Models for Speech Recognition(2020), Qiujia Li et al. [pdf]
- Residual Energy-Based Models for End-to-End Speech Recognition(2021), Qiujia Li et al. [pdf]
- Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction(2021), David Qiu et al. [pdf]

## Music Modelling

- Onsets and Frames: Dual-Objective Piano Transcription(2017), Curtis Hawthorne et al. [pdf]
- Unsupervised Singing Voice Conversion(2019), Eliya Nachmani et al. [pdf]
- ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders(2020), Yu Gu et al. [pdf]
- DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System(2020), Liqiang Zhang et al. [pdf]
- HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis(2020), Jiawei Chen et al. [pdf]
- Jukebox: A Generative Model for Music(2020), Prafulla Dhariwal et al. [pdf]
- DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism(2021), Jinglin Liu et al. [pdf]
- MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis(2021), Jaesung Tae et al. [pdf]
- Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus(2021), Rongjie Huang et al. [pdf]
- MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training(2021), Mingliang Zeng et al. [pdf]
- N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement(2021), Gyeong-Hoon Lee et al. [pdf]
- Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech(2021), Raahil Shah et al. [pdf]
- PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components(2021), Yukiya Hono et al. [pdf]
- Sequence-to-Sequence Piano Transcription with Transformers(2021), Curtis Hawthorne et al. [pdf]
- M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus(2022), Lichao Zhang et al. [pdf]
- Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis(2022), Yu Wang et al. [pdf]
- WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses(2022), Zewang Zhang et al. [pdf]
- WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training(2022), Zewang Zhang et al. [pdf]

## Interesting papers

- The Reversible Residual Network: Backpropagation Without Storing Activations(2017), Aidan N. Gomez et al. [pdf]
- Soft-DTW: a Differentiable Loss Function for Time-Series(2018), Marco Cuturi et al. [pdf]
- FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow(2019), Xuezhe Ma et al. [pdf]
- Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks(2019), Santiago Pascual et al. [pdf]
- Self-supervised audio representation learning for mobile devices(2019), Marco Tagliasacchi et al. [pdf]
- SinGAN: Learning a Generative Model from a Single Natural Image(2019), Tamar Rott Shaham et al. [pdf]
- Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks(2019), Guanzhong Tian et al. [pdf]
- Attention is Not Only a Weight: Analyzing Transformers with Vector Norms(2020), Goro Kobayashi et al. [pdf]