v0.12.0
TorchAudio 0.12.0 Release Notes
Highlights
TorchAudio 0.12.0 includes the following:
- CTC beam search decoder
- New beamforming modules and methods
- Streaming API
[Beta] CTC beam search decoder
To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight. Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.
For usage details, please check out the documentation and ASR inference tutorial.
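The decoding loop can be illustrated with a toy, pure-Python beam search over a (time, tokens) grid of log-probabilities. This is a conceptual sketch only: the release's decoder is a C++ port of Flashlight with lexicon constraints and optional KenLM scoring, and a full CTC search also tracks blank and non-blank endings of each hypothesis separately, which this toy version folds into a single Viterbi-style max-merge.

```python
import math
from collections import defaultdict

def ctc_beam_search(log_probs, blank=0, beam_size=4):
    """Toy lexicon-free CTC beam search over a (time, tokens) grid of
    log-probabilities. Hypotheses with the same collapsed label sequence
    are merged with a max (Viterbi-style); a full CTC decoder instead
    sums them and tracks blank/non-blank endings separately."""
    beams = {(): 0.0}  # collapsed label sequence -> best log probability
    for frame in log_probs:
        next_beams = defaultdict(lambda: -math.inf)
        for seq, score in beams.items():
            for token, lp in enumerate(frame):
                if token == blank or (seq and seq[-1] == token):
                    new_seq = seq          # blank or repeat: labels unchanged
                else:
                    new_seq = seq + (token,)
                next_beams[new_seq] = max(next_beams[new_seq], score + lp)
        # keep only the beam_size best hypotheses
        beams = dict(sorted(next_beams.items(), key=lambda kv: -kv[1])[:beam_size])
    return max(beams.items(), key=lambda kv: kv[1])

# 3 frames x 3 tokens (0 = blank); the dominant path emits token 1, blank, token 2
log_probs = [
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.8), math.log(0.1), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
]
labels, score = ctc_beam_search(log_probs)
# labels -> (1, 2)
```

A language model plugs into such a search by adding a weighted LM score whenever a new label is appended; the lexicon variant additionally restricts which label extensions are allowed.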
[Beta] New beamforming modules and methods
To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they:
- Use power spectral density (PSD) and relative transfer function (RTF) matrices as inputs instead of time-frequency masks, so the modules can be integrated with neural networks that directly predict complex-valued STFT coefficients of speech and noise.
- Add reference_channel as an input argument to the forward method, allowing users to select the reference channel during model training or to change it dynamically at inference.
Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional.
For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.
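For intuition, the Souden MVDR solution for one frequency bin can be written as w = (Φ_nn⁻¹ Φ_ss / trace(Φ_nn⁻¹ Φ_ss)) · u, where u one-hot selects the reference channel. Below is a toy two-channel, single-bin version in plain Python; torchaudio's SoudenMVDR operates on complex multichannel spectrogram Tensors and handles the batching and numerical safeguards this sketch omits.

```python
def mat_inv2(m):
    """Inverse of a 2x2 complex matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_mul2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def souden_mvdr_weights(psd_s, psd_n, ref_channel=0):
    """w = (psd_n^-1 @ psd_s) u / trace(psd_n^-1 @ psd_s) for one bin,
    where u one-hot selects the reference channel."""
    num = mat_mul2(mat_inv2(psd_n), psd_s)
    tr = num[0][0] + num[1][1]
    return [num[0][ref_channel] / tr, num[1][ref_channel] / tr]

psd_n = [[1 + 0j, 0j], [0j, 1 + 0j]]          # identity noise PSD
psd_s = [[1 + 0j, 1 + 0j], [1 + 0j, 1 + 0j]]  # rank-1 speech PSD, steering d = [1, 1]
w = souden_mvdr_weights(psd_s, psd_n, ref_channel=0)
# w -> [(0.5+0j), (0.5+0j)]; applying w^H to x = s * d recovers s undistorted
```

The ref_channel argument here plays the same role as reference_channel in the modules above: it picks which microphone's signal the beamformer keeps distortionless.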
[Beta] Streaming API
StreamReader is TorchAudio's new I/O API. It is backed by FFmpeg† and allows users to:
- Decode various audio and video formats, including MP4 and AAC.
- Handle various input forms, such as local files, network protocols, microphones, webcams, screen captures and file-like objects.
- Iterate over and decode media chunk-by-chunk, while changing the sample rate or frame rate.
- Apply various audio and video filters, such as low-pass filter and image scaling.
- Decode video with Nvidia's hardware-based decoder (NVDEC).
For usage details, please check out the documentation and tutorials:
- Media Stream API - Pt.1
- Media Stream API - Pt.2
- Online ASR with Emformer RNN-T
- Device ASR with Emformer RNN-T
- Accelerated Video Decoding with NVDEC
† To use StreamReader, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio's official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.
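The chunk-by-chunk pattern looks roughly as follows. The torchaudio calls in the comments are a sketch based on the 0.12 StreamReader API; the runnable part simply mimics fixed-size chunk iteration over an in-memory buffer so the pattern can be followed without media files or FFmpeg installed.

```python
def iter_chunks(waveform, frames_per_chunk):
    """Yield successive fixed-size chunks, mimicking StreamReader.stream()."""
    for start in range(0, len(waveform), frames_per_chunk):
        yield waveform[start:start + frames_per_chunk]

# Real usage would look roughly like this (sketch; requires torchaudio built
# against FFmpeg 4 libraries and an actual media source):
#
#   from torchaudio.io import StreamReader
#   streamer = StreamReader("input.mp4")
#   # request 16 kHz audio delivered in chunks of 4096 frames
#   streamer.add_basic_audio_stream(frames_per_chunk=4096, sample_rate=16000)
#   for (chunk,) in streamer.stream():
#       process(chunk)  # process() is a placeholder for user code
#
samples = list(range(10))
chunks = list(iter_chunks(samples, 4))
# chunks -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```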
Backwards-incompatible changes
I/O
- MP3 decoding is now handled by FFmpeg in the sox_io backend. (#2419, #2428)
  - FFmpeg is now used as a fallback in the sox_io backend, and MP3 decoding is delegated to it. To load MP3 audio with torchaudio.load, please install a compatible version of FFmpeg (version 4 when using an official binary distribution).
  - Note that, whereas the previous MP3 decoding scheme padded the output audio, the new scheme does not; as a consequence, the new version returns shorter audio tensors. torchaudio.info now returns num_frames=0 for MP3.
Models
- Change underlying implementation of RNN-T hypothesis to tuple (#2339)
  - In release 0.11, Hypothesis subclassed namedtuple. Containers of namedtuple instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, Hypothesis has been modified in release 0.12 to instead alias tuple. This affects RNNTBeamSearch, as it accepts and returns a list of Hypothesis instances.
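The migration can be illustrated with a minimal stand-in (the field layout here is hypothetical, not torchaudio's actual Hypothesis definition): code that consumed named fields on the 0.11 NamedTuple must now unpack positionally or by index.

```python
from typing import List, NamedTuple, Tuple

class HypothesisNT(NamedTuple):       # 0.11-style: a NamedTuple subclass
    tokens: List[int]
    score: float

Hypothesis = Tuple[List[int], float]  # 0.12-style: an alias of plain tuple

def best_tokens(hyps: List[Hypothesis]) -> List[int]:
    # plain-tuple hypotheses are unpacked positionally, not by field name
    tokens, _score = max(hyps, key=lambda h: h[1])
    return tokens

hyps = [([1, 2], -0.5), ([1, 3], -0.2)]
# best_tokens(hyps) -> [1, 3]
```

Since a NamedTuple instance is itself a tuple, positional access works against both the 0.11 and 0.12 representations, which makes it the portable way to consume hypotheses.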
Bug Fixes
Ops
- Fix return dtype in MVDR module (#2376)
  - In release 0.11, the MVDR module converted the dtype of the input spectrum to complex128 to improve the precision and robustness of downstream matrix computations. The output dtype, however, was not correctly converted back to the original dtype. In release 0.12, the output dtype is made consistent with the original input dtype.
Build
- Fix Kaldi submodule integration (#2269)
- Pin jinja2 version for build_docs (#2292)
- Use sourceforge url to fetch zlib (#2297)
New Features
I/O
- Add Streaming API (#2041, #2042, #2043, #2044, #2045, #2046, #2047, #2111, #2113, #2114, #2115, #2135, #2164, #2168, #2202, #2204, #2263, #2264, #2312, #2373, #2378, #2402, #2403, #2427, #2429)
- Add YUV420P format support to Streaming API (#2334)
- Support specifying decoder and its options (#2327)
- Add NV12 format support in Streaming API (#2330)
- Add HW acceleration support on Streaming API (#2331)
- Add file-like object support to Streaming API (#2400)
- Make FFmpeg log level configurable (#2439)
- Set the default ffmpeg log level to FATAL (#2447)
Ops
- New beamforming methods (#2227, #2228, #2229, #2230, #2231, #2232, #2369, #2401)
- New MVDR modules (#2367, #2368)
- Add and refactor CTC lexicon beam search decoder (#2075, #2079, #2089, #2112, #2117, #2136, #2174, #2184, #2185, #2273, #2289)
- Add lexicon free CTC decoder (#2342)
- Add Pretrained LM Support for Decoder (#2275)
- Move CTC beam search decoder to beta (#2410)
Improvements
Ops
- Raise error for resampling int waveform (#2318)
- Move multi-channel modules to a separate file (#2382)
- Refactor MVDR module (#2383)
Models
- Add an option to use Tanh instead of ReLU in RNNT joiner (#2319)
- Support GroupNorm and re-ordering Convolution/MHA in Conformer (#2320)
- Add extra arguments to hubert pretrain factory functions (#2345)
- Add feature_grad_mult argument to HuBERTPretrainModel (#2335)
Performance
- Make PitchShift faster by caching the resampling kernel (#2441)
The following table illustrates the performance improvement over the previous release by comparing the time, in milliseconds, that torchaudio.transforms.PitchShift takes (after its first call) to process a float32 Tensor with two channels and 8000 frames, resampled to 44.1 kHz, across various shift steps.
TorchAudio Version | 2 steps | 3 steps | 4 steps | 5 steps |
---|---|---|---|---|
0.12 | 2.76 | 5 | 1860 | 223 |
0.11 | 6.71 | 161 | 8680 | 1450 |
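The speedup comes from building the resampling kernel once and reusing it on later calls. torchaudio caches the kernel on the transform instance; the sketch below uses functools.lru_cache purely to illustrate the same pattern, with a call counter standing in for the expensive construction.

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)
def resampling_kernel(orig_freq, new_freq):
    """Stand-in for the expensive kernel construction that dominated
    PitchShift's runtime before this release; cached after the first call."""
    calls["n"] += 1
    return [i / new_freq for i in range(orig_freq // 100)]

resampling_kernel(16000, 44100)  # first call builds the kernel
resampling_kernel(16000, 44100)  # second call is served from the cache
# calls["n"] -> 1
```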
Tests
- Add complex dtype support in functional autograd test (#2244)
- Refactor torchscript consistency test in functional (#2246)
- Add unit tests for PyTorch Lightning modules of emformer_rnnt recipes (#2240)
- Refactor batch consistency test in functional (#2245)
- Run smoke tests on regular PRs (#2364)
- Refactor smoke test executions (#2365)
- Move seed to setup (#2425)
- Remove possible manual seeds from test files (#2436)
Build
- Revise the parameterization of third party libraries (#2282)
- Use zlib v1.2.12 with GitHub source (#2300)
- Fix ffmpeg integration for ffmpeg 5.0 (#2326)
- Use custom FFmpeg libraries for torchaudio binary distributions (#2355)
- Adding m1 builds to torchaudio (#2421)
Other
- Add download utility specialized for torchaudio (#2283)
- Use module-level __getattr__ to implement delayed initialization (#2377)
- Update build_doc job to use Conda CUDA package (#2395)
- Update I/O initialization (#2417)
- Add Python 3.10 (build and test) (#2224)
- Retrieve version from version.txt (#2434)
- Disable OpenMP on mac (#2431)
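The module-level __getattr__ item above refers to PEP 562, which lets a package defer expensive initialization until an attribute is first accessed. A minimal, self-contained sketch of the pattern follows; the module name and attribute are illustrative, not torchaudio's actual lazily-initialized members.

```python
import sys
import types

# Build a throwaway module whose attribute is initialized on first access.
mod = types.ModuleType("lazy_demo")

def _module_getattr(name):
    if name == "heavy":
        value = sum(range(1000))   # stand-in for an expensive initialization
        setattr(mod, name, value)  # cache so __getattr__ fires only once
        return value
    raise AttributeError(name)

mod.__getattr__ = _module_getattr  # PEP 562 hook, looked up in the module dict
sys.modules["lazy_demo"] = mod

import lazy_demo
# nothing expensive has run yet; the first attribute access triggers it:
# lazy_demo.heavy -> 499500, and subsequent accesses hit the cached value
```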
Examples
Ops
- Add CTC decoder example for librispeech (#2130, #2161)
- Fix LM, arguments in CTC decoding script (#2235, #2315)
- Use pretrained LM API for decoder example (#2317)
Pipelines
- Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles (#2203)
- Refactor eval and pipeline_demo scripts in emformer_rnnt (#2238)
- Refactor pipeline_demo script in emformer_rnnt recipes (#2239)
- Add EMFORMER_RNNT_BASE_MUSTC into pipeline demo script (#2248)
Tests
- Add unit tests for Emformer RNN-T LibriSpeech recipe (#2216)
- Add fixed random seed for Emformer RNN-T recipe test (#2220)
Training recipes
- Add recipe for HuBERT model pre-training (#2143, #2198, #2296, #2310, #2311, #2412)
- Add HuBERT fine-tuning recipe (#2352)
- Refactor Emformer RNNT recipes (#2212)
- Fix bugs from Emformer RNN-T recipes merge (#2217)
- Add SentencePiece model training script for LibriSpeech Emformer RNN-T (#2218)
- Add training recipe for Emformer RNNT trained on MuST-C release v2.0 dataset (#2219)
- Refactor ArgumentParser arguments in emformer_rnnt recipes (#2236)
- Add shebang lines to scripts in emformer_rnnt recipes (#2237)
- Introduce DistributedBatchSampler (#2299)
- Add Conformer RNN-T LibriSpeech training recipe (#2329)
- Refactor LibriSpeech Conformer RNN-T recipe (#2366)
- Refactor LibriSpeech Lightning datamodule to accommodate different dataset implementations (#2437)
Prototypes
Models
- Add Conformer RNN-T model prototype (#2322)
- Add ConvEmformer module (streaming-capable Conformer) (#2324, #2358)
- Add conv_tasnet_base factory function to prototype (#2411)
Pipelines
- Add EMFORMER_RNNT_BASE_MUSTC bundle to torchaudio.prototype (#2241)
Documentation
- Add ASR CTC decoding inference tutorial (#2106)
- Update context building to not delay the inference (#2213)
- Update online ASR tutorial (#2226)
- Update CTC decoder docs and add citation (#2278)
- [Doc] fix typo and backlink (#2281)
- Fix calculation of SNR value in tutorial (#2285)
- Add notes about prototype features in tutorials (#2288)
- Update README around version compatibility matrix (#2293)
- Update decoder pretrained lm docs (#2291)
- Add devices/properties badges (#2321)
- Fix LibriMix documentation (#2351)
- Update wavernn.py (#2347)
- Add citations for datasets (#2371)
- Update audio I/O tutorials (#2385)
- Update MVDR beamforming tutorial (#2398)
- Update audio feature extraction tutorial (#2391)
- Update audio resampling tutorial (#2386)
- Update audio data augmentation tutorial (#2388)
- Add tutorial to use NVDEC with Stream API (#2393)
- Expand subsections in tutorials by default (#2397)
- Fix documentation (#2407)
- Fix documentation (#2409)
- Dataset doc fixes (#2426)
- Update CTC decoder docs (#2443)
- Split Streaming API tutorials into two (#2446)
- Update HW decoding tutorial and add notes about unseekable object (#2408)