v0.4
Hey everyone,
We're releasing Ultravox 0.4 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox APIs, v0.4 is the new default.
There are two key differences between 0.3 and 0.4:
- We've upgraded the Whisper encoder from Whisper-small to Whisper-medium
- We've trained on a larger set of multilingual data. Previous versions of Ultravox were only trained on English. Supported languages are now `ar`, `de`, `en`, `es`, `fr`, `it`, `ja`, `pt`, and `ru`.
v0.4 builds upon the work in 0.3 and continues to show improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (the higher the number, the better). `ca` and `zh` are examples of model performance for languages not included in training.
| Language pair | Ultravox 0.3 | Ultravox 0.4 |
|---|---|---|
| en_ar | 9.07 | 28.07 |
| en_de | 22.67 | 25.60 |
| es_en | 24.10 | 31.03 |
| ru_en | 22.52 | 38.96 |
| en_ca | 24.87 | 27.49 |
| zh_en | 4.26 | 10.08 |
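For readers unfamiliar with BLEU, the following is a minimal sketch of how a corpus-level score on a 0-100 scale can be computed. It is not the evaluation code behind the numbers above: the function name `corpus_bleu` is hypothetical, and the sketch assumes whitespace tokenization, a single reference per hypothesis, and no smoothing, so its scores will generally differ from those of a standard tool.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis.

    Assumptions for illustration only: whitespace tokenization, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Counter intersection clips each match to the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any zero precision makes the geometric mean zero.
    if min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


perfect = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
partial = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
```

An exact match scores 100, and the score drops as fewer n-grams overlap with the reference, which is why a higher number indicates a better translation.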
This version of Ultravox continues to use a frozen Llama 3.1 8B pre-trained core, but we've roughly doubled the size of the data and the overall training time. The speech adapter was trained on ~5k hours of speech from LibriSpeech, Common Voice, People's Speech, and AnyInstruct; for comparison, 0.3 was trained on ~2.5k hours of audio. The training time on 8xH100s is roughly 170 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months.
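As a quick check on the data-scale figures in the paragraph above, the short sketch below works out the growth factor from 0.3 to 0.4 and what "1-2 orders of magnitude" would mean in hours. The variable names are illustrative, and the inputs are the approximate figures quoted in these notes, so the outputs are rough estimates rather than commitments.

```python
# Approximate training-set sizes quoted in these notes, in hours of audio.
hours_v03 = 2_500
hours_v04 = 5_000

# Growth from 0.3 to 0.4 ("roughly doubled").
growth_factor = hours_v04 / hours_v03

# "1-2 orders of magnitude" larger than the 0.4 training set.
projected_low = hours_v04 * 10
projected_high = hours_v04 * 100

print(f"0.3 -> 0.4 growth: {growth_factor:.1f}x")
print(f"Projected range: {projected_low:,} to {projected_high:,} hours")
```

Under these figures, the expected increase corresponds to somewhere between about 50k and 500k hours of speech.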
We'd love to hear feedback on your experience with Ultravox, along with feature suggestions. Roadmap coming soon.
What's Changed
- Update gradio demo to support text/voice conversation by @zqhuang211 in #75
- Offline batch inference mode by @liPatrick in #82
- Live reload for Gradio demo by @juberti in #89
- Working AutoProcessor.from_pretrained by @farzadab in #92
- Use bfloat16 by default on MPS by @juberti in #95
- Add retry and filter in ds tool by @liPatrick in #81
- Change tokenizer padding_side to left for eval by @zqhuang211 in #96
- Make v0.4 release by @zqhuang211 in #99
Full Changelog: v0.3...v0.4