v0.4

@zkoch zkoch released this 27 Aug 01:12
b649b9f

Hey everyone,

We're releasing Ultravox 0.4 today. The weights have been pushed to Hugging Face (along with updated datasets for training). If you're using the Ultravox APIs, v0.4 is the new default.

There are two key differences between 0.3 and 0.4:

  • We've upgraded the Whisper encoder from Whisper-small to Whisper-medium
  • We've trained on a larger set of multilingual data. Previous versions of Ultravox were trained only on English. Supported languages are now ar, de, en, es, fr, it, ja, pt, ru.

v0.4 builds upon the work in 0.3 and continues to show improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (higher is better). The en_ca and zh_en rows show model performance on languages (ca and zh) that were not included in training.

Language pair    Ultravox 0.3    Ultravox 0.4
en_ar                    9.07           28.07
en_de                   22.67           25.60
es_en                   24.10           31.03
ru_en                   22.52           38.96
en_ca                   24.87           27.49
zh_en                    4.26           10.08
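
For readers unfamiliar with the metric, here is a minimal sketch of how a corpus-level BLEU score on the same 0-100 scale can be computed. It is an illustration only, not the evaluation code behind the numbers above: the function name `corpus_bleu` is hypothetical, and the sketch assumes whitespace tokenization, a single reference per hypothesis, and no smoothing. Established BLEU tooling handles tokenization and edge cases more carefully.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Illustrative corpus-level BLEU (0-100), one reference per hypothesis.

    Assumptions: whitespace tokenization, no smoothing. Hypothetical helper,
    not the evaluation script used for the table above.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat today"]
    refs = ["the cat sat on the mat yesterday"]
    print(f"BLEU: {corpus_bleu(hyps, refs):.2f}")
```

A hypothesis identical to its reference scores 100, and one sharing no 4-grams with its reference scores 0 under this unsmoothed variant.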

This version of Ultravox continues to use a frozen Llama 3.1 8B pre-trained core, but we've roughly doubled the size of the data and the overall training time. The speech adapter was trained on ~5k hours of speech from LibriSpeech, Common Voice, People's Speech, and AnyInstruct, up from ~2.5k hours of audio for 0.3. Training takes roughly 170 minutes on 8xH100s. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months.
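
As a rough back-of-the-envelope check on the figures above, the short sketch below converts the quoted training time into GPU-hours and computes the growth in training data. The constants are the approximate values from this release note, and the helper names are hypothetical, introduced only for illustration.

```python
# Approximate figures quoted in this release note.
NUM_GPUS = 8           # 8xH100s
TRAIN_MINUTES = 170    # "roughly 170 minutes"
AUDIO_HOURS_V03 = 2_500  # "~2.5k hours" for 0.3
AUDIO_HOURS_V04 = 5_000  # "~5k hours" for 0.4


def gpu_hours(num_gpus, minutes):
    """Total GPU-hours for a run of `minutes` wall-clock time on `num_gpus` GPUs."""
    return num_gpus * minutes / 60


def data_growth(old_hours, new_hours):
    """Ratio of new training-set size to old training-set size."""
    return new_hours / old_hours


if __name__ == "__main__":
    print(f"GPU-hours: {gpu_hours(NUM_GPUS, TRAIN_MINUTES):.1f}")
    print(f"Data growth: {data_growth(AUDIO_HOURS_V03, AUDIO_HOURS_V04):.1f}x")
```

With the quoted numbers this works out to a little under 23 H100 GPU-hours and a 2x increase in audio data, consistent with "roughly doubled".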

We'd love to hear feedback on your experience with Ultravox, along with feature suggestions. Roadmap coming soon.

What's Changed

Full Changelog: v0.3...v0.4