中文 | English
Qwen2-Audio-7B 🤖 | 🤗 | Qwen2-Audio-7B-Instruct 🤖 | 🤗 | Demo 🤖 | 🤗
📑 Paper | 📑 Blog | 💬 WeChat (微信) | Discord
We introduce the latest progress of Qwen-Audio: Qwen2-Audio. As a large-scale audio-language model, Qwen2-Audio accepts various audio signal inputs and can perform audio analysis or respond directly with text, following voice instructions. We provide two distinct audio interaction modes: voice chat and audio analysis.

- Voice chat: users can freely engage in voice interaction with Qwen2-Audio without any text input;
- Audio analysis: users can provide audio together with text instructions for analysis during the interaction.

We have open-sourced two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.
Overview of the three-stage training process of Qwen2-Audio.
- 2024.8.9 🎉 We released the checkpoints of Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct on ModelScope and Hugging Face.
- 2024.7.15 🎉 We released the paper of Qwen2-Audio, introducing the model architecture, training methods, and performance.
- 2023.11.30 🔥 We released the Qwen-Audio series.
We evaluated the model's abilities on 13 standard academic benchmarks, as summarized below:
| Task | Description | Dataset | Split | Metric |
|---|---|---|---|---|
| ASR | Automatic Speech Recognition | Fleurs | dev / test | WER |
|  |  | Aishell2 | test |  |
|  |  | Librispeech | dev / test |  |
|  |  | Common Voice | dev / test |  |
| S2TT | Speech-to-Text Translation | CoVoST2 | test | BLEU |
| SER | Speech Emotion Recognition | Meld | test | ACC |
| VSC | Vocal Sound Classification | VocalSound | test | ACC |
| AIR-Bench | Chat-Benchmark-Speech | Fisher, SpokenWOZ, IEMOCAP, Common voice | dev / test | GPT-4 Eval |
|  | Chat-Benchmark-Sound | Clotho | dev / test | GPT-4 Eval |
|  | Chat-Benchmark-Music | MusicCaps | dev / test | GPT-4 Eval |
|  | Chat-Benchmark-Mixed-Audio | Common voice, AudioCaps, MusicCaps | dev / test | GPT-4 Eval |
The overall performance is summarized below; detailed evaluation scores follow.

(Note: the evaluation results we present are from the initial model trained in the original framework. After conversion to the Hugging Face framework, some metrics fluctuated, so we show our full set of results here, starting with the initial-model results reported in the paper.)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean / dev-other / test-clean / test-other) | SpeechT5 | WER | 2.1 / 5.5 / 2.4 / 5.8 |
|  |  | SpeechNet |  | - / - / 30.7 / - |
|  |  | SLM-FT |  | - / - / 2.6 / 5.0 |
|  |  | SALMONN |  | - / - / 2.1 / 4.9 |
|  |  | SpeechVerse |  | - / - / 2.1 / 4.4 |
|  |  | Qwen-Audio |  | 1.8 / 4.0 / 2.0 / 4.2 |
|  |  | Qwen2-Audio |  | 1.3 / 3.4 / 1.6 / 3.6 |
|  | Common Voice 15 (en / zh / yue / fr) | Whisper-large-v3 | WER | 9.3 / 12.8 / 10.9 / 10.8 |
|  |  | Qwen2-Audio |  | 8.6 / 6.9 / 5.9 / 9.6 |
|  | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
|  |  | Qwen2-Audio |  | 7.5 |
|  | Aishell2 (Mic / iOS / Android) | MMSpeech-base | WER | 4.5 / 3.9 / 4.0 |
|  |  | Paraformer-large |  | - / 2.9 / - |
|  |  | Qwen-Audio |  | 3.3 / 3.1 / 3.3 |
|  |  | Qwen2-Audio |  | 3.0 / 3.0 / 2.9 |
| S2TT | CoVoST2 (en-de / de-en / en-zh / zh-en) | SALMONN | BLEU | 18.6 / - / 33.1 / - |
|  |  | SpeechLLaMA |  | - / 27.1 / - / 12.3 |
|  |  | BLSP |  | 14.1 / - / - / - |
|  |  | Qwen-Audio |  | 25.1 / 33.9 / 41.5 / 15.7 |
|  |  | Qwen2-Audio |  | 29.9 / 35.2 / 45.2 / 24.4 |
|  | CoVoST2 (es-en / fr-en / it-en) | SpeechLLaMA | BLEU | 27.9 / 25.2 / 25.9 |
|  |  | Qwen-Audio |  | 39.7 / 38.5 / 36.0 |
|  |  | Qwen2-Audio |  | 40.0 / 38.5 / 36.3 |
| SER | Meld | WavLM-large | ACC | 0.542 |
|  |  | Qwen-Audio |  | 0.557 |
|  |  | Qwen2-Audio |  | 0.553 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
|  |  | Pengi |  | 0.6035 |
|  |  | Qwen-Audio |  | 0.9289 |
|  |  | Qwen2-Audio |  | 0.9392 |
| AIR-Bench | Chat Benchmark (Speech / Sound / Music / Mixed-Audio) | SALMONN | GPT-4 Eval | 6.16 / 6.28 / 5.95 / 6.08 |
|  |  | BLSP |  | 6.17 / 5.55 / 5.08 / 5.33 |
|  |  | Pandagpt |  | 3.58 / 5.46 / 5.06 / 4.25 |
|  |  | Macaw-LLM |  | 0.97 / 1.01 / 0.91 / 1.01 |
|  |  | SpeechGPT |  | 1.57 / 0.95 / 0.95 / 4.13 |
|  |  | Next-gpt |  | 3.86 / 4.76 / 4.18 / 4.13 |
|  |  | Qwen-Audio |  | 6.47 / 6.95 / 5.52 / 6.08 |
|  |  | Gemini-1.5-pro |  | 6.97 / 5.49 / 5.06 / 5.27 |
|  |  | Qwen2-Audio |  | 7.18 / 6.99 / 6.79 / 6.77 |
(Next, the results after conversion to Hugging Face:)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean / dev-other / test-clean / test-other) | SpeechT5 | WER | 2.1 / 5.5 / 2.4 / 5.8 |
|  |  | SpeechNet |  | - / - / 30.7 / - |
|  |  | SLM-FT |  | - / - / 2.6 / 5.0 |
|  |  | SALMONN |  | - / - / 2.1 / 4.9 |
|  |  | SpeechVerse |  | - / - / 2.1 / 4.4 |
|  |  | Qwen-Audio |  | 1.8 / 4.0 / 2.0 / 4.2 |
|  |  | Qwen2-Audio |  | 1.7 / 3.6 / 1.7 / 4.0 |
|  | Common Voice 15 (en / zh / yue / fr) | Whisper-large-v3 | WER | 9.3 / 12.8 / 10.9 / 10.8 |
|  |  | Qwen2-Audio |  | 8.7 / 6.5 / 5.9 / 9.6 |
|  | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
|  |  | Qwen2-Audio |  | 7.0 |
|  | Aishell2 (Mic / iOS / Android) | MMSpeech-base | WER | 4.5 / 3.9 / 4.0 |
|  |  | Paraformer-large |  | - / 2.9 / - |
|  |  | Qwen-Audio |  | 3.3 / 3.1 / 3.3 |
|  |  | Qwen2-Audio |  | 3.2 / 3.1 / 2.9 |
| S2TT | CoVoST2 (en-de / de-en / en-zh / zh-en) | SALMONN | BLEU | 18.6 / - / 33.1 / - |
|  |  | SpeechLLaMA |  | - / 27.1 / - / 12.3 |
|  |  | BLSP |  | 14.1 / - / - / - |
|  |  | Qwen-Audio |  | 25.1 / 33.9 / 41.5 / 15.7 |
|  |  | Qwen2-Audio |  | 29.6 / 33.6 / 45.6 / 24.0 |
|  | CoVoST2 (es-en / fr-en / it-en) | SpeechLLaMA | BLEU | 27.9 / 25.2 / 25.9 |
|  |  | Qwen-Audio |  | 39.7 / 38.5 / 36.0 |
|  |  | Qwen2-Audio |  | 38.7 / 37.2 / 35.2 |
| SER | Meld | WavLM-large | ACC | 0.542 |
|  |  | Qwen-Audio |  | 0.557 |
|  |  | Qwen2-Audio |  | 0.535 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
|  |  | Pengi |  | 0.6035 |
|  |  | Qwen-Audio |  | 0.9289 |
|  |  | Qwen2-Audio |  | 0.9395 |
| AIR-Bench | Chat Benchmark (Speech / Sound / Music / Mixed-Audio) | SALMONN | GPT-4 Eval | 6.16 / 6.28 / 5.95 / 6.08 |
|  |  | BLSP |  | 6.17 / 5.55 / 5.08 / 5.33 |
|  |  | Pandagpt |  | 3.58 / 5.46 / 5.06 / 4.25 |
|  |  | Macaw-LLM |  | 0.97 / 1.01 / 0.91 / 1.01 |
|  |  | SpeechGPT |  | 1.57 / 0.95 / 0.95 / 4.13 |
|  |  | Next-gpt |  | 3.86 / 4.76 / 4.18 / 4.13 |
|  |  | Qwen-Audio |  | 6.47 / 6.95 / 5.52 / 6.08 |
|  |  | Gemini-1.5-pro |  | 6.97 / 5.49 / 5.06 / 5.27 |
|  |  | Qwen2-Audio |  | 7.24 / 6.83 / 6.73 / 6.42 |
We provide all the evaluation scripts above so you can reproduce our results. Please read eval_audio/EVALUATION.md for more information.
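For orientation, WER here is the standard word error rate. Below is a minimal sketch of how such a number is computed, using the third-party `jiwer` package as an assumption for illustration; the released scripts in eval_audio/EVALUATION.md are the authoritative recipe:

```python
import jiwer  # third-party package: pip install jiwer

# Toy reference/hypothesis pair; real scores come from the evaluation scripts.
reference = ["mister quilter is the apostle of the middle classes"]
hypothesis = ["mister quilter is the apostle of middle classes"]

# WER = (substitutions + deletions + insertions) / number of reference words.
print(jiwer.wer(reference, hypothesis))  # ~0.11: one deletion out of nine words
```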
The code of Qwen2-Audio is included in the latest Hugging Face Transformers. We advise you to build Transformers from source with the following command, or you might encounter the error below:

```bash
pip install git+https://github.com/huggingface/transformers
```

```
KeyError: 'qwen2-audio'
```
We provide simple examples to show how to use Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct with 🤗 Transformers. Before you start, make sure you have set up your environment and installed the required packages; in particular, make sure you meet the requirements above and have installed the dependencies. You can then use our models with either Transformers or ModelScope. Note that the Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct models currently perform best on audio clips of 30 seconds or less.
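If your clips run longer than that, a simple option is to trim them before inference. A minimal sketch with librosa (the file path is a placeholder, and the 30-second cap reflects the note above rather than a hard API limit):

```python
import librosa

MAX_SECONDS = 30
SAMPLING_RATE = 16000  # matches processor.feature_extractor.sampling_rate for Qwen2-Audio

# "long_recording.wav" is a hypothetical local file used for illustration.
audio, _ = librosa.load("long_recording.wav", sr=SAMPLING_RATE)
audio = audio[: MAX_SECONDS * SAMPLING_RATE]  # keep only the first 30 seconds
```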
To use Qwen2-Audio-7B-Instruct for inference, we demonstrate the voice chat and audio analysis interaction modes in turn; all you need to write is a few lines of code, as shown below.
In voice chat mode, users can interact with Qwen2-Audio by voice alone, without any text input:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

# A voice-chat conversation: user turns contain only audio, no text.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Download each referenced clip and resample it to the rate the feature extractor expects.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele["audio_url"]).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

# Generate a reply, then strip the prompt tokens so only the new text remains.
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
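The example above streams audio from URLs; if your audio lives on disk, you can load it the same way and append it to `audios` (a minimal sketch; the path is hypothetical):

```python
# Replace the URL download with a local file; librosa resamples on load.
local_audio, _ = librosa.load("my_recording.wav", sr=processor.feature_extractor.sampling_rate)
audios.append(local_audio)
```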
In audio analysis mode, users can provide both audio and text questions for analysis of the audio:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

# An audio-analysis conversation: user turns mix audio with text instructions.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Collect every audio clip referenced in the conversation, resampled for the model.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele["audio_url"]).read()),
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
We also support batch inference:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversations = [conversation1, conversation2]

# Apply the chat template to each conversation, then batch the prompts together.
text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]

audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele["audio_url"]).read()),
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

# padding=True aligns the two prompts to a common length for batching.
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```
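Since two conversations were batched, `response` is a list with one reply per conversation, in input order; for example:

```python
for i, reply in enumerate(response):
    print(f"Conversation {i + 1}: {reply}")
```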
Running Qwen2-Audio-7B is just as simple.
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)

# The base model is prompted directly: an audio placeholder followed by a task instruction.
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=audio, return_tensors="pt")

generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
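In this scaffold, `<|audio_bos|><|AUDIO|><|audio_eos|>` marks where the audio features are injected, and the text after it is a free-form instruction. As an illustration, a different instruction can be swapped in; the wording below is our assumption, not an official task tag:

```python
# Hypothetical variation on the scaffold above; only the instruction text changes.
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Transcribe the speech into English text:"
```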
We strongly advise users, especially those in mainland China, to use ModelScope; `snapshot_download` can help you resolve issues when downloading checkpoints.
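A minimal sketch of fetching a checkpoint with ModelScope's `snapshot_download` (assuming the `modelscope` package is installed; the model id on ModelScope may differ slightly from the Hugging Face one):

```python
from modelscope import snapshot_download

# Downloads the checkpoint into the local ModelScope cache and returns its path.
model_dir = snapshot_download("Qwen/Qwen2-Audio-7B-Instruct")
print(model_dir)
```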
We provide a Web UI demo for users. Before you start, make sure you have installed the following dependencies:

```bash
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```bash
python demo/web_demo_audio.py
```
More examples will be posted to the Qwen2-Audio post on the Qwen (Tongyi Qianwen) blog.
We are the Qwen speech multimodal team, dedicated to extending Qwen with audio understanding and generation capabilities toward free and flexible audio interaction. The team is growing rapidly; if you are interested in joining us as an intern or full-time employee, please send your resume to qwen_audio@list.alibaba-inc.com.
Please check the license of each model in its Hugging Face repository. You do not need to submit a request for commercial use.
If you find our paper and code helpful for your research, please consider giving a star ⭐ and citing 📝 :)
```BibTeX
@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}

@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}
```
If you would like to leave a message for our research or product teams, please contact us by email at qianwen_opensource@alibabacloud.com.