中文 | English
Qwen2-Audio-7B 🤖 | 🤗 | Qwen2-Audio-7B-Instruct 🤖 | 🤗 | Demo 🤖 | 🤗
📑 Paper | 📑 Blog | 💬 WeChat (微信) | Discord
We introduce the latest progress of Qwen-Audio: Qwen2-Audio. As a large-scale audio-language model, Qwen2-Audio accepts various audio signal inputs and can perform audio analysis or respond directly with text, following voice instructions. We provide two distinct audio interaction modes: voice chat and audio analysis.

- Voice chat: users can freely engage in voice interaction with Qwen2-Audio without any text input;
- Audio analysis: users can provide audio together with text instructions for analysis during the interaction.

We have open-sourced two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.
Overview of the three-stage training process of Qwen2-Audio.
- 2024.8.9 🎉 We released the checkpoints of Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct on ModelScope and Hugging Face.
- 2024.7.15 🎉 We released the paper of Qwen2-Audio, introducing the model architecture, training methods, and performance.
- 2023.11.30 🔥 We released the Qwen-Audio series.
We evaluated the model's abilities on 13 standard academic benchmarks, as summarized below:
| Task | Description | Dataset | Split | Metric |
|---|---|---|---|---|
| ASR | Automatic Speech Recognition | Fleurs | dev / test | WER |
|  |  | Aishell2 | test |  |
|  |  | Librispeech | dev / test |  |
|  |  | Common Voice | dev / test |  |
| S2TT | Speech-to-Text Translation | CoVoST2 | test | BLEU |
| SER | Speech Emotion Recognition | Meld | test | ACC |
| VSC | Vocal Sound Classification | VocalSound | test | ACC |
| AIR-Bench | Chat-Benchmark-Speech | Fisher, SpokenWOZ, IEMOCAP, Common voice | dev / test | GPT-4 Eval |
|  | Chat-Benchmark-Sound | Clotho | dev / test | GPT-4 Eval |
|  | Chat-Benchmark-Music | MusicCaps | dev / test | GPT-4 Eval |
|  | Chat-Benchmark-Mixed-Audio | Common voice, AudioCaps, MusicCaps | dev / test | GPT-4 Eval |
The overall performance is summarized below; detailed evaluation scores follow.

(Note: the evaluation results we present are from the initial model trained in the original framework. After conversion to the Hugging Face framework, some metrics fluctuated, so we show our full set of results here, starting with the initial-model results reported in the paper.)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean / dev-other / test-clean / test-other) | SpeechT5 | WER | 2.1 / 5.5 / 2.4 / 5.8 |
|  |  | SpeechNet |  | - / - / 30.7 / - |
|  |  | SLM-FT |  | - / - / 2.6 / 5.0 |
|  |  | SALMONN |  | - / - / 2.1 / 4.9 |
|  |  | SpeechVerse |  | - / - / 2.1 / 4.4 |
|  |  | Qwen-Audio |  | 1.8 / 4.0 / 2.0 / 4.2 |
|  |  | Qwen2-Audio |  | 1.3 / 3.4 / 1.6 / 3.6 |
|  | Common Voice 15 (en / zh / yue / fr) | Whisper-large-v3 | WER | 9.3 / 12.8 / 10.9 / 10.8 |
|  |  | Qwen2-Audio |  | 8.6 / 6.9 / 5.9 / 9.6 |
|  | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
|  |  | Qwen2-Audio |  | 7.5 |
|  | Aishell2 (Mic / iOS / Android) | MMSpeech-base | WER | 4.5 / 3.9 / 4.0 |
|  |  | Paraformer-large |  | - / 2.9 / - |
|  |  | Qwen-Audio |  | 3.3 / 3.1 / 3.3 |
|  |  | Qwen2-Audio |  | 3.0 / 3.0 / 2.9 |
| S2TT | CoVoST2 (en-de / de-en / en-zh / zh-en) | SALMONN | BLEU | 18.6 / - / 33.1 / - |
|  |  | SpeechLLaMA |  | - / 27.1 / - / 12.3 |
|  |  | BLSP |  | 14.1 / - / - / - |
|  |  | Qwen-Audio |  | 25.1 / 33.9 / 41.5 / 15.7 |
|  |  | Qwen2-Audio |  | 29.9 / 35.2 / 45.2 / 24.4 |
|  | CoVoST2 (es-en / fr-en / it-en) | SpeechLLaMA | BLEU | 27.9 / 25.2 / 25.9 |
|  |  | Qwen-Audio |  | 39.7 / 38.5 / 36.0 |
|  |  | Qwen2-Audio |  | 40.0 / 38.5 / 36.3 |
| SER | Meld | WavLM-large | ACC | 0.542 |
|  |  | Qwen-Audio |  | 0.557 |
|  |  | Qwen2-Audio |  | 0.553 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
|  |  | Pengi |  | 0.6035 |
|  |  | Qwen-Audio |  | 0.9289 |
|  |  | Qwen2-Audio |  | 0.9392 |
| AIR-Bench | Chat Benchmark (Speech / Sound / Music / Mixed-Audio) | SALMONN | GPT-4 Eval | 6.16 / 6.28 / 5.95 / 6.08 |
|  |  | BLSP |  | 6.17 / 5.55 / 5.08 / 5.33 |
|  |  | Pandagpt |  | 3.58 / 5.46 / 5.06 / 4.25 |
|  |  | Macaw-LLM |  | 0.97 / 1.01 / 0.91 / 1.01 |
|  |  | SpeechGPT |  | 1.57 / 0.95 / 0.95 / 4.13 |
|  |  | Next-gpt |  | 3.86 / 4.76 / 4.18 / 4.13 |
|  |  | Qwen-Audio |  | 6.47 / 6.95 / 5.52 / 6.08 |
|  |  | Gemini-1.5-pro |  | 6.97 / 5.49 / 5.06 / 5.27 |
|  |  | Qwen2-Audio |  | 7.18 / 6.99 / 6.79 / 6.77 |
(Next, the results after conversion to Hugging Face:)
| Task | Dataset | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean / dev-other / test-clean / test-other) | SpeechT5 | WER | 2.1 / 5.5 / 2.4 / 5.8 |
|  |  | SpeechNet |  | - / - / 30.7 / - |
|  |  | SLM-FT |  | - / - / 2.6 / 5.0 |
|  |  | SALMONN |  | - / - / 2.1 / 4.9 |
|  |  | SpeechVerse |  | - / - / 2.1 / 4.4 |
|  |  | Qwen-Audio |  | 1.8 / 4.0 / 2.0 / 4.2 |
|  |  | Qwen2-Audio |  | 1.7 / 3.6 / 1.7 / 4.0 |
|  | Common Voice 15 (en / zh / yue / fr) | Whisper-large-v3 | WER | 9.3 / 12.8 / 10.9 / 10.8 |
|  |  | Qwen2-Audio |  | 8.7 / 6.5 / 5.9 / 9.6 |
|  | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
|  |  | Qwen2-Audio |  | 7.0 |
|  | Aishell2 (Mic / iOS / Android) | MMSpeech-base | WER | 4.5 / 3.9 / 4.0 |
|  |  | Paraformer-large |  | - / 2.9 / - |
|  |  | Qwen-Audio |  | 3.3 / 3.1 / 3.3 |
|  |  | Qwen2-Audio |  | 3.2 / 3.1 / 2.9 |
| S2TT | CoVoST2 (en-de / de-en / en-zh / zh-en) | SALMONN | BLEU | 18.6 / - / 33.1 / - |
|  |  | SpeechLLaMA |  | - / 27.1 / - / 12.3 |
|  |  | BLSP |  | 14.1 / - / - / - |
|  |  | Qwen-Audio |  | 25.1 / 33.9 / 41.5 / 15.7 |
|  |  | Qwen2-Audio |  | 29.6 / 33.6 / 45.6 / 24.0 |
|  | CoVoST2 (es-en / fr-en / it-en) | SpeechLLaMA | BLEU | 27.9 / 25.2 / 25.9 |
|  |  | Qwen-Audio |  | 39.7 / 38.5 / 36.0 |
|  |  | Qwen2-Audio |  | 38.7 / 37.2 / 35.2 |
| SER | Meld | WavLM-large | ACC | 0.542 |
|  |  | Qwen-Audio |  | 0.557 |
|  |  | Qwen2-Audio |  | 0.535 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
|  |  | Pengi |  | 0.6035 |
|  |  | Qwen-Audio |  | 0.9289 |
|  |  | Qwen2-Audio |  | 0.9395 |
| AIR-Bench | Chat Benchmark (Speech / Sound / Music / Mixed-Audio) | SALMONN | GPT-4 Eval | 6.16 / 6.28 / 5.95 / 6.08 |
|  |  | BLSP |  | 6.17 / 5.55 / 5.08 / 5.33 |
|  |  | Pandagpt |  | 3.58 / 5.46 / 5.06 / 4.25 |
|  |  | Macaw-LLM |  | 0.97 / 1.01 / 0.91 / 1.01 |
|  |  | SpeechGPT |  | 1.57 / 0.95 / 0.95 / 4.13 |
|  |  | Next-gpt |  | 3.86 / 4.76 / 4.18 / 4.13 |
|  |  | Qwen-Audio |  | 6.47 / 6.95 / 5.52 / 6.08 |
|  |  | Gemini-1.5-pro |  | 6.97 / 5.49 / 5.06 / 5.27 |
|  |  | Qwen2-Audio |  | 7.24 / 6.83 / 6.73 / 6.42 |
We provide all the evaluation scripts above so you can reproduce our results. Please read eval_audio/EVALUATION.md for more information.
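For orientation, WER here is the standard word error rate. Below is a minimal sketch of how such a number is computed, using the third-party `jiwer` package as an assumption for illustration; the released scripts in eval_audio/EVALUATION.md are the authoritative recipe:

```python
import jiwer  # third-party package: pip install jiwer

# Toy reference/hypothesis pair; real scores come from the evaluation scripts.
reference = ["mister quilter is the apostle of the middle classes"]
hypothesis = ["mister quilter is the apostle of middle classes"]

# WER = (substitutions + deletions + insertions) / number of reference words.
print(jiwer.wer(reference, hypothesis))  # ~0.11: one deletion out of nine words
```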
The code of Qwen2-Audio is included in the latest Hugging Face Transformers. We advise you to build Transformers from source with the following command, or you might encounter the error below:

```bash
pip install git+https://github.com/huggingface/transformers
```

```
KeyError: 'qwen2-audio'
```
We provide simple examples to show how to use Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct with 🤗 Transformers. Before you start, make sure you have set up your environment and installed the required packages; in particular, make sure you meet the requirements above and have installed the dependencies. You can then use our models with either Transformers or ModelScope. Note that the Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct models currently perform best on audio clips of 30 seconds or less.
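If your clips run longer than that, a simple option is to trim them before inference. A minimal sketch with librosa (the file path is a placeholder, and the 30-second cap reflects the note above rather than a hard API limit):

```python
import librosa

MAX_SECONDS = 30
SAMPLING_RATE = 16000  # matches processor.feature_extractor.sampling_rate for Qwen2-Audio

# "long_recording.wav" is a hypothetical local file used for illustration.
audio, _ = librosa.load("long_recording.wav", sr=SAMPLING_RATE)
audio = audio[: MAX_SECONDS * SAMPLING_RATE]  # keep only the first 30 seconds
```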
To use Qwen2-Audio-7B-Instruct for inference, we demonstrate the voice chat and audio analysis interaction modes in turn; all you need to write is a few lines of code, as shown below.
In voice chat mode, users can interact with Qwen2-Audio by voice alone, without any text input:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

# A voice-chat conversation: user turns contain only audio, no text.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Download each referenced clip and resample it to the rate the feature extractor expects.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele["audio_url"]).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

# Generate a reply, then strip the prompt tokens so only the new text remains.
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
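The example above streams audio from URLs; if your audio lives on disk, you can load it the same way and append it to `audios` (a minimal sketch; the path is hypothetical):

```python
# Replace the URL download with a local file; librosa resamples on load.
local_audio, _ = librosa.load("my_recording.wav", sr=processor.feature_extractor.sampling_rate)
audios.append(local_audio)
```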
In audio analysis mode, users can provide both audio and text questions for analysis of the audio:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

# An audio-analysis conversation: user turns mix audio with text instructions.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Collect every audio clip referenced in the conversation, resampled for the model.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele["audio_url"]).read()),
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
We also support batch inference:
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversations = [conversation1, conversation2]

# Apply the chat template to each conversation, then batch the prompts together.
text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]

audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele["audio_url"]).read()),
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

# padding=True aligns the two prompts to a common length for batching.
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```
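Since two conversations were batched, `response` is a list with one reply per conversation, in input order; for example:

```python
for i, reply in enumerate(response):
    print(f"Conversation {i + 1}: {reply}")
```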
Running Qwen2-Audio-7B is just as simple.
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)

# The base model is prompted directly: an audio placeholder followed by a task instruction.
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=audio, return_tensors="pt")

generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
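In this scaffold, `<|audio_bos|><|AUDIO|><|audio_eos|>` marks where the audio features are injected, and the text after it is a free-form instruction. As an illustration, a different instruction can be swapped in; the wording below is our assumption, not an official task tag:

```python
# Hypothetical variation on the scaffold above; only the instruction text changes.
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Transcribe the speech into English text:"
```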
We strongly advise users, especially those in mainland China, to use ModelScope; `snapshot_download` can help you resolve issues when downloading checkpoints.
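A minimal sketch of fetching a checkpoint with ModelScope's `snapshot_download` (assuming the `modelscope` package is installed; the model id on ModelScope may differ slightly from the Hugging Face one):

```python
from modelscope import snapshot_download

# Downloads the checkpoint into the local ModelScope cache and returns its path.
model_dir = snapshot_download("Qwen/Qwen2-Audio-7B-Instruct")
print(model_dir)
```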
We provide a Web UI demo for users. Before you start, make sure you have installed the following dependencies:

```bash
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```bash
python demo/web_demo_audio.py
```
More examples will be posted to the Qwen2-Audio post on the Qwen (Tongyi Qianwen) blog.
We are the Qwen speech multimodal team, dedicated to extending Qwen with audio understanding and generation capabilities toward free and flexible audio interaction. The team is growing rapidly; if you are interested in joining us as an intern or full-time employee, please send your resume to qwen_audio@list.alibaba-inc.com.
Please check the license of each model in its Hugging Face repository. You do not need to submit a request for commercial use.
If you find our paper and code helpful for your research, please consider giving a star ⭐ and citing 📝 :)
```BibTeX
@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}

@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}
```
If you would like to leave a message for our research or product teams, please contact us by email at qianwen_opensource@alibabacloud.com.