中文  |  English



Qwen2-Audio-7B 🤖 | 🤗  | Qwen2-Audio-7B-Instruct 🤖 | 🤗  | Demo 🤖 | 🤗 
📑 Paper    |    📑 Blog    |    💬 WeChat (微信)   |    Discord  

We introduce the latest progress of Qwen-Audio: Qwen2-Audio. As a large-scale audio-language model, Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or responding directly with text according to speech instructions. We introduce two distinct audio interaction modes: voice chat and audio analysis.

  • Voice chat: users can freely engage in voice interactions with Qwen2-Audio without any text input;
  • Audio analysis: users can provide both audio and text instructions during the interaction to analyze the audio;

We have open-sourced two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.

Model Architecture and Training Paradigm

An overview of the three-stage training process of Qwen2-Audio.

News

  • 2024.8.9 🎉 We released the checkpoints of Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct on ModelScope and Hugging Face.
  • 2024.7.15 🎉 We released the Qwen2-Audio paper, introducing the model architecture, training methods, and model performance.
  • 2023.11.30 🔥 We released the Qwen-Audio series.

Evaluation

We evaluated the model's capabilities on 13 standard academic datasets, as follows:

| Task | Description | Dataset | Split | Metric |
|------|-------------|---------|-------|--------|
| ASR | Automatic Speech Recognition | Fleurs | dev \| test | WER |
| ASR | Automatic Speech Recognition | Aishell2 | test | WER |
| ASR | Automatic Speech Recognition | Librispeech | dev \| test | WER |
| ASR | Automatic Speech Recognition | Common Voice | dev \| test | WER |
| S2TT | Speech-to-Text Translation | CoVoST2 | test | BLEU |
| SER | Speech Emotion Recognition | Meld | test | ACC |
| VSC | Vocal Sound Classification | VocalSound | test | ACC |
| AIR-Bench | Chat-Benchmark-Speech | Fisher, SpokenWOZ, IEMOCAP, Common Voice | dev \| test | GPT-4 Eval |
| AIR-Bench | Chat-Benchmark-Sound | Clotho | dev \| test | GPT-4 Eval |
| AIR-Bench | Chat-Benchmark-Music | MusicCaps | dev \| test | GPT-4 Eval |
| AIR-Bench | Chat-Benchmark-Mixed-Audio | Common Voice, AudioCaps, MusicCaps | dev \| test | GPT-4 Eval |

The overall performance is shown below:

The detailed evaluation scores are listed below.
(Note: the evaluation results shown here were obtained with the initial model under the original training framework; some metrics fluctuated after conversion to the Hugging Face framework. We therefore present our full set of results, starting with the initial-model results reported in the paper.)

| Task | Dataset | Model | Metric | Results |
|------|---------|-------|--------|---------|
| ASR | Librispeech (dev-clean \| dev-other \| test-clean \| test-other) | SpeechT5 | WER | 2.1 \| 5.5 \| 2.4 \| 5.8 |
| | | SpeechNet | | - \| - \| 30.7 \| - |
| | | SLM-FT | | - \| - \| 2.6 \| 5.0 |
| | | SALMONN | | - \| - \| 2.1 \| 4.9 |
| | | SpeechVerse | | - \| - \| 2.1 \| 4.4 |
| | | Qwen-Audio | | 1.8 \| 4.0 \| 2.0 \| 4.2 |
| | | Qwen2-Audio | | 1.3 \| 3.4 \| 1.6 \| 3.6 |
| | Common Voice 15 (en \| zh \| yue \| fr) | Whisper-large-v3 | WER | 9.3 \| 12.8 \| 10.9 \| 10.8 |
| | | Qwen2-Audio | | 8.6 \| 6.9 \| 5.9 \| 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.5 |
| | Aishell2 (Mic \| iOS \| Android) | MMSpeech-base | WER | 4.5 \| 3.9 \| 4.0 |
| | | Paraformer-large | | - \| 2.9 \| - |
| | | Qwen-Audio | | 3.3 \| 3.1 \| 3.3 |
| | | Qwen2-Audio | | 3.0 \| 3.0 \| 2.9 |
| S2TT | CoVoST2 (en-de \| de-en \| en-zh \| zh-en) | SALMONN | BLEU | 18.6 \| - \| 33.1 \| - |
| | | SpeechLLaMA | | - \| 27.1 \| - \| 12.3 |
| | | BLSP | | 14.1 \| - \| - \| - |
| | | Qwen-Audio | | 25.1 \| 33.9 \| 41.5 \| 15.7 |
| | | Qwen2-Audio | | 29.9 \| 35.2 \| 45.2 \| 24.4 |
| | CoVoST2 (es-en \| fr-en \| it-en) | SpeechLLaMA | BLEU | 27.9 \| 25.2 \| 25.9 |
| | | Qwen-Audio | | 39.7 \| 38.5 \| 36.0 |
| | | Qwen2-Audio | | 40.0 \| 38.5 \| 36.3 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.553 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9392 |
| AIR-Bench | Chat Benchmark (Speech \| Sound \| Music \| Mixed-Audio) | SALMONN | GPT-4 | 6.16 \| 6.28 \| 5.95 \| 6.08 |
| | | BLSP | | 6.17 \| 5.55 \| 5.08 \| 5.33 |
| | | Pandagpt | | 3.58 \| 5.46 \| 5.06 \| 4.25 |
| | | Macaw-LLM | | 0.97 \| 1.01 \| 0.91 \| 1.01 |
| | | SpeechGPT | | 1.57 \| 0.95 \| 0.95 \| 4.13 |
| | | Next-gpt | | 3.86 \| 4.76 \| 4.18 \| 4.13 |
| | | Qwen-Audio | | 6.47 \| 6.95 \| 5.52 \| 6.08 |
| | | Gemini-1.5-pro | | 6.97 \| 5.49 \| 5.06 \| 5.27 |
| | | Qwen2-Audio | | 7.18 \| 6.99 \| 6.79 \| 6.77 |

(followed by the results after conversion to the Hugging Face framework:)

| Task | Dataset | Model | Metric | Results |
|------|---------|-------|--------|---------|
| ASR | Librispeech (dev-clean \| dev-other \| test-clean \| test-other) | SpeechT5 | WER | 2.1 \| 5.5 \| 2.4 \| 5.8 |
| | | SpeechNet | | - \| - \| 30.7 \| - |
| | | SLM-FT | | - \| - \| 2.6 \| 5.0 |
| | | SALMONN | | - \| - \| 2.1 \| 4.9 |
| | | SpeechVerse | | - \| - \| 2.1 \| 4.4 |
| | | Qwen-Audio | | 1.8 \| 4.0 \| 2.0 \| 4.2 |
| | | Qwen2-Audio | | 1.7 \| 3.6 \| 1.7 \| 4.0 |
| | Common Voice 15 (en \| zh \| yue \| fr) | Whisper-large-v3 | WER | 9.3 \| 12.8 \| 10.9 \| 10.8 |
| | | Qwen2-Audio | | 8.7 \| 6.5 \| 5.9 \| 9.6 |
| | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | | 7.0 |
| | Aishell2 (Mic \| iOS \| Android) | MMSpeech-base | WER | 4.5 \| 3.9 \| 4.0 |
| | | Paraformer-large | | - \| 2.9 \| - |
| | | Qwen-Audio | | 3.3 \| 3.1 \| 3.3 |
| | | Qwen2-Audio | | 3.2 \| 3.1 \| 2.9 |
| S2TT | CoVoST2 (en-de \| de-en \| en-zh \| zh-en) | SALMONN | BLEU | 18.6 \| - \| 33.1 \| - |
| | | SpeechLLaMA | | - \| 27.1 \| - \| 12.3 |
| | | BLSP | | 14.1 \| - \| - \| - |
| | | Qwen-Audio | | 25.1 \| 33.9 \| 41.5 \| 15.7 |
| | | Qwen2-Audio | | 29.6 \| 33.6 \| 45.6 \| 24.0 |
| | CoVoST2 (es-en \| fr-en \| it-en) | SpeechLLaMA | BLEU | 27.9 \| 25.2 \| 25.9 |
| | | Qwen-Audio | | 39.7 \| 38.5 \| 36.0 |
| | | Qwen2-Audio | | 38.7 \| 37.2 \| 35.2 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | | 0.557 |
| | | Qwen2-Audio | | 0.535 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | | 0.6035 |
| | | Qwen-Audio | | 0.9289 |
| | | Qwen2-Audio | | 0.9395 |
| AIR-Bench | Chat Benchmark (Speech \| Sound \| Music \| Mixed-Audio) | SALMONN | GPT-4 | 6.16 \| 6.28 \| 5.95 \| 6.08 |
| | | BLSP | | 6.17 \| 5.55 \| 5.08 \| 5.33 |
| | | Pandagpt | | 3.58 \| 5.46 \| 5.06 \| 4.25 |
| | | Macaw-LLM | | 0.97 \| 1.01 \| 0.91 \| 1.01 |
| | | SpeechGPT | | 1.57 \| 0.95 \| 0.95 \| 4.13 |
| | | Next-gpt | | 3.86 \| 4.76 \| 4.18 \| 4.13 |
| | | Qwen-Audio | | 6.47 \| 6.95 \| 5.52 \| 6.08 |
| | | Gemini-1.5-pro | | 6.97 \| 5.49 \| 5.06 \| 5.27 |
| | | Qwen2-Audio | | 7.24 \| 6.83 \| 6.73 \| 6.42 |

We provide all of the above evaluation scripts for reproducing our experimental results. Please read eval_audio/EVALUATION.md for more information.

Requirements

The code of Qwen2-Audio has been merged into the main branch of the latest Hugging Face Transformers. We advise you to build from source with the command pip install git+https://github.com/huggingface/transformers, or you might encounter the following error:

KeyError: 'qwen2-audio'
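
To quickly verify that your installed Transformers includes Qwen2-Audio support, a minimal sanity check (a sketch) is to import the model class directly:

import transformers
print(transformers.__version__)
# The import below fails on releases that predate Qwen2-Audio support
from transformers import Qwen2AudioForConditionalGeneration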

Quickstart

We provide simple examples showing how to quickly use Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct with 🤗 Transformers. Before you start, make sure you have set up your environment and installed the required packages. Most importantly, ensure that you meet the requirements above, and then install the dependent libraries. You can then use our models via either Transformers or ModelScope. Currently, Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct perform best on audio clips of up to 30 seconds.
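
Because of the 30-second sweet spot, one simple preprocessing option is to truncate longer inputs before passing them to the processor. This is a minimal sketch, assuming plain truncation is acceptable for your use case ("example.wav" is a placeholder path):

import librosa
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
sr = processor.feature_extractor.sampling_rate
audio, _ = librosa.load("example.wav", sr=sr)  # load at the rate the feature extractor expects
audio = audio[: 30 * sr]  # keep only the first 30 seconds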

🤗 Hugging Face Transformers

To use Qwen2-Audio-7B-Instruct for inference, we demonstrate the voice chat and audio analysis interaction modes separately below; all it takes is a few lines of code, as shown.

Voice Chat Inference

In voice chat mode, users can interact with Qwen2-Audio by voice alone, without any text input:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
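# Collect every audio clip referenced in the conversation, resampled to the rate the feature extractor expects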
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()), 
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
Audio Analysis Inference

In audio analysis mode, users can provide audio together with text questions to analyze the audio:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'}, 
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()), 
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
Batch Inference

We also support batch inference:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversations = [conversation1, conversation2]

text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]

audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele['audio_url']).read()), 
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
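
Here batch_decode returns one decoded string per input conversation, in order. For example:

for i, r in enumerate(response):
    print(f"Conversation {i + 1}: {r}")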

Running Qwen2-Audio-7B is just as simple:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)

prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=audio, return_tensors="pt")

generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
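
For the base (pretrained) model, the prompt wraps the audio between <|audio_bos|> and <|audio_eos|> and follows it with a free-form task instruction. As an illustration, you could swap the captioning instruction above for a transcription-style one (the exact wording below is an assumption for illustration, not a fixed API):

# Illustrative prompt variant (assumed wording), following the same <|audio_bos|><|AUDIO|><|audio_eos|> pattern
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Detect the language and recognize the speech:"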

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. snapshot_download can help you resolve issues when downloading checkpoints.
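
A minimal sketch of downloading a checkpoint via ModelScope (assuming the model is published under the ModelScope ID "qwen/Qwen2-Audio-7B-Instruct"):

from modelscope import snapshot_download

# Downloads the checkpoint into the local ModelScope cache and returns its path
model_dir = snapshot_download("qwen/Qwen2-Audio-7B-Instruct")
print(model_dir)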

Demo

Web UI

We provide a Web UI demo. Before you start, make sure the following dependencies are installed:

pip install -r requirements_web_demo.txt

Then run the command below and click on the generated link:

python demo/web_demo_audio.py

Examples

More examples will be updated in the Qwen2-Audio post on the Tongyi Qianwen (Qwen) blog.

We Are Hiring

We are the Tongyi Qianwen (Qwen) speech multimodal team. Building on Qwen, we work on extending audio multimodal understanding and generation capabilities to enable free and flexible audio interaction. The team is growing rapidly; if you are interested in joining us as an intern or full-time employee, please send your resume to qwen_audio@list.alibaba-inc.com.

License Agreement

Please check the license of each model in its Hugging Face repository. You do not need to submit a request for commercial use.

Citation

If you find our paper and code helpful for your research, please consider giving us a star ⭐ and a citation 📝 :)

@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}
@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}

Contact Us

If you would like to leave a message for our research or product teams, please contact us by email at qianwen_opensource@alibabacloud.com.