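DeepSeek-VL2 is a series of large Mixture-of-Experts (MoE) vision-language models that significantly improves on its predecessor, DeepSeek-VL. It performs strongly across a range of tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. The series has three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters respectively. Compared with existing open-source dense and MoE-based models, DeepSeek-VL2 achieves competitive or state-of-the-art performance with a similar or smaller number of activated parameters.

Note: the overall architecture diagram of DeepSeek-VL2 shown above is taken from the paper.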
Model weights supported by this repository:
Model |
---|
deepseek-ai/deepseek-vl2-tiny |
deepseek-ai/deepseek-vl2-small |
deepseek-ai/deepseek-vl2 |
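Note: the names match the Hugging Face weights, but the weights are stored as Paddle tensors. Calling xxx.from_pretrained("deepseek-ai/deepseek-vl2-tiny") automatically downloads the weight folder to the cache directory.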
2) pip install pillow tqdm paddlenlp==3.0.0b3
Note: Python 3.10 or later is recommended.
Note: to run the following commands on a V100, set dtype="float16". To use the deepseek-vl2-small model, change model_path to "deepseek-ai/deepseek-vl2-small".
# Deepseek-vl2-tiny single image understanding
python paddlemix/examples/deepseek_vl2/single_image_infer.py \
--model_path="deepseek-ai/deepseek-vl2-tiny" \
--image_file="paddlemix/demo_images/examples_image2.jpg" \
--question="The Panda" \
--dtype="bfloat16"
# Deepseek-vl2-tiny multi image understanding
python paddlemix/examples/deepseek_vl2/multi_image_infer.py \
--model_path="deepseek-ai/deepseek-vl2-tiny" \
--image_file_1="paddlemix/demo_images/examples_image1.jpg" \
--image_file_2="paddlemix/demo_images/examples_image2.jpg" \
--image_file_3="paddlemix/demo_images/twitter3.jpeg" \
--question="Can you tell me what are in the images?" \
--dtype="bfloat16"
# Deepseek-vl2-tiny incremental prefilling inference with KV cache
python paddlemix/examples/deepseek_vl2/increment_prefilling_infer.py \
--model_path="deepseek-ai/deepseek-vl2-tiny" \
--image_file_1="paddlemix/demo_images/examples_image1.jpg" \
--image_file_2="paddlemix/demo_images/examples_image2.jpg" \
--image_file_3="paddlemix/demo_images/twitter3.jpeg" \
--question="Can you tell me what are in the images?" \
--dtype="bfloat16"
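The V100 note before the commands comes down to hardware support: bfloat16 needs a GPU with compute capability 8.0 (Ampere) or newer, while the V100 is 7.0. The sketch below shows one way to choose the `--dtype` value and assemble the single-image command as an argument list. The helper names `pick_dtype` and `build_command` are hypothetical and not part of PaddleMIX, and the compute capability is passed in by hand rather than queried from the device.

```python
def pick_dtype(compute_capability):
    """Return a --dtype string for a GPU given its (major, minor) compute capability."""
    major, _minor = compute_capability
    # Assumption: bfloat16 is only used on compute capability 8.0 or newer;
    # older GPUs such as the V100 (7.0) fall back to float16.
    return "bfloat16" if major >= 8 else "float16"


def build_command(model_path, image_file, question, compute_capability):
    """Assemble the single-image inference command as an argument list."""
    return [
        "python",
        "paddlemix/examples/deepseek_vl2/single_image_infer.py",
        f"--model_path={model_path}",
        f"--image_file={image_file}",
        f"--question={question}",
        f"--dtype={pick_dtype(compute_capability)}",
    ]


if __name__ == "__main__":
    # Example: a V100 has compute capability (7, 0).
    command = build_command(
        "deepseek-ai/deepseek-vl2-tiny",
        "paddlemix/demo_images/examples_image2.jpg",
        "The Panda",
        (7, 0),
    )
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` from the repository root. Passing arguments as a list avoids shell quoting problems when the question contains spaces.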
1) DeepSeek-VL2-tiny Single Image Understanding
<|User|>: <image>
<|ref|>The Panda<|/ref|>.
<|Assistant|>: <|ref|>The Panda<|/ref|><|det|>[[100, 192, 998, 998]]<|/det|><|end▁of▁sentence|>
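The grounding answer above wraps the referred phrase in `<|ref|>…<|/ref|>` and its boxes in `<|det|>…<|/det|>`. A minimal sketch of turning that string into pixel coordinates follows. It assumes the box values are `[x1, y1, x2, y2]` normalized to the range 0 to 999, which should be checked against the model's own post-processing before relying on it. The function name `parse_grounding` and the `scale` parameter are hypothetical.

```python
import ast
import re

# One <|ref|>label<|/ref|> immediately followed by <|det|>[[...]]<|/det|>.
# A <|ref|> in the echoed prompt that has no <|det|> after it is skipped.
GROUNDING_PATTERN = re.compile(
    r"<\|ref\|>([^<]*)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>"
)


def parse_grounding(text, image_width, image_height, scale=999):
    """Return a list of (label, (x1, y1, x2, y2)) tuples in pixel coordinates."""
    results = []
    for label, boxes_text in GROUNDING_PATTERN.findall(text):
        # boxes_text looks like "[[100, 192, 998, 998]]"; parse it safely.
        for x1, y1, x2, y2 in ast.literal_eval(boxes_text):
            # Integer arithmetic avoids floating-point rounding surprises.
            box = (
                x1 * image_width // scale,
                y1 * image_height // scale,
                x2 * image_width // scale,
                y2 * image_height // scale,
            )
            results.append((label, box))
    return results


if __name__ == "__main__":
    answer = "<|ref|>The Panda<|/ref|><|det|>[[100, 192, 998, 998]]<|/det|>"
    # The image size here is made up for illustration.
    print(parse_grounding(answer, image_width=1998, image_height=999))
```

The returned boxes can be drawn with Pillow's `ImageDraw.rectangle`, since `pillow` is already among the listed dependencies.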
2) DeepSeek-VL2-tiny Multi Image Understanding
<|User|>: This is image_1: <image>
This is image_2: <image>
This is image_3: <image>
Can you tell me what are in the images?
<|Assistant|>: The first image shows a red panda resting on a wooden platform. The second image features a giant panda sitting among bamboo plants. The third image captures a rocket launch at night, with the bright trail of the rocket illuminating the sky.<|end▁of▁sentence|>
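In the multi-image transcript above, each image gets its own `This is image_N: <image>` line before the question. A small sketch that reproduces this layout for any number of images is shown below. Whether `multi_image_infer.py` builds its prompt exactly this way is an assumption, and `build_multi_image_prompt` is a hypothetical helper that only mirrors the printed transcript.

```python
def build_multi_image_prompt(num_images, question, image_token="<image>"):
    """Build a user prompt with one numbered placeholder line per image."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # Images are numbered from 1, matching the transcript above.
    lines = [
        f"This is image_{index}: {image_token}"
        for index in range(1, num_images + 1)
    ]
    lines.append(question)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_multi_image_prompt(3, "Can you tell me what are in the images?"))
```

Keeping the number of placeholder lines equal to the number of image files passed on the command line is the point of the helper: each `<image>` token stands for one input image.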
@misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels,
title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding},
author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan},
year={2024},
eprint={2412.10302},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.10302},
}