
mplug-owl3-7b-chat fine-tuning document #1969

Open
Jintao-Huang opened this issue Sep 7, 2024 · 17 comments
Labels
good first issue Good for newcomers

Comments

@Jintao-Huang
Collaborator

Jintao-Huang commented Sep 7, 2024

Model:

Fine-tuning a multimodal large model usually involves a custom dataset. Here, we demonstrate a runnable demo.

Fine-tuned Dataset:

Before starting the fine-tuning, please ensure that your environment is properly prepared:

git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .[llm]

pip install decord icecream
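A quick way to verify that the extra video-related dependencies from the command above installed correctly is to probe for them before launching training; this helper is just an illustrative sketch, not part of swift:

```python
import importlib.util

def check_deps(packages):
    """Return the subset of package names that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

missing = check_deps(["decord", "icecream"])
if missing:
    print("missing packages:", ", ".join(missing))
else:
    print("all extra dependencies are available")
```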

Inference

# ModelScope
CUDA_VISIBLE_DEVICES=0 swift infer \
  --model_type mplug-owl3-7b-chat \
  --model_id_or_path iic/mPLUG-Owl3-7B-240728

# HuggingFace
USE_HF=1 CUDA_VISIBLE_DEVICES=0 swift infer \
  --model_type mplug-owl3-7b-chat \
  --model_id_or_path mPLUG/mPLUG-Owl3-7B-240728

Results

<<< who are you
I am an AI language model, designed to assist with a variety of tasks such as answering questions and providing information. I do not have a physical form, but rather exist as a program running on a computer. Is there anything specific you would like me to help you with?
--------------------------------------------------
<<< <image>describe the image
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
[INFO:swift] Setting max_num_frames: 16. You can adjust this hyperparameter through the environment variable: `MAX_NUM_FRAMES`.
This is a very cute photo of a kitten! The kitten has beautiful blue eyes and a very fluffy coat. It's adorable to see how it looks at the camera. The colors in the photo are very natural and well-balanced, which adds to the overall cuteness of the image. Great job capturing this adorable moment!
--------------------------------------------------
<<< clear
<<< <video>describe the video
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4
The video captures a young child's interest in reading and learning, as they are seen sitting on a bed and flipping through the pages of a book while wearing glasses. The child appears to be engaged and curious about the content of the book.

GPU Memory:
[screenshot: GPU memory usage, 2024-09-07]

@Jintao-Huang
Collaborator Author

image fine-tuning

The format of the custom dataset is as follows (single image, multiple images, and no image):

{"query": "<image>55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee<image>eeeee<image>eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": []}
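A `train.jsonl` in the three formats above can be produced with a few lines of Python; the sample queries, responses, and image paths here are hypothetical placeholders:

```python
import json

# Hypothetical samples covering the three supported formats:
# single image, multiple images, and text-only with history.
samples = [
    {"query": "<image>Describe this picture.", "response": "A cat.",
     "images": ["cat.png"]},
    {"query": "Compare <image> and <image>.", "response": "They differ in color.",
     "history": [], "images": ["a.png", "b.png"]},
    {"query": "Thanks!", "response": "You're welcome.",
     "history": [["Hello", "Hi there"]], "images": []},
]

# Write one JSON object per line, as swift's custom-dataset loader expects.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```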

Fine-tuning script:

# ModelScope
CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft \
  --model_type mplug-owl3-7b-chat \
  --model_id_or_path iic/mPLUG-Owl3-7B-240728 \
  --sft_type lora \
  --dataset coco-en-mini#20000 \
  --deepspeed default-zero2 \
  --output_dir output \
  --num_train_epochs 5

If you want to use a custom dataset, simply specify as follows:

  --dataset train.jsonl \
  --val_dataset val.jsonl \

Here is the inference script after fine-tuning; we perform inference on the automatically split validation set:

# If using HuggingFace, please add: `USE_HF=1`
# inference only
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/mplug-owl3-7b-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true --show_dataset_sample 10

# merge-lora & inference
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/mplug-owl3-7b-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true --merge_lora true --show_dataset_sample 10

video fine-tuning

The format of the custom dataset is as follows:

{"query": "<video>55555", "response": "66666", "videos": ["video_path"]}
{"query": "eeeee<video>eeeee<video>eeeee", "response": "fffff", "history": [], "videos": ["video_path1", "video_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "videos": []}

Fine-tuning script:

# ModelScope
CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft \
  --model_type mplug-owl3-7b-chat \
  --model_id_or_path iic/mPLUG-Owl3-7B-240728 \
  --sft_type lora \
  --dataset video-chatgpt \
  --deepspeed default-zero2 \
  --output_dir output \
  --num_train_epochs 5

@Jintao-Huang Jintao-Huang added the good first issue Good for newcomers label Sep 7, 2024
@Jintao-Huang Jintao-Huang changed the title mplug-owl3-7b-chat fine-tuning best practices. mplug-owl3-7b-chat fine-tuning document Sep 7, 2024
@ozhyo

ozhyo commented Sep 13, 2024

Hi, I tried the image fine-tuning example code and found that during training the model does not actually use the image and media_offset; the data_collator seems to ignore these two values, so the images are never used in the model's forward pass.

@Jintao-Huang
Collaborator Author

I'll fix it.

@Jintao-Huang Jintao-Huang added the bug Something isn't working label Sep 14, 2024
@Jintao-Huang Jintao-Huang mentioned this issue Sep 14, 2024
1 task
@Jintao-Huang Jintao-Huang removed the bug Something isn't working label Sep 17, 2024
@Jintao-Huang
Collaborator Author

Hi, I tried the image fine-tuning example code and found that during training the model does not actually use the image and media_offset; the data_collator seems to ignore these two values, so the images are never used in the model's forward pass.

fixed

@ozhyo

ozhyo commented Sep 19, 2024

Hi, thanks for the fix.
I tested it, and the model can now use image and media_offset for forward and training, but only with batch_size=1. It seems media_offset is not padded, so the lengths don't match and samples cannot be combined into a batch.

@goodstudent9

Indeed, I tried it as well; batch_size = 2 does not work.

[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/home/project/tools/anaconda3/envs/owl3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/home/project/tools/anaconda3/envs/owl3/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank0]:     return self.collate_fn(data)
[rank0]:   File "/home/project/ruohangxu/ms-swift/swift/llm/utils/template.py", line 3318, in data_collator
[rank0]:     res['media_offset'] = torch.concat(media_offset)
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 122 but got size 124 for tensor number 1 in the list.

@Jintao-Huang
Collaborator Author

Indeed, I tried it as well; batch_size = 2 does not work. (traceback quoted above)

Yes, batch_size=2 is not supported, and I don't know how to support it yet. I tried padding, but it throws an error inside the owl3 code.

@goodstudent9

goodstudent9 commented Sep 22, 2024 via email

@goodstudent9

goodstudent9 commented Sep 22, 2024 via email

@Jintao-Huang
Collaborator Author

I changed the code so it can support batching, but I'm not sure about the resulting quality. I'm currently verifying my code against the batch_size=1 results on the training set you provided; if it checks out, I'll follow up with you.

If the results look good, a PR would be welcome.

@goodstudent9

goodstudent9 commented Sep 23, 2024 via email

@ozhyo

ozhyo commented Sep 23, 2024

When an image is present, pad with [0, 0]; when there is no image, use all [0, -1000000] (any negative value works here). Is that right?
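The padding scheme proposed above (pad with [0, 0] for samples that contain media, and fill media-free samples entirely with a negative sentinel such as [0, -1000000]) can be sketched in plain Python; the list shapes and helper name are illustrative assumptions, not the actual swift collator, which works on torch tensors:

```python
def pad_media_offsets(batch, no_media=(0, -1000000)):
    """Right-pad per-sample media_offset lists to the batch max length.

    Samples that contain media are padded with [0, 0]; samples with no
    media are filled entirely with the negative sentinel, so every row
    in the batch ends up the same length and can be stacked.
    (Sketch only; the real collator operates on tensors.)
    """
    max_len = max((len(s) for s in batch), default=0)
    padded = []
    for sample in batch:
        if sample:  # has media: pad the tail with [0, 0]
            padded.append(sample + [[0, 0]] * (max_len - len(sample)))
        else:       # no media at all: fill with the sentinel
            padded.append([list(no_media)] * max_len)
    return padded

batch = [
    [[0, 1], [2, 3]],  # sample with two media offsets
    [[0, 1]],          # shorter sample -> one [0, 0] pad entry
    [],                # no media -> all sentinel entries
]
print(pad_media_offsets(batch))
```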

@goodstudent9

goodstudent9 commented Sep 23, 2024 via email

@Jintao-Huang
Collaborator Author

PR: #2100

@goodstudent9

Found some bugs when doing full-parameter fine-tuning.
mPLUG-Owl3 means a lot to me; thank you so much for your valuable help!

#2158

@goodstudent9

Found some bugs when doing full-parameter fine-tuning. mPLUG-Owl3 means a lot to me; thank you so much for your valuable help!

#2158

solved!

@goodstudent9

#2172 (comment)
An error occurs when running inference with images.
