The llava-onevision model video inference code has an error #144

Open
AmazDeng opened this issue Aug 14, 2024 · 16 comments

AmazDeng commented Aug 14, 2024

For the llava-onevision model, the official video inference code does not modify the image_aspect_ratio parameter, resulting in the use of the default anyres_max_9. This causes the image_features to occupy a huge amount of GPU memory during inference. Is this problematic? After all, the paper states that each frame consists of 196 tokens, but using anyres_max_9 results in a number of tokens per frame far exceeding 196. Relevant links are as follows:

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb
#142

Additionally, why can't I find the logic in the code that maps each frame to 196 tokens?

AmazDeng (Author) commented

@kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?

kcz358 (Collaborator) commented Aug 14, 2024

I think there is a small error in the Jupyter notebook. Passing modalities=['video'] should lower the token usage.
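For reference, that just means including the modalities keyword in the generate call. A minimal sketch (the variable names mirror the inference script quoted later in this thread, so it is not self-contained on its own):

# Sketch: pass modalities=["video"] so generate() routes the frames through
# the video branch (which pools tokens per frame) rather than the image path.
cont = model.generate(
    input_ids,
    images=image_tensors,        # preprocessed frames for one clip
    image_sizes=image_sizes,
    modalities=["video"],        # one entry per video in the batch
    do_sample=False,
    max_new_tokens=512,
)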

Luodian (Contributor) commented Aug 14, 2024

> @kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?

Sorry, we found that we had wrongly added some video-specific logic in our llava_arch.py in commit c121c20.

We have now reverted it. Please try the updated code, thanks!

AmazDeng (Author) commented Aug 15, 2024

@Luodian @ZhangYuanhan-AI @kcz358
Thank you for your response.
I need to point out that the reason for the excessively high GPU memory usage during video inference is that, after process_images completes, the image_tensors dimensions are extremely large. For a single frame the shape is [16, 3, 384, 384], and for 32 frames it becomes [512, 3, 384, 384]. This happens at the stage where image_tensors = process_images(video_frames, image_processor, model.config) is executed, not during the generate stage. Therefore, even if you pass modalities=["video"] during the generate stage, it doesn't help.

The reason the first dimension of the process_images output for a single frame is 16 is that image_aspect_ratio="anyres_max_9". The "anyres_max_9" setting is meant for single-image inference, not for video inference. I tested with the latest code you modified, and the result is the same: GPU memory usage is still very high (about 57 GB for 24 frames), and the generated tensor still does not have 196 tokens per frame.
So, does the process_images method also need some modifications?
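To put rough numbers on this, here is a back-of-the-envelope count of vision tokens for such a clip. It is illustrative only: it assumes 729 vision tokens per 384×384 crop (the figure discussed later in this thread), the 16 anyres crops per frame observed above, and the paper's 196 tokens per frame after video pooling.

# Rough, illustrative token arithmetic for a 24-frame clip.
tokens_per_crop = 729      # assumed tokens per 384x384 crop (see discussion below)
crops_per_frame = 16       # observed shape [16, 3, 384, 384] per frame with anyres_max_9
num_frames = 24

anyres_path = num_frames * crops_per_frame * tokens_per_crop   # 279936 tokens
video_path = num_frames * 196                                   # 4704 tokens
print(anyres_path, video_path)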

Inference code:

import argparse
import torch
import sys
# print(f"before,sys.path============={sys.path}")
sys.path.append("/media/star/8T/PycharmProjects/github/gpt/LLaVA-NeXT")
# print(f"after,sys.path============={sys.path}")
import time

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import torch
import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings

warnings.filterwarnings("ignore")
# Load the OneVision model
pretrained = "/media/star/8T/model/gpt/llava/llava-next/lmms-lab/llava-onevision/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)

model.eval()


# Function to extract frames from video
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Load and process video
video_path = "/media/star/8T/tmp/gpt4v/video/zouxiu2_5/clip_135_140.mp4"
num_frames=24


print(f"num_frames={num_frames}")
video_frames = extract_frames(video_path,num_frames=num_frames)
print(f"model.config={model.config}")
image_tensors = process_images(video_frames, image_processor, model.config)


image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]
print(f"image_tensors.shape={[image_tensor.shape for image_tensor in image_tensors]}")
# Prepare conversation input
conv_template = "qwen_1_5"
question = f"{DEFAULT_IMAGE_TOKEN}\nIs the model changing clothes in the video? answer the question using a single word or phrase."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]
print(f"image_sizes={image_sizes[:2]}")
# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])

kcz358 (Collaborator) commented Aug 15, 2024

On the 729 dimension:
Hi, I think that is correct, because there is no pooling operation in the image-encoding function. The pooling happens after encode_images, in get_2dPool, and after that step the features should become 196 tokens per frame.

for idx, image_feat in enumerate(encoded_image_features):
    if idx in video_idx_in_batch:
        image_features.append(self.get_2dPool(image_feat))
    else:
        image_features.append(image_feat)
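For intuition: 729 is the 27×27 patch grid SigLIP produces for a 384×384 frame, and pooling it with stride 2 gives ceil(27/2) = 14, i.e. 14×14 = 196 tokens per frame. Below is a rough, self-contained sketch of that pooling step; it is not the exact get_2dPool implementation (the real pool mode and stride come from the model config), just an illustration of the shape change.

import math
import torch
import torch.nn.functional as F

def pool_frame_tokens(feats: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Rough stand-in for the 2x2 spatial pooling applied to video frames."""
    num_frames, num_tokens, dim = feats.shape
    side = int(math.sqrt(num_tokens))                      # 729 -> 27
    x = feats.view(num_frames, side, side, dim).permute(0, 3, 1, 2)
    out = math.ceil(side / stride)                         # 27 -> 14
    x = F.interpolate(x, size=(out, out), mode="bilinear")
    return x.permute(0, 2, 3, 1).reshape(num_frames, out * out, dim)

feats = torch.randn(24, 729, 1152)      # 24 frames x 729 tokens (feature dim is illustrative)
print(pool_frame_tokens(feats).shape)   # torch.Size([24, 196, 1152])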


On using process_images for video:

Yes, I agree with you. There is an error in the tutorial again: it should not use process_images from mm_utils, which uses far more tokens per frame than expected.

You should use the image processor to handle the frames instead:

image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().cuda()

Thank you for pointing it out. I will check it later and revise the notebook.
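Putting that suggestion together, the corrected video path would look roughly like the sketch below. It assumes the video_frames, input_ids, tokenizer, and model variables from the script above; the revised tutorial notebook in PR #152 is the authoritative version.

# Preprocess the whole clip with the image processor (no anyres cropping),
# then pass it as ONE tensor with a matching ["video"] modality entry.
frames_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
frames_tensor = frames_tensor.half().cuda()         # [num_frames, 3, 384, 384]

cont = model.generate(
    input_ids,
    images=[frames_tensor],                          # one tensor per video in the batch
    image_sizes=[frame.size for frame in video_frames],
    modalities=["video"],
    do_sample=False,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(cont, skip_special_tokens=True)[0])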

AmazDeng (Author) commented Aug 15, 2024

> image_features.append(self.get_2dPool(image_feat))

The code you referenced is located in the encode_multimodals method. However, in the main branch of the llava_arch.py code, encode_multimodals is commented out. @kcz358

kcz358 (Collaborator) commented Aug 15, 2024

These lines contain the processing logic; they are not part of encode_multimodals.

for idx, image_feat in enumerate(encoded_image_features):
    if idx in video_idx_in_batch:
        image_features.append(self.get_2dPool(image_feat))
    else:
        image_features.append(image_feat)

AmazDeng (Author) commented Aug 15, 2024

> These lines contain the processing logic; they are not part of encode_multimodals.
>
>     for idx, image_feat in enumerate(encoded_image_features):
>         if idx in video_idx_in_batch:
>             image_features.append(self.get_2dPool(image_feat))
>         else:
>             image_features.append(image_feat)

I did as you said and replaced "process_images" with "image_processor". I printed the shape after the statement "image_features.append(self.get_2dPool(image_feat))", but 196 still did not appear in the shapes.

I am using the llava-onevision-qwen2-7b-ov version and ran both local and online tests on the same video (https://llava-onevision.lmms-lab.com/). The results were "yes" and "no," respectively. The prompt was "Is the model changing clothes in the video? Answer the question using a single word or phrase." Clearly, the online result was correct and the local result was wrong.
Therefore, I think there are still some issues with the code.


kcz358 (Collaborator) commented Aug 15, 2024

The problem is that you are still processing the video with incorrect logic, even though you are now using image_processor to process the frames. The video frames are treated as multiple images instead of as a video. You can see that the first frame has 196 dimensions, but the rest are not being pooled. I have fixed the logic for reading videos in the OneVision tutorial notebook in PR #152. Here are the results I get:

[screenshot: per-frame feature shapes after pooling]

All the video frames are now being pooled correctly. Hope this helps.

Luodian (Contributor) commented Aug 15, 2024

Thank you Kaichen, it's great to see the problem has been addressed. I also tested on my side and it works.

AmazDeng (Author) commented

> The problem is that you are still processing the video with incorrect logic, even though you are now using image_processor to process the frames. The video frames are treated as multiple images instead of as a video. You can see that the first frame has 196 dimensions, but the rest are not being pooled. I have fixed the logic for reading videos in the OneVision tutorial notebook in PR #152. Here are the results I get: [screenshot]
>
> All the video frames are now being pooled correctly. Hope this helps.

I understand now. In my original approach I passed only a single ["video"] entry in modalities, so only the first frame was read as a video; the subsequent frames were all processed as images.
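In other words, the difference between the two call patterns comes down to how the clip is batched. A toy illustration with random tensors (shapes only, no model needed; the "only the first entry is pooled" behavior is what kcz358 observed above):

import torch

frames = torch.randn(24, 3, 384, 384)    # 24 preprocessed frames of one clip

# What the original script effectively passed: 24 separate tensors but a
# single "video" entry, so only batch index 0 takes the video (pooled) path.
images_as_images = list(frames)           # 24 tensors of shape [3, 384, 384]
modalities_short = ["video"]              # length 1

# What the fixed tutorial passes: one stacked tensor per clip, so every
# frame of the clip goes through the video path and gets pooled.
images_as_video = [frames]                # 1 tensor of shape [24, 3, 384, 384]
modalities_ok = ["video"]                 # one entry per video in the batch

print(len(images_as_images), images_as_images[0].shape)
print(len(images_as_video), images_as_video[0].shape)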

hulianyuyy commented

Many thanks for your question. In the tutorial it now works normally, but in the video inference code used for the evaluation benchmarks, would it still incur huge memory costs?

kcz358 (Collaborator) commented Aug 16, 2024

Yes, the lmms_eval evaluation logic is correct. I fixed the tutorial part using the code from lmms_eval.

hulianyuyy commented

> Yes, the lmms_eval evaluation logic is correct. I fixed the tutorial part using the code from lmms_eval.

Thanks, the lmms_eval evaluation logic is correct. But when I evaluate with the 7B model, it still uses about 70 GB of memory, which is very high given that LLaVA-NeXT-Video-7B only occupies about 20 GB. Maybe there is still something wrong with the inference code?

Luodian (Contributor) commented Aug 16, 2024

[screenshot of GPU memory usage]

I did a quick test; it runs in about 20 GB.

My script is:

FINAL_RUN_NAME=$1
TASKS=$2

MODEL_BASENAME=$(basename "$FINAL_RUN_NAME")

echo "MODEL_BASENAME: ${MODEL_BASENAME}"
cd /mnt/bn/vl-research/workspace/boli01/projects/lmms-eval

python3 -m accelerate.commands.launch --num_processes 8 --main_process_port 12399 lmms_eval \
    --model llava_onevision \
    --model_args pretrained=${FINAL_RUN_NAME},conv_template=qwen_1_5,model_name=llava_qwen \
    --tasks ${TASKS} \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix ${MODEL_BASENAME} \
    --output_path ./logs

bash /mnt/bn/vl-research/workspace/boli01/projects/lmms-eval/scripts/llava_one_vision/ov_eval.sh lmms-lab/llava-onevision-qwen2-7b-ov videomme;

hulianyuyy commented

Thanks for your reply. I will try it.
