[Bug]: assert len(indices) == len(inputs) with Qwen/Qwen2-VL-2B-Instruct
#9128
Comments
You should follow the prompt template as shown on their HuggingFace repo. The easiest way is to use |
If I were to use |
Please read the examples on their HuggingFace repo. The format of |
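A hedged sketch of the prompt-template route for offline inference (not code from this thread): it assumes the Hugging Face processor's chat template produces the image placeholder that vLLM expects for this model, and that llm.generate accepts a prompt dict with "multi_modal_data".
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

# Build the exact prompt string the model expects, rather than passing a bare text prompt.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM("Qwen/Qwen2-VL-2B-Instruct")
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": Image.new("RGB", (224, 224))}},
    SamplingParams(max_tokens=120),
)
print(outputs[0].outputs[0].text)
The LLM.chat route used later in this thread builds the same prompt internally from the messages list.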
Modified my code:
from vllm import LLM, SamplingParams
from PIL import Image

if __name__ == "__main__":
    vllm_engine = LLM("Qwen/Qwen2-VL-2B-Instruct")
    sampling_params = SamplingParams(max_tokens=120)
    num_images = 3
    messages = [{"role": "user", "content": []}]
    for _ in range(num_images):
        new_image = {"type": "image", "image": Image.new("RGB", (224, 224))}
        messages[0]["content"].append(new_image)
    messages[0]["content"].append({"type": "text", "text": "Describe this image."})
    outputs = vllm_engine.chat(messages, sampling_params)
    print(outputs)
Now it leads to:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/sayak/diffusers/check_video_vllm.py", line 32, in <module>
[rank0]: outputs = vllm_engine.chat(messages, sampling_params)
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/llm.py", line 556, in chat
[rank0]: conversation, mm_data = parse_chat_messages(
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/chat_utils.py", line 487, in parse_chat_messages
[rank0]: sub_messages = _parse_chat_message_content(msg, mm_tracker)
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/chat_utils.py", line 440, in _parse_chat_message_content
[rank0]: result = _parse_chat_message_content_parts(
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/chat_utils.py", line 402, in _parse_chat_message_content_parts
[rank0]: raise NotImplementedError(f"Unknown part type: {part_type}")
[rank0]: NotImplementedError: Unknown part type: image |
for _ in range(num_images):
    new_image = {"type": "image", "image": Image.new("RGB", (224, 224))}
    messages[0]["content"].append(new_image)
Instead of the {"type": "image", ...} parts, pass each image as an OpenAI-style {"type": "image_url", "image_url": {"url": ...}} part. |
Alright. I will try that and report here but I think #9128 (comment) is misleading then. I did follow the exact "messages" formatting. |
Sorry, I missed the slight difference in the image format. Hope that everything is cleared up now! |
With the following:
def encode_image(image):
    buffered = io.BytesIO()
    image.save(buffered, format="JPEG")
    image_bytes = buffered.getvalue()
    return base64.b64encode(image_bytes).decode("utf-8")

num_images = 3
messages = [{"role": "user", "content": []}]
for _ in range(num_images):
    base64_image = encode_image(Image.new("RGB", (224, 224)))
    new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
    messages[0]["content"].append(new_image)
messages[0]["content"].append({"type": "text", "text": "Describe this image."})
I get:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/sayak/diffusers/check_video_vllm.py", line 41, in <module>
[rank0]: outputs = vllm_engine.chat(messages, sampling_params)
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/llm.py", line 556, in chat
[rank0]: conversation, mm_data = parse_chat_messages(
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/chat_utils.py", line 487, in parse_chat_messages
[rank0]: sub_messages = _parse_chat_message_content(msg, mm_tracker)
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/chat_utils.py", line 440, in _parse_chat_message_content
[rank0]: result = _parse_chat_message_content_parts(
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/chat_utils.py", line 392, in _parse_chat_message_content_parts
[rank0]: mm_parser.parse_image(image_url["url"])
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/chat_utils.py", line 276, in parse_image
[rank0]: placeholder = self._tracker.add("image", image)
[rank0]: File "/home/sayak/vllm/vllm/entrypoints/chat_utils.py", line 205, in add
[rank0]: raise ValueError(
[rank0]: ValueError: At most 1 image(s) may be provided in one request.
What am I missing out on? |
Please check our docs on using VLMs. There is a section on how to input multiple images per prompt. |
I am following https://docs.vllm.ai/en/stable/models/vlm.html and I am still not sure what I am missing here. A bit more specificity in your replies would be appreciated. |
You need to set |
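Judging from the working script in the next reply, the collapsed setting referred to here is presumably the limit_mm_per_prompt engine argument; a minimal sketch of just that change:
from vllm import LLM
# Raise the default per-prompt image limit (1) so a multi-image request is accepted.
vllm_engine = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})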
Thanks! This is my full example and it works:
from vllm import LLM, SamplingParams
from decord import VideoReader, cpu
from PIL import Image
import base64
import io
from huggingface_hub import hf_hub_download

NUM_MAX_FRAMES = 4

def encode_image(image):
    buffered = io.BytesIO()
    image.save(buffered, format="JPEG")
    image_bytes = buffered.getvalue()
    return base64.b64encode(image_bytes).decode("utf-8")

def load_video(num_max_frames=4):
    video_filepath = hf_hub_download(
        repo_id="huggingface/documentation-images", repo_type="dataset", filename="diffusers/hiker.mp4"
    )
    vr = VideoReader(video_filepath, ctx=cpu(0))
    video_frames = [Image.fromarray(vr[i].asnumpy()) for i in range(len(vr))][:num_max_frames]
    return video_frames

if __name__ == "__main__":
    # Use a limit of 4 frames.
    vllm_engine = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": NUM_MAX_FRAMES})
    sampling_params = SamplingParams(max_tokens=120)
    # Video.
    video_frames = load_video(num_max_frames=NUM_MAX_FRAMES)
    messages = [{"role": "user", "content": []}]
    messages[0]["content"].append({"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."})
    for i in range(len(video_frames)):
        base64_image = encode_image(video_frames[i])
        new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        messages[0]["content"].append(new_image)
    outputs = vllm_engine.chat(messages, sampling_params)
    with open("qwen.txt", "w") as f:
        print(outputs[0].outputs[0].text, file=f)
The script shows inference on videos. Do you think it could be made a part of the docs? |
Feel free to open a PR. Do note, however, that this case is still technically multi-image input (we have a different API for single-video and multi-video input, and it is currently limited to offline inference only, unlike the multi-image case). |
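For contrast, a rough sketch of the dedicated single-video offline path mentioned above. This is an assumption-laden example, not code from this thread: it presumes the model's chat template handles a "video" content part and that multi_modal_data accepts a "video" frame array for this model in the installed vLLM version.
import numpy as np
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
messages = [{"role": "user", "content": [{"type": "video"}, {"type": "text", "text": "Describe this video."}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Dummy clip: 4 frames of 224x224 RGB; in practice these would come from a decoder such as decord.
video = np.stack([np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(4)])

llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"video": 1})
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(max_tokens=120),
)
print(outputs[0].outputs[0].text)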
Sure! Do you have a link to the documentation where you think this would be most suitable? I think we could include this example in https://docs.vllm.ai/en/latest/models/vlm.html#multi-image-input itself. WDYT?
Do you have a link for me to look further? |
Yes, that sounds good.
See #7558 |
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
Trying to run:
Leads to: