[Bug]: Incoherent Offline Inference Single Video with Qwen2-VL #9723

Open
1 task done
hector-gr opened this issue Oct 26, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@hector-gr

Your current environment

The output of `python collect_env.py`
Collecting environment information...                                                           
PyTorch version: 2.4.0+cu121                                                                    
Is debug build: False                                                                           
CUDA used to build PyTorch: 12.1                                                                
ROCM used to build PyTorch: N/A                                                                 
                                                                                                
OS: Ubuntu 20.04.6 LTS (x86_64)                                                                 
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0                                              
Clang version: Could not collect                                                                
CMake version: version 3.22.4                                                                   
Libc version: glibc-2.31                                                                        
                                                                                                
Python version: 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0] (64-bit runtime)             
Python platform: Linux-4.18.0-513.5.1.el8_9.x86_64-x86_64-with-glibc2.31                        
Is CUDA available: True                                                                         
CUDA runtime version: 11.8.89                                                                   
CUDA_MODULE_LOADING set to: LAZY                                                                
GPU models and configuration:                                                                   
GPU 0: NVIDIA A100-SXM4-80GB                                                                    
GPU 1: NVIDIA A100-SXM4-80GB                                                                    
GPU 2: NVIDIA A100-SXM4-80GB                                                                    
GPU 3: NVIDIA A100-SXM4-80GB                                                                    
                                                                                                
Nvidia driver version: 555.42.06                                                                
cuDNN version: Probably one of the following:                                                   
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6                                                     
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6                                           
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6                                           
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6                                           
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6                                           
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6                                           
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6                                           
HIP runtime version: N/A                                                                        
MIOpen runtime version: N/A                                                                     
Is XNNPACK available: True                                                                      
                                                                                                
CPU:                                                                                            
Architecture:                       x86_64                                                      
CPU op-mode(s):                     32-bit, 64-bit                                              
Byte Order:                         Little Endian                                               
Address sizes:                      48 bits physical, 48 bits virtual                           
CPU(s):                             32                                                          
On-line CPU(s) list:                0-31                                                        
Thread(s) per core:                 1                                                           
Core(s) per socket:                 16                                                          
Socket(s):                          2                                                           
NUMA node(s):                       8                                                           
Vendor ID:                          AuthenticAMD                                                
CPU family:                         25                                                          
Model:                              1                                                           
Model name:                         AMD EPYC 7313 16-Core Processor                             
Stepping:                           1
CPU MHz:                            3517.887
BogoMIPS:                           5988.81
Virtualization:                     AMD-V
L1d cache:                          1 MiB
L1i cache:                          1 MiB
L2 cache:                           16 MiB
L3 cache:                           256 MiB
NUMA node0 CPU(s):                  0-3
NUMA node1 CPU(s):                  4-7
NUMA node2 CPU(s):                  8-11
NUMA node3 CPU(s):                  12-15
NUMA node4 CPU(s):                  16-19
NUMA node5 CPU(s):                  20-23
NUMA node6 CPU(s):                  24-27
NUMA node7 CPU(s):                  28-31
Vulnerability Gather data sampling: Not affected 
Vulnerability Itlb multihit:        Not affected 
Vulnerability L1tf:                 Not affected 
Vulnerability Mds:                  Not affected 
Vulnerability Meltdown:             Not affected 
Vulnerability Mmio stale data:      Not affected 
Vulnerability Retbleed:             Not affected 
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0+cu121
[pip3] torchvision==0.19.0+cu121
[pip3] transformers==4.46.0.dev0
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.77                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pyzmq                     26.2.0          py311h7deb3e3_3    conda-forge
[conda] torch                     2.4.0+cu121              pypi_0    pypi
[conda] torchvision               0.19.0+cu121             pypi_0    pypi
[conda] transformers              4.46.0.dev0              pypi_0    pypi
[conda] transformers-stream-generator 0.0.4                    pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    SYS     PXB     SYS     SYS     12-15           3               N/A
GPU1    NV12     X      NV12    NV12    PXB     SYS     SYS     SYS     4-7             1               N/A
GPU2    NV12    NV12     X      NV12    PXB     SYS     SYS     SYS     4-7             1               N/A
GPU3    NV12    NV12    NV12     X      SYS     SYS     SYS     PXB     28-31           7               N/A
NIC0    SYS     PXB     PXB     SYS      X      SYS     SYS     SYS
NIC1    PXB     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC3    SYS     SYS     SYS     PXB     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
  
  
NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3

Model Input Dumps

No response

🐛 Describe the bug

I get incoherent generation outputs when using offline vLLM inference with videos. This happens with both URLs and local paths, with the 7B and 72B models, and with or without tensor parallelism. The same setup works well (gives coherent answers) with text-only or text+image input, but not with video. The outputs are also very different from those generated by transformers with the same arguments.

The code below follows the example in the Qwen repo (https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#inference-locally), which also seems to be what the vLLM docs recommend.

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
    # tensor_parallel_size=4,
    tensor_parallel_size=1,
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://ptchallenge-workshop.github.io/media/vis.mp4",
                "min_pixels": 224 * 224,
                # "max_pixels": 1280 * 28 * 28,
                "total_pixels": 16384 * 28 * 28,
                "fps": 2.0,
            },
            {"type": "text", "text": "Describe the video."},
        ],
    },
]
# For video input, you can pass following values instead:
# "type": "video",
# "video": "<video URL>",

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print("#"*50 + "\n" + "Qwen repo Video url output with total_pixels:", generated_text)

with output:

INFO 10-26 20:24:11 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='Qwen/Qwen2-VL-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2-VL-7B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
[rank0]:[W1026 20:24:28.958384079 ProcessGroupGloo.cpp:712] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 10-26 20:24:28 model_runner.py:1056] Starting to load model Qwen/Qwen2-VL-7B-Instruct...
INFO 10-26 20:24:30 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:07,  1.96s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:06<00:09,  3.27s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:10<00:07,  3.56s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:13<00:03,  3.64s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:24<00:00,  6.19s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:24<00:00,  4.90s/it]

INFO 10-26 20:24:56 model_runner.py:1067] Loading model weights took 15.5083 GB
INFO 10-26 20:25:06 gpu_executor.py:122] # GPU blocks: 56587, # CPU blocks: 4681
INFO 10-26 20:25:06 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 27.63x
INFO 10-26 20:25:09 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-26 20:25:09 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-26 20:25:34 model_runner.py:1523] Graph capturing finished in 25 secs.
Processed prompts: 100%|█| 1/1 [00:05<00:00,  5.39s/it, est. speed input: 2541.48 toks/s, output
##################################################
Qwen repo Video url output with total_pixels: The: in helpful photo helpful image: Image helpful helpful

For transformers, the code is the default shown in the Qwen repo, which is very similar. I checked other issues and commits, and from my understanding this feature is supported and the differences between the two implementations seem to be minimal (#8408 (comment)).

hector-gr added the bug label Oct 26, 2024
@DarkLight1337
Member

@alex-jw-brooks can you add this model to your test suite to check whether the current model implementation is ok?

@SinanAkkoyun

@hector-gr Did you manage to get coherent single image inference generation or do you also experience the same issue (#9732) there?

@hector-gr
Author

hector-gr commented Oct 29, 2024

Single image inference works fine in this setup. Note that I only tried a few small images, so it might be related to your issue.

@bhavyajoshi-mahindra

bhavyajoshi-mahindra commented Oct 29, 2024

Did you install vLLM from pip or from source?
Please share details on how you installed it.
I am trying to run inference with my custom Qwen2-VL GPTQ 4-bit model using vLLM.
Thanks a lot

@SinanAkkoyun

@hector-gr Can you please try a 4k or 5120x1440 image? :)

@SinanAkkoyun

@bhavyajoshi-mahindra Do you experience the same issue as me and need to run large images or is your model even incoherent for small images?

I simply installed it via pip and installed the Qwen utils. However, if you are just interested in quickly deploying your model, definitely take a look at the Qwen2-VL repo. It gives you the commands to make it work.

@bhavyajoshi-mahindra

bhavyajoshi-mahindra commented Oct 29, 2024

I went through the Qwen2-VL repo and tried exactly what is described there, but I got this error:

No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
You are using a model of type qwen2_vl to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-1-06a68f93c72a> in <cell line: 7>()
      5 MODEL_PATH = "/content/drive/MyDrive/LLM/vinplate2-gwen2-vl-gptq-4bit"
      6 
----> 7 llm = LLM(
      8     model=MODEL_PATH,
      9     limit_mm_per_prompt={"image": 10, "video": 10},

5 frames
/usr/local/lib/python3.10/dist-packages/vllm/config.py in _get_and_verify_max_len(hf_config, max_model_len, disable_sliding_window, sliding_window_len, spec_target_max_model_len)
   1762                 scaling_factor = 1
   1763             else:
-> 1764                 assert "factor" in rope_scaling
   1765                 scaling_factor = rope_scaling["factor"]
   1766             if rope_type == "yarn":

AssertionError: 

That's why I want to know which versions of vLLM and transformers to use, and how to install them (from pip or from source), in order to run inference with my custom Qwen2-VL GPTQ 4-bit model on a single image.
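
The assertion in the traceback fires when that branch is reached and the model config's `rope_scaling` entry has no `factor` key. A minimal sketch for checking what a local checkpoint's `config.json` contains, assuming the usual Hugging Face checkpoint layout. The helper names `inspect_rope_scaling` and `inspect_checkpoint` and the sample dictionary are illustrative only and are not part of vLLM or transformers:

```python
import json
from pathlib import Path


def inspect_rope_scaling(config: dict) -> tuple:
    """Return (rope_type, has_factor) for a Hugging Face style config dict."""
    rope_scaling = config.get("rope_scaling")
    if not rope_scaling:
        return (None, False)
    # Configs have used both "rope_type" and "type" as the key name.
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    return (rope_type, "factor" in rope_scaling)


def inspect_checkpoint(model_dir: str) -> tuple:
    """Read config.json from a local checkpoint directory and inspect it."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    return inspect_rope_scaling(config)


# Illustrative config fragment (placeholder values, not a real checkpoint).
sample = {"rope_scaling": {"type": "mrope", "mrope_section": [16, 24, 24]}}
print(inspect_rope_scaling(sample))
```

Running `inspect_checkpoint` on the custom model directory shows which rope type the installed vLLM will see, which may help when comparing against a version pair known to work.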

@DarkLight1337
Member

You should use either vLLM v0.6.1 with transformers v4.44, or vLLM v0.6.3 with transformers v4.45+.
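
A small sketch for checking an environment against the two combinations suggested here. It uses only the standard library; `release_tuple` and `matches_suggested_pair` are hypothetical helper names, and treating "v4.44" as any 4.44.x patch release is an assumption:

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Leading numeric part of a version string, e.g. '0.6.3.post1' -> (0, 6, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def matches_suggested_pair(vllm_version: str, transformers_version: str) -> bool:
    """True only for the two combinations named in the comment above."""
    vllm = release_tuple(vllm_version)[:3]
    hf = release_tuple(transformers_version)[:2]
    if vllm == (0, 6, 1):
        return hf == (4, 44)  # assumption: any 4.44.x patch release
    if vllm == (0, 6, 3):
        return hf >= (4, 45)  # "v4.45+"
    return False  # other pairs are simply not covered by the suggestion


if __name__ == "__main__":
    try:
        installed = (metadata.version("vllm"), metadata.version("transformers"))
    except metadata.PackageNotFoundError as exc:
        print(f"package not installed: {exc}")
    else:
        print(installed, matches_suggested_pair(*installed))
```

The environment in the original report (vLLM 0.6.3.post1 with transformers 4.46.0.dev0) passes this check and still produced incoherent video output, so a matching pair rules out one class of errors rather than guaranteeing correct results.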

@hector-gr
Author

hector-gr commented Oct 29, 2024

@hector-gr Can you please try a 4k or 5120x1440 image? :)

I used a 5472 × 3648 jpg from https://www.pexels.com/photo/brown-and-green-mountain-view-photo-842711/ and it seems to work fine. I also tried with a different 5120x1400 jpg and it worked.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": os.path.expanduser("~/workspace/pexels-christian-heitz-285904-842711.jpg"),
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Describe the image."},
        ],
    },
]

with output

Qwen repo Image (5472 × 3648) local output: The image depicts a picturesque rural landscape during what appears to be either sunrise or sunset, given the warm hues of orange and yellow that dominate the sky. The scene is characterized by rolling hills covered in lush green vineyards, which are meticulously arranged in terraced fields. The vineyards extend across the landscape, creating a patchwork of green and brown patches as some areas are still in the process of being harvested.
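
One possible reason the large image behaves well is that `max_pixels` caps the resolution before the model sees it. A rough sketch of that kind of resize, assuming one visual token per 28 × 28 pixel block (which the `28 * 28` factors in the snippet suggest). `fit_to_pixel_budget` is an illustrative helper written for this note, not the `qwen_vl_utils` implementation:

```python
import math


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=28):
    """Scale (height, width) into [min_pixels, max_pixels], keeping both sides multiples of `factor`."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink both sides by the same ratio, rounding down to stay under the cap.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Grow both sides by the same ratio, rounding up to reach the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


# The 5472 x 3648 test image with the limits used in the message above.
h, w = fit_to_pixel_budget(3648, 5472, min_pixels=224 * 224, max_pixels=1280 * 28 * 28)
tokens = (h // 28) * (w // 28)
print(h, w, tokens)
```

Under these assumptions the image is reduced to 812 × 1204 pixels, about 1,247 visual tokens, so a 4k input and a much smaller one can end up with a similar prompt length.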

@hector-gr
Author

Note that downgrading to

vllm                      0.6.1                    pypi_0    pypi
vllm-flash-attn           2.6.1                    pypi_0    pypi
transformers              4.45.0.dev0              pypi_0    pypi

allows for coherent generation with video input.

@SinanAkkoyun

@hector-gr Thank you for testing! :) May I ask you how you installed vLLM? I am assuming your test ran on 0.6.3?

And could you please test the same 4k and 5120x1440 images with the OpenAI API endpoint (with the latest vLLM)?

And may I ask how long the 4k image took for you to process? In my testing small images process in 200ms but 4k ones take several seconds

@bhavyajoshi-mahindra

bhavyajoshi-mahindra commented Oct 29, 2024

Note downgrading to

vllm                      0.6.1                    pypi_0    pypi
vllm-flash-attn           2.6.1                    pypi_0    pypi
transformers              4.45.0.dev0              pypi_0    pypi

allows for coherent generation after the video input.

Can you please mention the CUDA, Torch and Python versions as well?
Also, I am working on Windows. Can you tell me exactly how to install vLLM?
Thanks a lot

@Wiselnn570

@hector-gr Seemingly

I tried to reproduce your results and ran into the same problem. Have you solved it yet? Could it be related to having too many image tokens?
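
To get a feel for the token question, here is a back-of-the-envelope estimate. It assumes one visual token per 28 × 28 pixel block and that frames are merged in pairs along the time axis. Both are assumptions about the model's preprocessing, and `estimate_video_tokens` and `fits_context` are hypothetical helpers, not vLLM or Qwen APIs:

```python
def estimate_video_tokens(num_frames, height, width, pixels_per_side=28, frames_per_step=2):
    """Rough visual token count for a video, under the stated assumptions."""
    steps = max(1, num_frames // frames_per_step)
    return steps * (height // pixels_per_side) * (width // pixels_per_side)


def fits_context(num_tokens, max_seq_len=32768):
    """Compare against the max_seq_len value reported in the engine log."""
    return num_tokens <= max_seq_len


# Hypothetical clip: 16 sampled frames at 448 x 448 after resizing.
tokens = estimate_video_tokens(16, 448, 448)
print(tokens, fits_context(tokens))
```

Plugging in the actual frame count and resized resolution of the test video gives a quick sanity check on whether the prompt is anywhere near the context limit before looking for other causes.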

@bhavyajoshi-mahindra

I ended up with vLLM 0.6.3, transformers 4.46.1, torch 2.4.0, CUDA 12.1 and Python 3.10.
I then ran the inference code as given, with my custom Qwen2-VL model:

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen2-VL"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/content/drive/MyDrive/LLM/test/Vin_2023-12-22_14-47-37.jpg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text":
                                    '''
                                    Please extract the Vehicle Sr No, Engine No, and Model from this image.
                                    Response only json format nothing else.
                                    Analyze the font and double check for similar letters such as "V":"U", "8":"S":"0", "R":"P".
                                    '''
             },
        ],
    },
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

I got this error:

WARNING 10-30 12:06:32 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
You are using a model of type qwen2_vl to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
ERROR 10-30 12:06:37 registry.py:264] Error in inspecting model architecture 'Qwen2VLForConditionalGeneration'
ERROR 10-30 12:06:37 registry.py:264] Traceback (most recent call last):
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 426, in _run_in_subprocess
ERROR 10-30 12:06:37 registry.py:264]     returned.check_returncode()
ERROR 10-30 12:06:37 registry.py:264]   File "C:\Users\bhavy\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 456, in check_returncode
ERROR 10-30 12:06:37 registry.py:264]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 10-30 12:06:37 registry.py:264] subprocess.CalledProcessError: Command '['F:\\Mahindra\\LLM\\myenv\\Scripts\\python.exe', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 10-30 12:06:37 registry.py:264] 
ERROR 10-30 12:06:37 registry.py:264] The above exception was the direct cause of the following exception:
ERROR 10-30 12:06:37 registry.py:264]
ERROR 10-30 12:06:37 registry.py:264] Traceback (most recent call last):
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 262, in _try_inspect_model_cls        
ERROR 10-30 12:06:37 registry.py:264]     return model.inspect_model_cls()
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 224, in inspect_model_cls
ERROR 10-30 12:06:37 registry.py:264]     return _run_in_subprocess(
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 429, in _run_in_subprocess
ERROR 10-30 12:06:37 registry.py:264]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 10-30 12:06:37 registry.py:264] RuntimeError: Error raised in subprocess:
ERROR 10-30 12:06:37 registry.py:264] C:\Users\bhavy\AppData\Local\Programs\Python\Python310\lib\runpy.py:126: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 10-30 12:06:37 registry.py:264]   warn(RuntimeWarning(msg))
ERROR 10-30 12:06:37 registry.py:264] Traceback (most recent call last):
ERROR 10-30 12:06:37 registry.py:264]   File "C:\Users\bhavy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main   
ERROR 10-30 12:06:37 registry.py:264]     return _run_code(code, main_globals, None,
ERROR 10-30 12:06:37 registry.py:264]   File "C:\Users\bhavy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
ERROR 10-30 12:06:37 registry.py:264]     exec(code, run_globals)
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 450, in <module>
ERROR 10-30 12:06:37 registry.py:264]     _run()
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 445, in _run
ERROR 10-30 12:06:37 registry.py:264]     with open(output_file, "wb") as f:
ERROR 10-30 12:06:37 registry.py:264] PermissionError: [Errno 13] Permission denied: 'C:\\Users\\bhavy\\AppData\\Local\\Temp\\tmpjxi5mk75'
ERROR 10-30 12:06:37 registry.py:264]
Traceback (most recent call last):
  File "F:\Mahindra\LLM\vllm\qwen2-vl-vllm-infer.py", line 7, in <module>
    llm = LLM(
  File "F:\Mahindra\LLM\vllm\vllm\entrypoints\llm.py", line 177, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "F:\Mahindra\LLM\vllm\vllm\engine\llm_engine.py", line 571, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "F:\Mahindra\LLM\vllm\vllm\engine\arg_utils.py", line 900, in create_engine_config
    model_config = self.create_model_config()
  File "F:\Mahindra\LLM\vllm\vllm\engine\arg_utils.py", line 837, in create_model_config
    return ModelConfig(
  File "F:\Mahindra\LLM\vllm\vllm\config.py", line 194, in __init__
    self.multimodal_config = self._init_multimodal_config(
  File "F:\Mahindra\LLM\vllm\vllm\config.py", line 213, in _init_multimodal_config
    if ModelRegistry.is_multimodal_model(architectures):
  File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 384, in is_multimodal_model
    return self.inspect_model_cls(architectures).supports_multimodal
  File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 353, in inspect_model_cls
    return self._raise_for_unsupported(architectures)
  File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 314, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen2VLForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'MistralModel', 'Qwen2ForRewardModel', 'Gemma2Model', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel']

Note: "Qwen2VLForConditionalGeneration" is in the list of supported architectures, but I still got the error.

@hector-gr can you help me with this?
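The innermost failure in the traceback above is `open(output_file, "wb")` raising `PermissionError` on a path under the Windows temp directory, inside a subprocess. One possible explanation (an assumption, not confirmed in this thread) is that the parent process still holds the temporary file open: on Windows, a file created with `tempfile.NamedTemporaryFile` generally cannot be opened a second time by name while the first handle is still open. The sketch below shows a portable pattern for handing an output path to a child process; the child program and the payload are hypothetical and are not vLLM's actual code.

```python
import pickle
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical child program: writes a pickled payload to the path in argv[1].
CHILD_CODE = (
    "import pickle, sys\n"
    "with open(sys.argv[1], 'wb') as f:\n"
    "    pickle.dump({'supports_multimodal': True}, f)\n"
)


def run_child_and_read_result():
    # A temporary directory holds no open handle on the output file, so the
    # child process can create and write it on both Windows and Unix.
    with tempfile.TemporaryDirectory() as tmpdir:
        output_path = Path(tmpdir) / "result.pkl"
        subprocess.run(
            [sys.executable, "-c", CHILD_CODE, str(output_path)],
            check=True,
        )
        # Read the result back before the directory is cleaned up.
        with open(output_path, "rb") as f:
            return pickle.load(f)


if __name__ == "__main__":
    print(run_child_and_read_result())
```

If this reading is right, the "not supported" `ValueError` would be a downstream symptom of the failed inspection subprocess rather than a real lack of support for the architecture.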

@bhavyajoshi-mahindra

Created a new issue: #9832

@hector-gr
Author

@hector-gr Thank you for testing! :) May I ask you how you installed vLLM? I am assuming your test ran on 0.6.3?

And could you please test the same 4k and 5120x1440 images with the OpenAI API endpoint? (with latest vLLM)

The OpenAI API endpoint works correctly with those two images (base64 encoded).
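For readers who want to repeat that check, here is a minimal sketch of how a base64-encoded image is usually packed into an OpenAI-style chat message as a data URL. The helper name and the default MIME type are assumptions for illustration; sending the message to a running server is left out.

```python
import base64


def build_image_message(image_bytes, prompt, mime_type="image/jpeg"):
    """Build one user message carrying an inline base64 image plus text."""
    # Encode the raw bytes and wrap them in a data URL.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
            {"type": "text", "text": prompt},
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real image file.
    message = build_image_message(b"abc", "Describe this image.")
    print(message["content"][0]["image_url"]["url"])
```

The resulting dictionary can be placed in the `messages` list of a chat completion request against an OpenAI-compatible endpoint.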

@DarkLight1337
Member

Is this issue solved now?
