[Bug] RuntimeError: shape '[-1, 0]' is invalid for input of size #405

Closed
dgedanke opened this issue Jul 24, 2024 · 4 comments

@dgedanke

dgedanke commented Jul 24, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

When running video inference with the official Quick Start demo from https://www.modelscope.cn/models/OpenGVLab/InternVL2-4B, an error is raised. Here is the code:

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


path = '/data/swift/models/InternVL2-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`

generation_config = dict(
    num_beams=1,
    max_new_tokens=1024,
    do_sample=False,
)


# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list


video_path = '/data/swift/dataset/videos/videoTest/3f403ed59a3ffc0612d75ae9b8f109f4_raw.mp4'

# pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'Describe this video in detail'
# Frame1: <image>\nFrame2: <image>\n...\nFrame31: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# question = 'Describe this video in detail. Don\'t repeat.'
# response, history = model.chat(tokenizer, pixel_values, question, generation_config,
#                                num_patches_list=num_patches_list,
#                                history=history, return_history=True)
# print(f'User: {question}')
# print(f'Assistant: {response}')

The only change I made was the file paths.

Reproduction

The following error appears:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 12.35it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Traceback (most recent call last):
  File "/data/swift/model_test/4b/tmp.py", line 147, in <module>
    response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-4B/modeling_internvl_chat.py", line 284, in chat
    generation_output = self.generate(
                        ^^^^^^^^^^^^^^
  File "/data/anaconda3/envs/lmdeploy/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-4B/modeling_internvl_chat.py", line 334, in generate
    outputs = self.language_model.generate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda3/envs/lmdeploy/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda3/envs/lmdeploy/lib/python3.11/site-packages/transformers/generation/utils.py", line 1914, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/data/anaconda3/envs/lmdeploy/lib/python3.11/site-packages/transformers/generation/utils.py", line 2651, in _sample
    outputs = self(
              ^^^^^
  File "/data/anaconda3/envs/lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda3/envs/lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-4B/modeling_phi3.py", line 1290, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/data/anaconda3/envs/lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda3/envs/lmdeploy/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-4B/modeling_phi3.py", line 1119, in forward
    position_ids = position_ids.view(-1, seq_length).long()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[-1, 0]' is invalid for input of size 2180

Switching to a different video does not help either.

Environment

Output of check_env:

```sh

sys.platform: linux
Python: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.7  (built against CUDA 12.2)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.0+
transformers: 4.42.4
gradio: 3.43.0
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
```
@czczup
Member

czczup commented Jul 24, 2024

Phi-3 4B throws this error because the transformers version is too new; it seems some interface has changed. It runs fine with transformers 4.37.2. I have some fairly urgent work to handle over the next couple of days, so it may take a few days before I can add compatibility with the newer versions. (In principle Phi3 can now be imported directly from transformers, since as far as I remember it has been merged upstream, so a check on the version number is probably needed here.)

(If you are willing, we would welcome a PR for this.)
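A minimal sketch of the version check described above, assuming the goal is to use the Phi-3 implementation bundled with newer transformers releases and to fall back to the modeling_phi3.py shipped with the checkpoint otherwise; the 4.41.0 threshold and the fallback import path are assumptions, not the actual fix:

```python
# Sketch only: gate the Phi-3 import on the installed transformers version.
# The 4.41.0 threshold and the fallback path below are assumptions.
import transformers
from packaging import version

if version.parse(transformers.__version__) >= version.parse("4.41.0"):
    # Newer transformers already bundle Phi-3, so import it directly.
    from transformers import Phi3ForCausalLM
else:
    # Older transformers: use the modeling_phi3.py distributed with the checkpoint.
    from .modeling_phi3 import Phi3ForCausalLM
```

Until such a check lands, the workaround mentioned above is to pin transformers to 4.37.2.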

@dgedanke
Author


Got it, thank you very much!

@maxaitel

Any work being done on this issue?

@hhaAndroid

@maxaitel In https://huggingface.co/OpenGVLab/InternVL2-4B/blob/main/modeling_phi3.py#L1373, change

if inputs_embeds is not None and past_key_values is None:
    model_inputs = {"inputs_embeds": inputs_embeds}
else:

to

if (inputs_embeds is not None and past_key_values is None) or (inputs_embeds is not None and len(past_key_values)==0):
    model_inputs = {"inputs_embeds": inputs_embeds}
else:
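This appears to work because newer transformers releases (such as the 4.42.4 in the environment above) pass `past_key_values` into `prepare_inputs_for_generation` as an empty cache object instead of `None` on the first generation step. The original `past_key_values is None` check then falls through to the `input_ids` branch even though only `inputs_embeds` were provided, the sequence length becomes 0, and `position_ids.view(-1, seq_length)` fails with the `shape '[-1, 0]'` error. Treating an empty cache the same as a missing one restores the `inputs_embeds` path on the first step.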
