
[Bug] Qwen2-VL uses too much GPU memory, causing OOM #2565

Open
cmpute opened this issue Oct 9, 2024 · 8 comments
Comments

@cmpute
Contributor

cmpute commented Oct 9, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Qwen2-VL 7B should, in principle, fit within 80 GB of GPU memory, but inference OOMs during actual deployment.

Reproduction

lmdeploy serve api_server ../Qwen2-VL-7B-Instruct --server-port 12345

Environment

sys.platform: linux
Python: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.7  (built against CUDA 12.2)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.6.1+
transformers: 4.45.2
gradio: 4.41.0
fastapi: 0.112.1
pydantic: 2.8.2
triton: 2.2.0

Error traceback

2024-10-09 04:07:10,136 - lmdeploy - WARNING - archs.py:53 - Fallback to pytorch engine because `../Qwen2-VL-7B-Instruct` not supported by turbomind engine.
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
HINT:    Please open http://0.0.0.0:12345 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:12345 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:12345 in a browser for detailed api usage!!!
INFO:     Started server process [6550]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12345 (Press CTRL+C to quit)
INFO:     127.0.0.1:35220 - "GET /v1/models HTTP/1.1" 200 OK
Exception in callback _raise_exception_on_finish(<Future finis...-variables)')>) at /home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py:20
handle: <Handle _raise_exception_on_finish(<Future finis...-variables)')>) at /home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py:20>
Traceback (most recent call last):
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 27, in _raise_exception_on_finish
    raise e
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 23, in _raise_exception_on_finish
    task.result()
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 169, in forward
    outputs = self.model.forward(*func_inputs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/model/qwen2.py", line 102, in forward
    image_embeds = self.model.visual(pixel_values,
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1128, in forward
    hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 431, in forward
    hidden_states = hidden_states + self.attn(
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 404, in forward
    attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 53.06 GiB. GPU 0 has a total capacity of 79.35 GiB of which 12.84 GiB is free. Process 89800 has 66.50 GiB memory in use. Of the allocated memory 65.56 GiB is allocated by PyTorch, and 421.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
@jianliao

@cmpute What is the resolution of your images? Switching to smaller images might work.

I checked Qwen's documentation on HF, which explicitly mentions that the model can automatically resize images through two preset configuration parameters. How can such launch parameters be passed in lmdeploy? My current launch script is as follows:

lmdeploy serve api_server --backend pytorch Qwen/Qwen2-VL-2B-Instruct
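
For reference, the two preset parameters mentioned above are min_pixels and max_pixels on the Qwen2-VL image processor. A minimal sketch of how they are passed with plain transformers, using the illustrative values from the HF model card rather than anything lmdeploy-specific:

from transformers import AutoProcessor

# Pixel budgets are expressed in units of 28x28 patches.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
# Images run through this processor are resized to stay within the pixel budget.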

@cmpute
Contributor Author

cmpute commented Oct 12, 2024

> @cmpute What is the resolution of your images? Switching to smaller images might work.
>
> I checked Qwen's documentation on HF, which explicitly mentions that the model can automatically resize images through two preset configuration parameters. How can such launch parameters be passed in lmdeploy? My current launch script is as follows:
>
> lmdeploy serve api_server --backend pytorch Qwen/Qwen2-VL-2B-Instruct

The images are fairly large, so that is indeed possible. I will try it when I have time.

@irexyc
Collaborator

irexyc commented Oct 17, 2024

@jianliao

The examples in the documentation above that specify min_pixels/max_pixels and resized_height/resized_width are pipeline-based. How should these be set in serve mode?

@Titan-p

Titan-p commented Oct 18, 2024

> The examples in the documentation above that specify min_pixels/max_pixels and resized_height/resized_width are pipeline-based. How should these be set in serve mode?

Add the following to the request:

{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "请描述下图片" },
        { "type": "image_url", "image_url": { "max_pixels": "1000000", "url": IMAGE_URL } }
      ]
    }
  ]
}

@jianliao

@Titan-p I tried it and it does work, but it is inconvenient to use: I am not sure whether I would have to extend the client chat UI application so that it silently adds this pixel-limit attribute to every request that carries an image.

Is there a way to configure this directly on the server side?

@irexyc
Collaborator

irexyc commented Oct 25, 2024

@jianliao

In the pipeline example, the message format is already the OpenAI format; when using the server you can simply pass the same message.

Do you mean setting a global minimum/maximum pixel count on the server side? There is no such feature at the moment; it can only be controlled by modifying the code. The relevant location is here; you can add a line below it, for example:

if 'max_pixels' not in item:
    item.update(dict(max_pixels=64 * 28 * 28))
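
As a rough sense of scale (illustrative arithmetic only, assuming Qwen2-VL's 28x28-pixel patch unit), this default is a fairly aggressive cap:

print(64 * 28 * 28)    # 50176   -> about 0.05 megapixels
print(1280 * 28 * 28)  # 1003520 -> about 1 megapixel, closer to the HF example values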

@Wiselnn570

May I ask whether this repository currently supports video input for qwen2-vl? @irexyc @jianliao @Titan-p
