💡 [REQUEST] - MiniCPM-V-2_6 cannot be loaded on a single 48G GPU #392
Comments
I ran into this problem as well: loading the model blows up GPU memory badly, and the out-of-memory error always points at GPU 0.
I tested on a single A100-80G. When loading with vLLM, memory first climbs to 16GB (reading the model weights); at some point after loading finishes it spikes to a 29GB peak, then drops back to 19GB. The cause is unclear.
Found a workaround: vLLM's max-num-seqs defaults to 256, which causes very high memory consumption during the initialization/startup phase. Lowering it to 32 and setting gpu-memory-utilization to 1 lets the model run successfully on a single 3090. See the sketch below.
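For reference, here is a minimal sketch of those settings using vLLM's offline Python API. The parameter values mirror the comment above; the Hugging Face model ID, the prompt, and the optional max_model_len value are assumptions, and real multimodal use would attach images through vLLM's multi-modal input path rather than a plain text prompt.

```python
# Sketch: shrink the startup profiling footprint by lowering max_num_seqs
# and let vLLM use the full GPU memory budget.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openbmb/MiniCPM-V-2_6",  # assumed Hugging Face model ID
    trust_remote_code=True,         # MiniCPM-V ships custom modeling code
    max_num_seqs=32,                # default is 256; large values inflate the init-time dummy run
    gpu_memory_utilization=1.0,     # allow vLLM to use (almost) the whole card
    # max_model_len=4096,           # optional; per the thread, lowering this alone is not enough
)

# Plain text prompt just to confirm the engine starts.
outputs = llm.generate(
    ["Describe the image."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The same two settings map to the CLI flags --max-num-seqs and --gpu-memory-utilization when serving the model instead of running it offline.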
When vLLM initializes, it does a dry run with some dummy data to profile memory; for the image inputs it uses …
When I tested, simply lowering max_model_len was not enough; even at 3072 it still would not run.
Start Date
No response
Implementation PR
No response
Reference Issues
No response
Summary
An out-of-memory error is reported when loading the model.
Basic Example
Drawbacks
Unresolved questions
No response