
💡 [REQUEST] - MiniCPM-V-2_6 fails to load on a single 48G GPU #392

Closed
dionren opened this issue Aug 6, 2024 · 5 comments
Labels
question Further information is requested

Comments


dionren commented Aug 6, 2024

Start Date

No response

Implementation PR

No response

Reference Issues

No response

Summary

Reports a GPU out-of-memory error.

Basic Example

docker run -d --ipc host \
  --gpus '"device=1"' \
  -v /mnt/cpn-pod/b11d5292-85ab-e9a4-7eca-31614bb76c91:/mnt/cpn-pod \
  -p 8100:8000 \
  192.168.200.5/docker/vllm/vllm-openai:v0.5.4 \
  --disable-log-requests --disable-log-stats \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.98 \
  --model /mnt/cpn-pod/models/openbmb/MiniCPM-V-2_6 \
  --served-model-name openbmb/MiniCPM-V-2_6 \
  --tensor-parallel-size 1

Drawbacks

INFO 08-06 18:07:04 api_server.py:339] vLLM API server version 0.5.4
INFO 08-06 18:07:04 api_server.py:340] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/mnt/cpn-pod/models/openbmb/MiniCPM-V-2_6', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=32768, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.98, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=True, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['openbmb/MiniCPM-V-2_6'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=True, max_log_len=None)
WARNING 08-06 18:07:04 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-06 18:07:04 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/mnt/cpn-pod/models/openbmb/MiniCPM-V-2_6', speculative_config=None, tokenizer='/mnt/cpn-pod/models/openbmb/MiniCPM-V-2_6', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=openbmb/MiniCPM-V-2_6, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-06 18:07:06 model_runner.py:720] Starting to load model /mnt/cpn-pod/models/openbmb/MiniCPM-V-2_6...
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.32it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.51it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]

INFO 08-06 18:07:11 model_runner.py:732] Loading model weights took 15.1930 GB
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 263, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 940, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/minicpmv.py", line 624, in forward
    vlm_embeddings, _ = self.get_embedding(input_ids, image_inputs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/minicpmv.py", line 530, in get_embedding
    vision_hidden_states = self.get_vision_hidden_states(image_inputs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/minicpmv.py", line 980, in get_vision_hidden_states
    vision_embedding = self.vpm(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/na_vit.py", line 785, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/na_vit.py", line 686, in forward
    layer_outputs = encoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/na_vit.py", line 585, in forward
    hidden_states, attn_weights = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/na_vit.py", line 347, in forward
    attn_weights = nn.functional.softmax(attn_weights,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1890, in softmax
    ret = input.softmax(dim, dtype=dtype)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 47.50 GiB of which 3.08 GiB is free. Process 320105 has 44.41 GiB memory in use. Of the allocated memory 43.67 GiB is allocated by PyTorch, and 335.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Unresolved questions

No response

dionren added the question label on Aug 6, 2024

AlphaINF commented Aug 7, 2024

I ran into this problem too. Memory blows up badly during loading, and the OOM is always reported on GPU 0.


AlphaINF commented Aug 7, 2024

I tested on a single A100-80G. When loading with vllm, GPU memory first climbs to 16GB (reading the model weights); at some moment right after loading finishes it peaks at 29GB, then drops back to 19GB. The cause is unclear.


AlphaINF commented Aug 7, 2024

Found a fix: vllm's max-num-seqs defaults to 256, which causes extremely high memory consumption during the initialization/startup phase. Lowering it to 32 and setting gpu-memory-utilization to 1 lets this model run successfully on a single 3090.
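
Applied to the launch command from the original report, the suggestion amounts to adding these two engine flags (a sketch using the commenter's values, not a verified configuration; per the follow-up comments below, --max-model-len may also need lowering):

  --max-num-seqs 32 \
  --gpu-memory-utilization 1 \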

HwwwwwwwH (Contributor) commented

> I tested on a single A100-80G. When loading with vllm, GPU memory first climbs to 16GB (reading the model weights); at some moment right after loading finishes it peaks at 29GB, then drops back to 19GB. The cause is unclear.

During initialization vllm runs a dry profiling pass on dummy data. On the image side it uses max_model_len (default 8192) divided by the number of tokens per image, and MiniCPM-V uses relatively few tokens per image (64), so that works out to 128 dummy images, far more than are actually needed. I discussed this with them before; the recommendation is to set max_model_len to a smaller value at initialization.
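
As a back-of-envelope check of that arithmetic (the formula here is taken from the explanation above, not read from the vLLM source):

echo $(( 8192 / 64 ))   # default max_model_len -> 128 dummy images profiled
echo $(( 4096 / 64 ))   # halved max_model_len  ->  64 dummy images profiled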


AlphaINF commented Aug 7, 2024

In my testing, lowering max_model_len alone wasn't enough; it still failed even at 3072.
It finally ran with max_model_len=4096, max_num_seqs=32, gpu_memory_utilization=1.
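
Putting that together with the original invocation, a sketch of the combination that reportedly worked (paths, image name, and port mapping carried over from the report above; the values are the commenter's and not independently verified):

docker run -d --ipc host \
  --gpus '"device=1"' \
  -v /mnt/cpn-pod/b11d5292-85ab-e9a4-7eca-31614bb76c91:/mnt/cpn-pod \
  -p 8100:8000 \
  192.168.200.5/docker/vllm/vllm-openai:v0.5.4 \
  --disable-log-requests --disable-log-stats \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 1 \
  --model /mnt/cpn-pod/models/openbmb/MiniCPM-V-2_6 \
  --served-model-name openbmb/MiniCPM-V-2_6 \
  --tensor-parallel-size 1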

Cuiunbo closed this as completed on Aug 7, 2024