[Bug]: RuntimeError: No suitable kernel. h_in=16 h_out=7392 dtype=Float out_dtype=BFloat16 #6126
Comments
I built vLLM from source (pre-release vtest) and exported VLLM_INSTALL_PUNICA_KERNELS=1.
I have the same problem.
Hi, #5036 should be able to address your issue. You can clone the corresponding branch to test it.
Thanks, the branch "refactor-punica-kernel" works well.
But there is a bug: when I run the above script a second time, it raises an error, and I have to delete all the cache manually before it will run again.

INFO 07-04 20:07:09 config.py:703] Defaulting to use mp for distributed inference
INFO 07-04 20:07:09 llm_engine.py:169] Initializing an LLM engine (v0.5.0.post1) with config: model='/data03/xxx_share/Qwen/Qwen2-72B-Instruct', speculative_config=None, tokenizer='/data03/xxx_share/Qwen/Qwen2-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data03/xxx_share/Qwen/Qwen2-72B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=87017) INFO 07-04 20:07:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87018) INFO 07-04 20:07:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87016) INFO 07-04 20:07:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87016) INFO 07-04 20:07:13 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=87016) INFO 07-04 20:07:13 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=87017) INFO 07-04 20:07:13 utils.py:720] Found nccl from library libnccl.so.2
INFO 07-04 20:07:13 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=87017) INFO 07-04 20:07:13 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-04 20:07:13 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=87018) INFO 07-04 20:07:13 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=87018) INFO 07-04 20:07:13 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 07-04 20:07:15 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=87016) WARNING 07-04 20:07:15 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=87017) WARNING 07-04 20:07:15 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=87018) WARNING 07-04 20:07:15 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-04 20:07:32 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=87016) INFO 07-04 20:07:32 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=87017) INFO 07-04 20:07:33 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=87018) INFO 07-04 20:07:33 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: 'utf-8' codec can't decode byte 0xbe in position 18: invalid start byte, Traceback (most recent call last):
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/worker.py", line 175, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/model_runner.py", line 849, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/model_runner.py", line 1215, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 336, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 257, in forward
    hidden_states, residual = layer(
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 209, in forward
    hidden_states = self.self_attn(
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 153, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 511, in forward
    output_parallel = self.apply(input_, bias)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 918, in apply
    _apply_lora_packed_nslice(
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 134, in _apply_lora_packed_nslice
    add_lora(output,
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/punica.py", line 253, in add_lora
    add_expand_slice(
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/punica.py", line 159, in add_expand_slice
    sgmv_expand_slice(
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/ops/sgmv_expand_slice.py", line 178, in sgmv_expand_slice
    _sgmv_expand_slice_kernel[grid](
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 202, in compile
    return CompiledKernel(so_path, metadata_group.get(metadata_filename))
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 230, in __init__
    self.asm = {
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 231, in <dictcomp>
    file.suffix[1:]: file.read_bytes() if file.suffix[1:] == driver.binary_ext else file.read_text()
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/pathlib.py", line 1135, in read_text
    return f.read()
  File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 18: invalid start byte
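The failure mode in the traceback above can be reproduced in isolation: Triton's `CompiledKernel` reads non-binary cached artifacts back with `Path.read_text()`, which decodes as UTF-8, so a cache file corrupted with a non-UTF-8 byte fails to load. A minimal sketch (the file name `kernel.ptx` is hypothetical, not taken from the actual cache layout):

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as cache_dir:
    artifact = pathlib.Path(cache_dir) / "kernel.ptx"  # hypothetical cache entry
    # Simulate a corrupted cache entry: 0xbe can never start a UTF-8 sequence.
    artifact.write_bytes(b"// ptx header\n" + b"\xbe" + b"...")
    try:
        # Same call as the pathlib.read_text frame in the traceback above.
        artifact.read_text()
    except UnicodeDecodeError as exc:
        print("decode failed:", exc.reason)  # → decode failed: invalid start byte
```

This is why deleting the cache makes the script run again: the corrupted entry is gone and the kernel is recompiled.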
When I run it with vLLM's API server (python -m vllm.entrypoints.openai.api_server --...), the LoRA adapter seems to have no effect; the model behaves like the original. But the LoRA adapter works well when I run the above script.
This is a Triton bug; refer to #6103. For now, you can temporarily avoid this error by setting distributed_executor_backend="ray":

llm = vllm.LLM(
    MODEL_PATH,
    enable_lora=True,
    max_num_seqs=16,
    max_loras=2,
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    tensor_parallel_size=4,
    distributed_executor_backend="ray",
)
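For the manual recovery mentioned earlier (deleting the cache), removing Triton's on-disk kernel cache can be sketched as follows, assuming Triton's default cache location of `~/.triton/cache` (overridable via the `TRITON_CACHE_DIR` environment variable):

```shell
# Remove Triton's compiled-kernel cache so all kernels recompile cleanly.
# Uses TRITON_CACHE_DIR if set, else Triton's default ~/.triton/cache.
rm -rf "${TRITON_CACHE_DIR:-$HOME/.triton/cache}"
```

The next run will be slower while kernels recompile, but any corrupted cache entries are gone.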
I will look into this issue tomorrow.
Set distributed_executor_backend="ray".
I run the API server with LLaMA-Factory, and the adapter works well.
This should be resolved by the newly landed Triton kernels in #5036.
Your current environment
🐛 Describe the bug
CUDA_VISIBLE_DEVICES=4,5,6,7 python vllm_qwen2_lora.py