[Bug]: vLLM 0.5.5 using prefix caching causing CUDA error: illegal memory access #8230

Open
Sekri0 opened this issue Sep 6, 2024 · 8 comments
Labels: bug (Something isn't working), stale

Comments


Sekri0 commented Sep 6, 2024

Your current environment

The output of `python collect_env.py` was not provided.

🐛 Describe the bug

Running with --enable-prefix-caching causes a CUDA error: an illegal memory access. According to the traceback, this bug appears to originate in FlashAttention. I noticed that PR 7018 and PR 7142 seemed to fix this problem, but vLLM 0.5.5 still has this bug.
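For context, the `--enable-prefix-caching` flag maps to `enable_prefix_caching=True` in the engine arguments. A sketch of a roughly equivalent offline configuration (the checkpoint name is a placeholder, and the tensor-parallel size is an assumption based on the VllmWorkerProcess errors below, not taken from the original launch command):

    # Sketch only: offline equivalent of serving with --enable-prefix-caching.
    # The model checkpoint and tensor_parallel_size are placeholders inferred
    # from the Qwen2 traceback and the worker-process errors below.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2-7B-Instruct",  # placeholder Qwen2 checkpoint
        tensor_parallel_size=2,          # assumption: TP > 1 (multiproc workers)
        enable_prefix_caching=True,      # the flag implicated in this issue
    )
    outputs = llm.generate(
        ["Hello, world!"],
        SamplingParams(temperature=0.0, max_tokens=32),
    )

The failure then surfaces in the worker processes as the traceback below.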

Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
, Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
output = executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 69, in start_worker_execution_loop
output = self.execute_model(execute_model_req=None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 322, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1415, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 361, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 277, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 210, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 157, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 692, in forward
num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/ops.py", line 1061, in call
return self
._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/_library/custom_ops.py", line 236, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func
return _flash_attn_varlen_func(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func
return FlashAttnVarlenFunc.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 157, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 692, in forward
num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/ops.py", line 1061, in call
return self
._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/_library/custom_ops.py", line 236, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func
return _flash_attn_varlen_func(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func
return FlashAttnVarlenFunc.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Sekri0 added the bug label Sep 6, 2024
Sekri0 (Author) commented Sep 6, 2024

@zachzzc @raywanb Sorry to bother you guys, could you please take a look at this problem?

@flexorRegev

I think I have the same bug (also running 0.5.5) when running with tp=2 and a Neural Magic 8-bit FP8 quantized Llama 3.1 70B.
It happens rarely, but it does happen sometimes (2x H100).

zachzzc (Contributor) commented Sep 10, 2024

Can you provide a minimal script to reproduce your problem? @Sekri0

@flexorRegev

I can provide one for mine, but it happens nondeterministically and I couldn't find the exact thing that causes it.

Sekri0 (Author) commented Sep 10, 2024

> Can you provide a minimal script to reproduce your problem? @Sekri0

Sorry for the late reply. This issue occurs midway through running the inference service, so I'm not sure exactly which request causes the crash. I tried to reproduce the bug with offline inference, but failed.
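For reference, the kind of offline attempt described looks roughly like this (a sketch; the checkpoint, prompts, and request batching are placeholders, chosen so that the requests share a long common prefix and the prefix cache is actually exercised):

    # Sketch of an offline reproduction attempt (placeholders throughout).
    # Prefix caching only kicks in when requests share a sufficiently long
    # common prefix, so the prompts below are built from one shared prefix.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2-7B-Instruct",  # placeholder checkpoint
        enable_prefix_caching=True,
    )

    shared_prefix = "You are a helpful assistant. " * 100  # long shared prefix
    prompts = [shared_prefix + f"Question {i}: summarize item {i}." for i in range(8)]

    # Sending the requests in several small waves loosely mimics requests
    # arriving at intervals against a running service.
    for start in range(0, len(prompts), 2):
        llm.generate(
            prompts[start:start + 2],
            SamplingParams(temperature=0.0, max_tokens=64),
        )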

zachzzc (Contributor) commented Sep 10, 2024

> > Can you provide a minimal script to reproduce your problem? @Sekri0
>
> Sorry for the late reply. This issue occurs midway through running the inference service, so I'm not sure exactly which request causes the crash. I tried to reproduce the bug with offline inference, but failed.

It would be helpful if you could log/print the inputs to the flash attention call at

    num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa

when it fails, specifically:

The shapes of

                           q=query,
                           k=key_cache,
                           v=value_cache,

And the values of

                           cu_seqlens_q=prefill_meta.query_start_loc,
                           max_seqlen_q=prefill_meta.max_query_len,
                           cu_seqlens_k=prefill_meta.seq_start_loc,
                           max_seqlen_k=max_seq_len,
                           block_table=prefill_meta.block_tables,
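For concreteness, a sketch of the kind of debug logging that would capture those values, placed immediately before the flash_attn_varlen_func call in vllm/attention/backends/flash_attn.py (variable names follow the snippet above; exact local names may differ between versions):

    # Debug-logging sketch for the prefill path in
    # vllm/attention/backends/flash_attn.py, inserted just before the
    # torch.ops.vllm.flash_attn_varlen_func(...) call. Names follow the
    # snippet above.
    print(f"q_shape: {query.shape}, k_shape: {key_cache.shape}, "
          f"v_shape: {value_cache.shape}")
    print(f"cu_seqlens_q: {prefill_meta.query_start_loc}, "
          f"max_seqlen_q: {prefill_meta.max_query_len}, "
          f"cu_seqlens_k: {prefill_meta.seq_start_loc}")
    print(f"max_seqlen_k: {max_seq_len}")
    print(f"block_table: {prefill_meta.block_tables}")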

Sekri0 (Author) commented Sep 11, 2024

> > > Can you provide a minimal script to reproduce your problem? @Sekri0
> >
> > Sorry for the late reply. This issue occurs midway through running the inference service, so I'm not sure exactly which request causes the crash. I tried to reproduce the bug with offline inference, but failed.
>
> It would be helpful if you could log/print the inputs to the flash attention call at
>
>     num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa
>
> when it fails, specifically:
>
> The shapes of
>
>     q=query,
>     k=key_cache,
>     v=value_cache,
>
> And the values of
>
>     cu_seqlens_q=prefill_meta.query_start_loc,
>     max_seqlen_q=prefill_meta.max_query_len,
>     cu_seqlens_k=prefill_meta.seq_start_loc,
>     max_seqlen_k=max_seq_len,
>     block_table=prefill_meta.block_tables,

I found that sending a few specific requests at certain time intervals can reproduce this bug. I have printed and logged the values you mentioned. I'm sorry that I cannot provide the specific content of the requests for the time being because they contain some sensitive information. I will try to construct some non-confidential requests later.
q_shape: torch.Size([1620, 16, 128]), k_shape: torch.Size([17554, 16, 2, 128]), v_shape: torch.Size([17554, 16, 2, 128])
cu_seqlens_q: tensor([ 0, 68, 390, 1620], device='cuda:0', dtype=torch.int32), max_seqlen_q: 1230, cu_seqlens_k: tensor([ 0, 196, 646, 2004], device='cuda:0', dtype=torch.int32)
max_seqlen_k: 1358
block_table: tensor([[ 0, 1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 14, 15, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
113]], device='cuda:0', dtype=torch.int32)
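For what it's worth, the basic invariants these values should satisfy can be checked standalone; a small sketch (the block size of 16 is an assumption read off the cache shape [num_blocks, block_size, num_kv_heads, head_dim] = [17554, 16, 2, 128]):

    # Standalone sanity checks on the logged prefill metadata. The block size
    # of 16 is inferred from the key/value cache shape [17554, 16, 2, 128],
    # assuming the layout [num_blocks, block_size, num_kv_heads, head_dim].
    import math
    import torch

    cu_seqlens_q = torch.tensor([0, 68, 390, 1620])
    cu_seqlens_k = torch.tensor([0, 196, 646, 2004])
    max_seqlen_q, max_seqlen_k = 1230, 1358
    num_cache_blocks, block_size = 17554, 16

    query_lens = (cu_seqlens_q[1:] - cu_seqlens_q[:-1]).tolist()  # [68, 322, 1230]
    seq_lens = (cu_seqlens_k[1:] - cu_seqlens_k[:-1]).tolist()    # [196, 450, 1358]

    # max_seqlen_q/max_seqlen_k should equal the longest per-sequence lengths.
    assert max(query_lens) == max_seqlen_q
    assert max(seq_lens) == max_seqlen_k

    # Each sequence needs ceil(seq_len / block_size) block-table entries, and
    # every referenced block id must be below the number of allocated blocks.
    blocks_needed = [math.ceil(n / block_size) for n in seq_lens]
    print("blocks needed per sequence:", blocks_needed)  # [13, 29, 85]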


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Dec 12, 2024