[Bug]: vLLM 0.5.5 using prefix caching causing CUDA error: illegal memory access #8230

Open
Sekri0 opened this issue Sep 6, 2024 · 8 comments
Labels: bug (Something isn't working), stale

Comments


Sekri0 commented Sep 6, 2024

Your current environment

The output of `python collect_env.py` was not provided.

🐛 Describe the bug

Running with --enable-prefix-caching causes a CUDA error: an illegal memory access. According to the traceback, this bug appears to originate in FlashAttention. I noticed that PR 7018 and PR 7142 seemed to fix this problem, but vLLM 0.5.5 still has this bug.
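For context, the `--enable-prefix-caching` flag maps to `enable_prefix_caching=True` in the engine arguments. A sketch of a roughly equivalent offline configuration (the checkpoint name is a placeholder, and the tensor-parallel size is an assumption based on the VllmWorkerProcess errors below, not taken from the original launch command):

    # Sketch only: offline equivalent of serving with --enable-prefix-caching.
    # The model checkpoint and tensor_parallel_size are placeholders inferred
    # from the Qwen2 traceback and the worker-process errors below.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2-7B-Instruct",  # placeholder Qwen2 checkpoint
        tensor_parallel_size=2,          # assumption: TP > 1 (multiproc workers)
        enable_prefix_caching=True,      # the flag implicated in this issue
    )
    outputs = llm.generate(
        ["Hello, world!"],
        SamplingParams(temperature=0.0, max_tokens=32),
    )

The failure then surfaces in the worker processes as the traceback below.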

Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
, Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
output = executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 69, in start_worker_execution_loop
output = self.execute_model(execute_model_req=None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 322, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1415, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 361, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 277, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 210, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 157, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 692, in forward
num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/ops.py", line 1061, in call
return self
._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/_library/custom_ops.py", line 236, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func
return _flash_attn_varlen_func(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func
return FlashAttnVarlenFunc.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 157, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 692, in forward
num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/ops.py", line 1061, in call
return self
._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/_library/custom_ops.py", line 236, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 48, in flash_attn_varlen_func
return _flash_attn_varlen_func(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1154, in flash_attn_varlen_func
return FlashAttnVarlenFunc.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 632, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 90, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Sekri0 added the bug label Sep 6, 2024
Sekri0 (Author) commented Sep 6, 2024

@zachzzc @raywanb Sorry to bother you guys, could you please take a look at this problem?

@flexorRegev

I think I have the same bug (also running 0.5.5) when running with tp=2 and a Neural Magic 8-bit FP8 quantized Llama 3.1 70B.
It happens rarely, but it does happen sometimes (2x H100).

zachzzc (Contributor) commented Sep 10, 2024

Can you provide a minimal script to reproduce your problem? @Sekri0

@flexorRegev

I can provide one for mine, but it happens nondeterministically and I couldn't find the exact thing that causes it.

Sekri0 (Author) commented Sep 10, 2024

> Can you provide a minimal script to reproduce your problem? @Sekri0

Sorry for the late reply. This issue occurs midway through running the inference service, so I'm not sure exactly which request causes the crash. I tried to reproduce the bug with offline inference, but failed.
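For reference, the kind of offline attempt described looks roughly like this (a sketch; the checkpoint, prompts, and request batching are placeholders, chosen so that the requests share a long common prefix and the prefix cache is actually exercised):

    # Sketch of an offline reproduction attempt (placeholders throughout).
    # Prefix caching only kicks in when requests share a sufficiently long
    # common prefix, so the prompts below are built from one shared prefix.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2-7B-Instruct",  # placeholder checkpoint
        enable_prefix_caching=True,
    )

    shared_prefix = "You are a helpful assistant. " * 100  # long shared prefix
    prompts = [shared_prefix + f"Question {i}: summarize item {i}." for i in range(8)]

    # Sending the requests in several small waves loosely mimics requests
    # arriving at intervals against a running service.
    for start in range(0, len(prompts), 2):
        llm.generate(
            prompts[start:start + 2],
            SamplingParams(temperature=0.0, max_tokens=64),
        )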

zachzzc (Contributor) commented Sep 10, 2024

> > Can you provide a minimal script to reproduce your problem? @Sekri0
>
> Sorry for the late reply. This issue occurs midway through running the inference service, so I'm not sure exactly which request causes the crash. I tried to reproduce the bug with offline inference, but failed.

It would be helpful if you could log/print the inputs to the flash attention call at

    num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa

when it fails, specifically:

The shapes of

                           q=query,
                           k=key_cache,
                           v=value_cache,

And the values of

                           cu_seqlens_q=prefill_meta.query_start_loc,
                           max_seqlen_q=prefill_meta.max_query_len,
                           cu_seqlens_k=prefill_meta.seq_start_loc,
                           max_seqlen_k=max_seq_len,
                           block_table=prefill_meta.block_tables,
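For concreteness, a sketch of the kind of debug logging that would capture those values, placed immediately before the flash_attn_varlen_func call in vllm/attention/backends/flash_attn.py (variable names follow the snippet above; exact local names may differ between versions):

    # Debug-logging sketch for the prefill path in
    # vllm/attention/backends/flash_attn.py, inserted just before the
    # torch.ops.vllm.flash_attn_varlen_func(...) call. Names follow the
    # snippet above.
    print(f"q_shape: {query.shape}, k_shape: {key_cache.shape}, "
          f"v_shape: {value_cache.shape}")
    print(f"cu_seqlens_q: {prefill_meta.query_start_loc}, "
          f"max_seqlen_q: {prefill_meta.max_query_len}, "
          f"cu_seqlens_k: {prefill_meta.seq_start_loc}")
    print(f"max_seqlen_k: {max_seq_len}")
    print(f"block_table: {prefill_meta.block_tables}")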

Sekri0 (Author) commented Sep 11, 2024

> > > Can you provide a minimal script to reproduce your problem? @Sekri0
> >
> > Sorry for the late reply. This issue occurs midway through running the inference service, so I'm not sure exactly which request causes the crash. I tried to reproduce the bug with offline inference, but failed.
>
> It would be helpful if you could log/print the inputs to the flash attention call at
>
>     num_prefill_tokens] = torch.ops.vllm.flash_attn_varlen_func( # noqa
>
> when it fails, specifically:
>
> The shapes of
>
>     q=query,
>     k=key_cache,
>     v=value_cache,
>
> And the values of
>
>     cu_seqlens_q=prefill_meta.query_start_loc,
>     max_seqlen_q=prefill_meta.max_query_len,
>     cu_seqlens_k=prefill_meta.seq_start_loc,
>     max_seqlen_k=max_seq_len,
>     block_table=prefill_meta.block_tables,

I found that sending a few specific requests at certain time intervals can reproduce this bug. I have printed and logged the values you mentioned. I'm sorry that I cannot provide the specific content of the requests for the time being because they contain some sensitive information. I will try to construct some non-confidential requests later.
q_shape: torch.Size([1620, 16, 128]), k_shape: torch.Size([17554, 16, 2, 128]), v_shape: torch.Size([17554, 16, 2, 128])
cu_seqlens_q: tensor([ 0, 68, 390, 1620], device='cuda:0', dtype=torch.int32), max_seqlen_q: 1230, cu_seqlens_k: tensor([ 0, 196, 646, 2004], device='cuda:0', dtype=torch.int32)
max_seqlen_k: 1358
block_table: tensor([[ 0, 1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 14, 15, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
113]], device='cuda:0', dtype=torch.int32)
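For what it's worth, the basic invariants these values should satisfy can be checked standalone; a small sketch (the block size of 16 is an assumption read off the cache shape [num_blocks, block_size, num_kv_heads, head_dim] = [17554, 16, 2, 128]):

    # Standalone sanity checks on the logged prefill metadata. The block size
    # of 16 is inferred from the key/value cache shape [17554, 16, 2, 128],
    # assuming the layout [num_blocks, block_size, num_kv_heads, head_dim].
    import math
    import torch

    cu_seqlens_q = torch.tensor([0, 68, 390, 1620])
    cu_seqlens_k = torch.tensor([0, 196, 646, 2004])
    max_seqlen_q, max_seqlen_k = 1230, 1358
    num_cache_blocks, block_size = 17554, 16

    query_lens = (cu_seqlens_q[1:] - cu_seqlens_q[:-1]).tolist()  # [68, 322, 1230]
    seq_lens = (cu_seqlens_k[1:] - cu_seqlens_k[:-1]).tolist()    # [196, 450, 1358]

    # max_seqlen_q/max_seqlen_k should equal the longest per-sequence lengths.
    assert max(query_lens) == max_seqlen_q
    assert max(seq_lens) == max_seqlen_k

    # Each sequence needs ceil(seq_len / block_size) block-table entries, and
    # every referenced block id must be below the number of allocated blocks.
    blocks_needed = [math.ceil(n / block_size) for n in seq_lens]
    print("blocks needed per sequence:", blocks_needed)  # [13, 29, 85]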


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Dec 12, 2024