
[Bug]: enable_prefix_caching leads to persistent illegal memory access error #6833

Closed
captify-sivakhno opened this issue Jul 26, 2024 · 10 comments
Labels
bug, stale

Comments

@captify-sivakhno

captify-sivakhno commented Jul 26, 2024

Your current environment

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1064-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R32
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           5599.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          512 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           64 MiB (4 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.11.0
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.1
[pip3] torcheval==0.0.7
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-31	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

After running the code

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from outlines.integrations.vllm import RegexLogitsProcessor

import os
os.environ["HF_TOKEN"] = ""

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

proc = RegexLogitsProcessor(r'yes|no', llm)
sampling_params = SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1, logits_processors=[proc])

prompts = ["some long text up to the max model length / 20000 chars", "some long text up to the max model length / 20000 chars", ...] <- list of length 100 to 1000

formatted_prompts = []
for prompt in prompts:
    messages = [{"role": "user", "content": prompt["prompt"]}]
    formatted_prompts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

output = llm.generate(formatted_prompts, sampling_params)

I get the error

RuntimeError: CUDA error: an illegal memory access was encountered

The error seems to happen randomly; sometimes the same command completes without error in the same environment with the same versions.

I have done the following investigations and can confirm:

  • setting enable_prefix_caching=False removes the error
  • prompt length does not seem to impact the error; shortening the 20k-char prompts to 2k chars does not remove it
  • removing RegexLogitsProcessor does not fix the error
  • trying vLLM 0.4.2 and other versions does not help
  • decreasing gpu_memory_utilization to 0.8 does not help
  • using os.environ["VLLM_ATTENTION_BACKEND"]="XFORMERS" instead leads to "The Python process exited with exit code 139 (SIGSEGV: Segmentation fault)"

I have seen quite a few other issues about enable_prefix_caching; could anyone comment on whether the feature has actually worked for them? Many of our prompts are 80-90% repetitive, so prefix caching gives us a dramatic speed-up. I would be grateful for any suggestions!
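For anyone trying to reproduce, the stripped-down harness below is essentially what I have been toggling during the investigations above; the prompt list is a placeholder and only the standard offline vLLM API from the snippet above is used.

from vllm import LLM, SamplingParams

def run_once(enable_prefix_caching: bool, prompts):
    # Build a fresh engine per configuration; run each configuration in its own
    # process, otherwise a second LLM() will not find enough free GPU memory.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        enable_prefix_caching=enable_prefix_caching,
        gpu_memory_utilization=0.8,
    )
    sampling_params = SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1)
    return llm.generate(prompts, sampling_params)

# Placeholder; substitute the real formatted_prompts list from the snippet above.
prompts = ["long, largely repetitive prompt ..."] * 500

# Run one of these per process and compare:
# run_once(enable_prefix_caching=False, prompts=prompts)
# run_once(enable_prefix_caching=True, prompts=prompts)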

Full error detail
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
File <command-1575781236477471>, line 6
      4 os.environ["VLLM_TRACE_FUNCTION"]="TRACE"
      5 os.environ["CUDA_LAUNCH_BLOCKING"]="1"
----> 6 output = llm.generate(formatted_prompts[300:1000], sampling_params)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/utils.py:838, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
    831             msg += f" {additional_message}"
    833         warnings.warn(
    834             DeprecationWarning(msg),
    835             stacklevel=3,  # The inner function takes up one level
    836         )
--> 838 return fn(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/entrypoints/llm.py:316, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request)
    308     sampling_params = SamplingParams()
    310 self._validate_and_add_requests(
    311     inputs=inputs,
    312     params=sampling_params,
    313     lora_request=lora_request,
    314     prompt_adapter_request=prompt_adapter_request)
--> 316 outputs = self._run_engine(use_tqdm=use_tqdm)
    317 return LLMEngine.validate_outputs(outputs, RequestOutput)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/entrypoints/llm.py:569, in LLM._run_engine(self, use_tqdm)
    567 total_out_toks = 0
    568 while self.llm_engine.has_unfinished_requests():
--> 569     step_outputs = self.llm_engine.step()
    570     for output in step_outputs:
    571         if output.finished:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/engine/llm_engine.py:911, in LLMEngine.step(self)
    901     finished_requests_ids = self.scheduler[
    902         0].get_and_reset_finished_requests_ids()
    903     execute_model_req = ExecuteModelRequest(
    904         seq_group_metadata_list=seq_group_metadata_list,
    905         blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
   (...)
    909         running_queue_size=scheduler_outputs.running_queue_size,
    910         finished_requests_ids=finished_requests_ids)
--> 911     output = self.model_executor.execute_model(
    912         execute_model_req=execute_model_req)
    913 else:
    914     output = []
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:110, in GPUExecutor.execute_model(self, execute_model_req)
    107 def execute_model(
    108     self, execute_model_req: ExecuteModelRequest
    109 ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
--> 110     output = self.driver_worker.execute_model(execute_model_req)
    111     return output
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/worker/worker_base.py:272, in LocalOrDistributedWorkerBase.execute_model(self, execute_model_req)
    268 if not get_pp_group().is_first_rank:
    269     intermediate_tensors = IntermediateTensors(
    270         get_pp_group().recv_tensor_dict())
--> 272 output = self.model_runner.execute_model(
    273     model_input, self.kv_cache[worker_input.virtual_engine]
    274     if self.kv_cache is not None else None, intermediate_tensors,
    275     num_steps)
    277 if not get_pp_group().is_last_rank:
    278     # output is IntermediateTensors
    279     get_pp_group().send_tensor_dict(output.tensors)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/worker/model_runner.py:1334, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
   1331     return []
   1333 # Sample the next token.
-> 1334 output: SamplerOutput = self.model.sample(
   1335     logits=logits,
   1336     sampling_metadata=model_input.sampling_metadata,
   1337 )
   1339 if self.return_hidden_states:
   1340     # we only need to pass hidden states of most recent token
   1341     assert model_input.sampling_metadata is not None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/models/llama.py:437, in LlamaForCausalLM.sample(self, logits, sampling_metadata)
    432 def sample(
    433     self,
    434     logits: torch.Tensor,
    435     sampling_metadata: SamplingMetadata,
    436 ) -> Optional[SamplerOutput]:
--> 437     next_tokens = self.sampler(logits, sampling_metadata)
    438     return next_tokens
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py:91, in Sampler.forward(self, logits, sampling_metadata)
     89 # Prepare sampling tensors with pinned memory to avoid blocking.
     90 if not sampling_metadata.reuse_sampling_tensors:
---> 91     self._init_sampling_tensors(logits, sampling_metadata)
     92 elif self._do_penalties:
     93     # In this case, the sampling tensors logic depends on
     94     # "output_tokens" of a sequence. As a result, we cannot
     95     # reuse sampling tensors, since "output_tokens" changes
     96     # between decode runs.
     97     self._init_sampling_tensors(logits, sampling_metadata)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py:68, in Sampler._init_sampling_tensors(self, logits, sampling_metadata)
     64 self._sampling_tensors = None
     66 # Initialize new sampling tensors
     67 (sampling_tensors, do_penalties, do_top_p_top_k,
---> 68  do_min_p) = SamplingTensors.from_sampling_metadata(
     69      sampling_metadata, vocab_size, logits.device, logits.dtype)
     71 self._sampling_tensors = sampling_tensors
     72 self._do_penalties = do_penalties
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/sampling_metadata.py:443, in SamplingTensors.from_sampling_metadata(cls, sampling_metadata, vocab_size, device, dtype, extra_seeds_to_generate, extra_entropy)
    440                 prompt_tokens.append(list(seq_data.prompt_token_ids))
    441                 output_tokens.append(list(seq_data.output_token_ids))
--> 443 sampling_tensors = SamplingTensors.from_lists(
    444     temperatures, top_ps, top_ks, min_ps, presence_penalties,
    445     frequency_penalties, repetition_penalties, sampling_seeds,
    446     sample_indices, prompt_tokens, output_tokens, vocab_size,
    447     extra_seeds_to_generate, device, dtype)
    448 return (sampling_tensors, do_penalties, do_top_p_top_k, do_min_p)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/sampling_metadata.py:487, in SamplingTensors.from_lists(cls, temperatures, top_ps, top_ks, min_ps, presence_penalties, frequency_penalties, repetition_penalties, sampling_seeds, sample_indices, prompt_tokens, output_tokens, vocab_size, extra_seeds_to_generate, device, dtype)
    484     prompt_t = empty_tensor
    485     output_t = empty_tensor
--> 487 temperatures_t = torch.tensor(
    488     temperatures,
    489     device="cpu",
    490     dtype=dtype,
    491     pin_memory=pin_memory,
    492 )
    493 top_ps_t = torch.tensor(
    494     top_ps,
    495     device="cpu",
    496     dtype=dtype,
    497     pin_memory=pin_memory,
    498 )
    499 min_ps_t = torch.tensor(
    500     min_ps,
    501     device="cpu",
    502     dtype=dtype,
    503     pin_memory=pin_memory,
    504 )
captify-sivakhno added the bug label on Jul 26, 2024
@robertgshaw2-neuralmagic
Collaborator

Can you share the exact prompts you are sending? This issue occurs sporadically, so detailed reproduction instructions would be very beneficial for us.

@captify-sivakhno
Author

@robertgshaw2-neuralmagic thanks for the fast reply; here's the link to a file with 5000 prompts:

formatted_prompts.txt.zip

generated as

with open('/Volumes/qa/tv_segmentation_bronze/misc/formatted_prompts.txt', 'w') as f:
    for item in formatted_prompts:
        f.write("%s\n" % item)

This is what went into the input

output = llm.generate(formatted_prompts, sampling_params)
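If it helps, the list can be reconstructed from the attached file with something like this (the path is whatever the archive is unzipped to); one caveat is that any prompt containing embedded newlines will not round-trip exactly through a line-per-prompt file:

# Rebuild the prompt list from the attached file (path is a placeholder).
with open("formatted_prompts.txt") as f:
    formatted_prompts = [line.rstrip("\n") for line in f]

output = llm.generate(formatted_prompts, sampling_params)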

@captify-sivakhno
Author

captify-sivakhno commented Jul 27, 2024

BTW @robertgshaw2-neuralmagic, if you have access to Databricks, one option to easily and fully reproduce the environment is to run this in a notebook on the 15.4 LTS ML Beta (15.4.x-gpu-ml-scala2.12) runtime, as that's where I ran it.

@captify-sivakhno
Author

@robertgshaw2-neuralmagic - regarding your comment about the prompt content above: any suggestions as to which properties of the prompts might be causing the error? I have rerun by reusing only the first prompt as an example

# other code as before
output = llm.generate([formatted_prompts[0]] * len(formatted_prompts), sampling_params)

and it completed fine.
This is encouraging, but the space of possible causes is still quite large (prompt length, token composition, pattern of cache reuse, etc.).
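One next step could be to bisect the full list and narrow it down to a minimal failing subset; a rough sketch along these lines (same llm and sampling_params as before):

# Rough bisection sketch over the prompt list (same llm / sampling_params as above).
# Caveats: the crash is non-deterministic, so each half may need several attempts,
# and a crashed run leaves the CUDA context unusable, so in practice each call
# should be wrapped in its own subprocess; this only shows the shape of the search.
def fails(prompts) -> bool:
    try:
        llm.generate(prompts, sampling_params)
        return False
    except RuntimeError:
        return True

def bisect(prompts):
    while len(prompts) > 1:
        mid = len(prompts) // 2
        left, right = prompts[:mid], prompts[mid:]
        if fails(left):
            prompts = left
        elif fails(right):
            prompts = right
        else:
            break  # the failure needs prompts from both halves
    return prompts

suspect = bisect(formatted_prompts)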

@mengban

mengban commented Jul 30, 2024

Marking; I ran into the same problem.

@chenchunhui97

Marking; I ran into the same problem on v0.5.0.post1.

@Playerrrrr

Same here.

@zachzzc
Contributor

zachzzc commented Aug 1, 2024

Also seeing the same problem. I found that the issue arises when a cached prefill request is scheduled together with a non-cached request. The problem is gone if I force the scheduler to handle only one prefill request at a time. Still debugging.
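For anyone needing a stopgap while this is debugged (this is not the change I tested, which was in the scheduler itself): a crude approximation through the public engine arguments is to cap scheduling to one sequence per step, at a large throughput cost, e.g.

# Crude workaround sketch: max_num_seqs=1 limits each engine step to a single
# sequence, which also serializes prefills. Expect a big throughput hit; mainly
# useful to confirm whether single-prefill scheduling avoids the crash for you.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
    max_num_seqs=1,
)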


github-actions bot commented Nov 1, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Nov 1, 2024

github-actions bot commented Dec 1, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 1, 2024