[Bug]: segfault when using google/gemma-2-27b-it on vLLM #6252

Closed
federicotorrielli opened this issue Jul 9, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@federicotorrielli

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.3
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             12
On-line CPU(s) list:                0-11
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz
CPU family:                         6
Model:                              106
Thread(s) per core:                 1
Core(s) per socket:                 12
Socket(s):                          1
Stepping:                           6
BogoMIPS:                           6000.20
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip pku ospke gfni vaes vpclmulqdq rdpid md_clear flush_l1d arch_capabilities
Hypervisor vendor:                  Xen
Virtualization type:                full
L1d cache:                          576 KiB (12 instances)
L1i cache:                          384 KiB (12 instances)
L2 cache:                           15 MiB (12 instances)
L3 cache:                           216 MiB (12 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-11
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.26.2
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.3.0
[pip3] torchaudio==2.1.1
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] flashinfer                0.0.8+cu121torch2.3          pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] transformers              4.42.3                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-11	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Running a simple program that performs 1609 inferences on an A100-80G.

VLLM_ATTENTION_BACKEND=FLASHINFER python3 program.py
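
For context, a minimal sketch of what program.py might look like (the real script is not attached to this issue; the prompt-file handling and sampling parameters below are assumptions, only the model name, prompt count, and file name come from the report and log):

# Hypothetical sketch of program.py -- the actual script is not included in this issue.
# Run with: VLLM_ATTENTION_BACKEND=FLASHINFER python3 program.py
import json
from vllm import LLM, SamplingParams

# Assumed: the 1609 prompts are stored as a list of strings in the JSON file named in the log.
with open("conceptnet_zero_shot_prompt_v1.json") as f:
    prompts = json.load(f)

llm = LLM(model="google/gemma-2-27b-it", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=256)  # assumed sampling settings

# The segfault reportedly happens partway through this batched generate call.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)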

Here's the output at the time of the segmentation fault, which always occurs after exactly 1579 prompts:

Running model google/gemma-2-27b-it
WARNING 07-09 09:05:14 utils.py:562] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 07-09 09:05:14 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='google/gemma-2-27b-it', speculative_config=None, tokenizer='google/gemma-2-27b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=google/gemma-2-27b-it, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-09 09:05:15 selector.py:79] Using Flashinfer backend.
WARNING 07-09 09:05:15 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
INFO 07-09 09:05:15 selector.py:79] Using Flashinfer backend.
WARNING 07-09 09:05:15 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
INFO 07-09 09:05:15 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-09 09:05:24 model_runner.py:255] Loading model weights took 50.8043 GB
INFO 07-09 09:05:25 gpu_executor.py:84] # GPU blocks: 3101, # CPU blocks: 712
INFO 07-09 09:05:28 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-09 09:05:28 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-09 09:05:38 model_runner.py:1117] Graph capturing finished in 10 secs.
Processing conceptnet_zero_shot_prompt_v1.json with 1609 prompts for task conceptnet
Processed prompts:  98%|██████████████████▋| 1579/1609 [01:15<00:00, 32.48it/s, est. speed input: 1056.62 toks/s, output: 1332.37 toks/s]Segmentation fault (core dumped)

The kernel log reports the segfaults as:

[ 2569.226857] pt_main_thread[7163]: segfault at 75fb91e00000 ip 0000760e658dfe98 sp 00007ffedcf1a820 error 6 in _kernels.cpython-310-x86_64-linux-gnu.so[760e65672000+21e6000] likely on CPU 11 (core 22, socket 0)
[ 2765.066940] pt_main_thread[7349]: segfault at 7b37e9e00000 ip 00007b4abd2dfe98 sp 00007fff3116c6b0 error 6 in _kernels.cpython-310-x86_64-linux-gnu.so[7b4abd072000+21e6000] likely on CPU 2 (core 4, socket 0)
[ 3449.229246] pt_main_thread[8172]: segfault at 72beb1e00000 ip 000072d1848dfe88 sp 00007ffe38363ad0 error 6 in _kernels.cpython-311-x86_64-linux-gnu.so[72d184672000+21e6000] likely on CPU 7 (core 14, socket 0)
[ 4760.000557] pt_main_thread[9885]: segfault at 7c5199e00000 ip 00007c6472edfe88 sp 00007ffce83ef8e0 error 6 in _kernels.cpython-311-x86_64-linux-gnu.so[7c6472c72000+21e6000] likely on CPU 6 (core 12, socket 0)
[ 5139.093498] pt_main_thread[10700]: segfault at 7f9469e00000 ip 00007fa745edfe88 sp 00007ffc30144730 error 6 in _kernels.cpython-311-x86_64-linux-gnu.so[7fa745c72000+21e6000] likely on CPU 0 (core 0, socket 0)
federicotorrielli added the bug label on Jul 9, 2024
@ciarancourtney

Facing a similar issue running quantised gemma2-9b-it on an A10G with 24 GiB VRAM; it seems to be happening in the flashinfer wrapper:

decode.py(514):             self._wrapper.begin_forward(
decode.py(515):                 self._workspace_buffer,
decode.py(516):                 indptr,
decode.py(517):                 last_page_len,
decode.py(518):                 batch_size,
decode.py(519):                 num_qo_heads,
decode.py(520):                 num_kv_heads,
decode.py(521):                 head_dim,
decode.py(522):                 page_size,
decode.py(523):                 PosEncodingMode[pos_encoding_mode].value,
 --- modulename: enum, funcname: __getitem__
enum.py(440):         return cls._member_map_[name]
 --- modulename: types, funcname: __get__
types.py(177):         if instance is None:
types.py(181):         elif self.fget is None:
types.py(183):         return self.fget(instance)
 --- modulename: enum, funcname: value
enum.py(804):         return self._value_
decode.py(524):                 logits_soft_cap,
decode.py(525):                 empty_q_data,
decode.py(526):                 empty_kv_data,
decode.py(514):             self._wrapper.begin_forward(

Process finished with exit code 139 (interrupted by signal 11:SIGSEGV)

@WendyShang

I constantly run into a segfault when running Gemma 2 9B it on a single H100.

INFO 07-09 17:05:02 async_llm_engine.py:134] Finished request cmpl-222768e59e5d4adf940595c76d71d790.                                                                             
INFO:     10.145.62.199:42532 - "POST /v1/chat/completions HTTP/1.1" 200 OK                                                                                                      
Segmentation fault  

@LiuXiaoxuanPKU
Collaborator

It would be highly appreciated if any of you could provide a minimal reproducible script. Happy to take a look!

@Pernekhan
Contributor

I'm also seeing this problem. Here is how I reproduce it:

curl -X POST -Z -s --parallel-max 50 -H "Content-Type: application/json" http://$ip:8000/v1/completions?[1-10] -d '@-' <<< "$(jq -n --arg prompt "$(yes 'a' | head -c 4000 | tr -d '\n' | sed 's/a/a /g')" '{"model": "google/gemma-2-9b-it", "temperature": 1.0, "max_tokens": 10, "prompt": $prompt}')"

The script generates a prompt containing 'a a a a a a a ...' (~2000 tokens) and sends 10 requests at once. The first request completes successfully, and it segfaults after that. I wasn't able to reproduce it with a ~1000-token context.
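
For anyone who prefers Python over curl, here is a rough equivalent of the repro above. It assumes the vLLM OpenAI-compatible server is already running at localhost:8000; the requests/ThreadPoolExecutor usage is just one illustrative way to fire the 10 concurrent requests:

# Hypothetical Python equivalent of the curl reproduction above (assumes the
# vLLM OpenAI-compatible server is serving google/gemma-2-9b-it at http://localhost:8000).
import concurrent.futures
import requests

prompt = "a " * 2000  # ~2000 tokens of repeated 'a', as in the curl command

def send_request(_):
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "google/gemma-2-9b-it",
            "temperature": 1.0,
            "max_tokens": 10,
            "prompt": prompt,
        },
        timeout=120,
    )
    return resp.status_code, resp.text[:200]

# Fire 10 requests concurrently; with flashinfer 0.0.8 the server reportedly
# segfaults shortly after the first completion returns.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    for status, body in pool.map(send_request, range(10)):
        print(status, body)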

Here is the output of the first request:

{"id":"cmpl-fdcee3bd9b004af0828d164f0f3eb7b3","object":"text_completion","created":1720568916,"model":"google/gemma-2-9b-it","choices":[{"index":0,"text":"\n\n**This is a simple list.** \n\n","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":2002,"total_tokens":2012,"completion_tokens":10}}

This is the segmentation fault log:

Fatal Python error: Segmentation fault

Current thread 0x00007fcf597eb640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/decode.py", line 514 in begin_forward
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flashinfer.py", line 149 in begin_forward
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1221 in execute_model
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 271 in execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58 in run
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83 in _worker
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fcfb27a4640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/vllm/usage/usage_lib.py", line 189 in _report_continous_usage
  File "/usr/local/lib/python3.10/dist-packages/vllm/usage/usage_lib.py", line 141 in _report_usage_worker
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fe222ffd640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fe4ef1ff640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fe66f8bf480 (most recent call first):
  File "/usr/lib/python3.10/asyncio/runners.py", line 44 in run
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", line 65 in run
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/main.py", line 577 in run
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 277 in <module>
  File "/usr/lib/python3.10/runpy.py", line 86 in _run_code
  File "/usr/lib/python3.10/runpy.py", line 196 in _run_module_as_main

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, ujson, regex._regex, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, markupsafe._speedups, PIL._imaging, httptools.parser.parser, httptools.parser.url_parser, websockets.speedups (total: 103)

@LiuXiaoxuanPKU
Collaborator

LiuXiaoxuanPKU commented Jul 11, 2024

Hello folks, we can reproduce the segfault bug from our side.

The bug should be fixed by this. If you need immediate support, build flashinfer main from source (make sure you clean all previous builds before rebuilding).

We will work closely with the flashinfer team to integrate flashinfer's new release.

@LiuXiaoxuanPKU
Collaborator

Please let me know if the issue is not resolved.

@ciarancourtney

Yep flashinfer v0.0.9 fixes my segfault 👍

@federicotorrielli
Author

Thanks! Fixed everything for me.
