
CUDA error: unspecified launch failure on inference on Nvidia V100 GPUs #1624

Open
rplescia opened this issue Jul 26, 2024 · 3 comments
Labels: bug (Something isn't working)

@rplescia

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I'm running the llama-cpp-python OpenAI-compatible API server on my VM, which has 1x Nvidia V100 16GB GPU allocated to it. The server starts fine, but it falls over as soon as a request is sent.
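
For a minimal reproduction (a sketch, not the exact commands from this report; the model path is taken from the log below and the request body is illustrative):

python3 -m llama_cpp.server --model ./models/mistral-7b-instruct-v0.2.Q5_K_M.gguf --n_gpu_layers -1

# from a second shell; any completion request triggers the crash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'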

Current Behavior

When the server receives a request, it aborts with a CUDA error. The error is identical to one already reported in the Ollama repository: ollama/ollama#5571

/home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2422: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900
ggml_cuda_compute_forward: ROPE failed
CUDA error: unspecified launch failure
current device: 0, in function ggml_cuda_compute_forward at /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2287
err
GGML_ASSERT: /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda.cu:101: !"CUDA error"
Aborted (core dumped)

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
CPU family: 6
Model: 79
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Stepping: 1
BogoMIPS: 5187.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
Virtualization features:
Hypervisor vendor: Microsoft
Virtualization type: full
Caches (sum of all):
L1d: 192 KiB (6 instances)
L1i: 192 KiB (6 instances)
L2: 1.5 MiB (6 instances)
L3: 35 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-5
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Srbds: Not affected
Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown

  • Operating System, e.g. for Linux:

22.04.1-Ubuntu SMP Mon Jun 17 18:38:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
Python 3.10.12
GNU Make 4.3
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Failure Information (for bugs)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /home/...../models/mistral-7b-instruct-v0.2.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 17
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 85.94 MiB
llm_load_tensors: CUDA0 buffer size = 4807.05 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '17'}
Available chat formats from metadata: chat_template.default
Guessed chat format: mistral-instruct
INFO: Started server process [3357]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

@abetlen added the bug label on Aug 7, 2024
@canoalberto

I am having the same issue.

Amusingly, it complains that the kernel has no device code compatible with CUDA arch 700 even though 700 appears in the very list of architectures it says it was compiled for:

mmq.cuh:2422: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900

Environment:
Rocky Linux 9.3
Python 3.11
NVIDIA 545
CUDA 12.3
V100 32GB (CC 7.0 -> arch 700)
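
(For anyone double-checking the GPU-to-arch mapping: recent drivers let nvidia-smi report the compute capability directly. This query is an illustration, not from the original comment.)

nvidia-smi --query-gpu=name,compute_cap --format=csv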

@shaunck96

nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | 0 |
| N/A 34C P0 37W / 250W | 3096MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE-16GB Off | 00000002:00:00.0 Off | Off |
| N/A 34C P0 37W / 250W | 3316MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

llama-cpp-python version: >0.2.85 (to leverage Llama 3.1). Model initialization:

llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_L.gguf",
    n_ctx=4096,
    n_gpu_layers=-1
)
""". """ ggml_cuda_compute_forward: ROPE failed
CUDA error: unspecified launch failure
current device: 0, in function ggml_cuda_compute_forward at /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda.cu:2313
err
/home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900
/home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900

ROPE is failing. For inputs with a large token count, the above error occurs and inference falls back to CPU only. Please advise on a resolution.

@IncubusRK

This looks like the same issue as ollama/ollama#5571.

In this case, it should be enough to delete the following line from your build configuration and rebuild:

$env:CMAKE_ARGS = "-DGGML_CUDA_FORCE_MMQ=ON $env:CMAKE_ARGS"
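
After removing that flag, a rebuild along these lines should produce kernels that run on the V100 (a sketch in the same PowerShell style; keep any other CMAKE_ARGS your setup needs):

# plain CUDA build without forcing MMQ kernels
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
pip install llama-cpp-python --force-reinstall --no-cache-dir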
