CUDA Error 801: Operation Not Supported #870

Closed
t19cs045-sub opened this issue Nov 4, 2023 · 2 comments

t19cs045-sub commented Nov 4, 2023

I encountered a CUDA error while running a script that uses the Llama model via llama-cpp-python. The error message is “CUDA error 801 at ggml-cuda.cu:6799: operation not supported” (current device: 0).

Code Snippet:

from llama_cpp import Llama

def question(message):
    # LLM setup: load the GGUF model and offload 32 layers to the GPU
    llm = Llama(
        model_path="./japanese-stablelm-instruct-gamma-7b-q8_0.gguf",
        n_gpu_layers=32,
    )

    # Build the prompt from the user message
    # (assumed instruct-style template; not shown in the original snippet)
    prompt = f"指示:\n{message}\n応答:\n"

    # Run inference; stop on the instruction/input/response markers
    output = llm(
        prompt,
        temperature=1,
        top_p=0.95,
        stop=["指示:", "入力:", "応答:"],
        echo=False,
        max_tokens=1024,
    )
    return output
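A minimal sketch of how question() might be invoked, assuming the prompt handling reconstructed above; the actual calling code is not included in the report:

# Hypothetical usage; the real calling script is not shown in the issue.
if __name__ == "__main__":
    result = question("What is the capital of Japan?")
    print(result["choices"][0]["text"])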

Error Message:
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 132.92 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7205.83 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size = 64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7342.83 MB (model: 7205.83 MB, context: 137.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

CUDA error 801 at ggml-cuda.cu:6799: operation not supported
current device: 0

Environment:

NVIDIA-SMI: 545.23.06
Driver Version: 545.23.06
CUDA Version: 12.3
GPU: NVIDIA Quadro M4000 (8 GB)

Any help in resolving this issue would be greatly appreciated.


Ph0rk0z commented Nov 4, 2023

This is an upstream bug that broke multi-GPU support.

abetlen (Owner) commented Nov 8, 2023

@Ph0rk0z do you have a link to the upstream issue for this?


Ph0rk0z commented Nov 8, 2023

ggerganov/llama.cpp#3930 (comment)

ggerganov/llama.cpp#3944

It's resolved for me. But I think the latest refactoring broke llama.cpp_hf in textgen. Something to do with swapping the ctx in the cache.
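
A quick way to check whether your install already pulls in that fix is to look at the installed llama-cpp-python version (sketch only; which release first bundles the patched llama.cpp is not stated in this thread, so compare against the linked PRs):

# Minimal version check; compare the printed version against the release
# notes to confirm it includes the upstream llama.cpp fix (assumption to verify).
from importlib.metadata import version

print(version("llama-cpp-python"))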

abetlen closed this as completed Nov 10, 2023

al-fk commented Aug 1, 2024

I have the same error.
@t19cs045-sub Did you find a solution?
