I have set up llama-server successfully so that it uses my RTX 4000 via CUDA 11, both via Docker and running locally.
But when I use the Python bindings (llama-cpp-python), the GPU does not seem to be utilized at all; everything runs on the CPU only, which takes much longer.
I installed the library with
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
What else do I need in order to enable GPU support?
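One common cause is that pip reuses a previously built CPU-only wheel from its cache, so the CMAKE_ARGS never reach a fresh build. A minimal sketch of forcing a rebuild from source, assuming a recent llama-cpp-python (FORCE_CMAKE and these pip flags are documented install options, but verify against the project's current README):

CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python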
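Even with a CUDA build, llama-cpp-python loads models on the CPU by default: the Llama constructor's n_gpu_layers parameter defaults to 0, so layers must be offloaded explicitly. A minimal sketch, with a placeholder model path:

from llama_cpp import Llama

# n_gpu_layers defaults to 0 (CPU only); -1 offloads all layers to the GPU.
llm = Llama(
    model_path="./models/your-model.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer; use a smaller number if VRAM is tight
    verbose=True,      # the startup log should mention CUDA if the build supports it
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])

If the verbose startup log never mentions a CUDA device, the installed wheel was built without GPU support, and the forced reinstall shown above is the likely fix.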
Replies: 1 comment

Try this