
How to use GPU? #576

Open
imwide opened this issue Aug 5, 2023 · 21 comments
Labels: build, hardware (Hardware specific issue)

Comments

@imwide commented Aug 5, 2023

I run llama-cpp-python on my new PC, which has a built-in RTX 3060 with 12 GB of VRAM.
This is my code:

from llama_cpp import Llama

llm = Llama(model_path="./wizard-mega-13B.ggmlv3.q4_0.bin", n_ctx=2048)

def generate(params):
    print(params["prompt"])
    output = llm(params["prompt"], max_tokens=params["max_tokens"], stop=params["stop"], echo=params["echo"])
    return output  # return the completion dict so the caller can read output["choices"][0]["text"]

This code works and I get the results that I want, but the inference is terribly slow: for a few tokens it takes up to 10 seconds. How do I minimize this time? I don't think my GPU is doing the heavy lifting here...

@mzen17 (Contributor) commented Aug 8, 2023

You need to use n_gpu_layers in the initialization of Llama(), which offloads some of the work to the GPU. If you have enough VRAM, just put an arbitrarily high number, or decrease it until you no longer get out-of-VRAM errors.

For example:
llm = Llama(model_path="./wizard-mega-13B.ggmlv3.q4_0.bin", n_ctx=2048, n_gpu_layers=30)
See the API Reference for the full parameter list.

Also, to get GPU support, you need to pip install it from source (you might need the CUDA Toolkit):
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python [copied from the README]

@imwide (Author) commented Aug 8, 2023

Thank you @mzen17. When I run the command for installing it from source, I get an error (btw I have the CUDA toolkit installed):

-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Found CUDAToolkit: /usr/local/cuda/include (found version "9.0.176")
      -- cuBLAS found
      -- The CUDA compiler identification is unknown
      CMake Error at /tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
        Failed to detect a default CUDA architecture.
      
      
      
        Compiler output:
      
      Call Stack (most recent call first):
        vendor/llama.cpp/CMakeLists.txt:249 (enable_language)
      
      
      -- Configuring incomplete, errors occurred!
      Traceback (most recent call last):
        File "/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/setuptools_wrap.py", line 666, in setup
          env = cmkr.configure(
        File "/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/cmaker.py", line 357, in configure
          raise SKBuildError(msg)
      
      An error occurred while configuring with CMake.
        Command:
          /tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/cmake/data/bin/cmake /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45 -G Ninja -DCMAKE_MAKE_PROGRAM:FILEPATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/ninja/data/bin/ninja --no-warn-unused-cli -DCMAKE_INSTALL_PREFIX:PATH=/tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45/_skbuild/linux-x86_64-3.10/cmake-install -DPYTHON_VERSION_STRING:STRING=3.10.12 -DSKBUILD:INTERNAL=TRUE -DCMAKE_MODULE_PATH:PATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/resources/cmake -DPYTHON_EXECUTABLE:PATH=/usr/bin/python3 -DPYTHON_INCLUDE_DIR:PATH=/usr/include/python3.10 -DPYTHON_LIBRARY:PATH=/usr/lib/x86_64-linux-gnu/libpython3.10.so -DPython_EXECUTABLE:PATH=/usr/bin/python3 -DPython_ROOT_DIR:PATH=/usr -DPython_FIND_REGISTRY:STRING=NEVER -DPython_INCLUDE_DIR:PATH=/usr/include/python3.10 -DPython3_EXECUTABLE:PATH=/usr/bin/python3 -DPython3_ROOT_DIR:PATH=/usr -DPython3_FIND_REGISTRY:STRING=NEVER -DPython3_INCLUDE_DIR:PATH=/usr/include/python3.10 -DCMAKE_MAKE_PROGRAM:FILEPATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/ninja/data/bin/ninja -DLLAMA_CUBLAS=on -DCMAKE_BUILD_TYPE:STRING=Release -DLLAMA_CUBLAS=on
        Source directory:
          /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45
        Working directory:
          /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45/_skbuild/linux-x86_64-3.10/cmake-build
      Please see CMake's output for more information.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Any idea how to fix this?
I saw it says "Failed to detect a default CUDA architecture", even though I have CUDA installed. When doing torch.cuda.is_available() it returns True...

@mzen17 (Contributor) commented Aug 8, 2023

PyTorch ships with its own CUDA runtime, so it is likely an issue with your system CUDA install.

What version of the CUDA toolkit do you use?

@imwide (Author) commented Aug 8, 2023

Using nvcc --version, this is the output:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

Also, for some reason the installation just worked now... but it still says BLAS = 0 and the work is not done on my GPU, even though I have set n_gpu_layers=40...

@mzen17 (Contributor) commented Aug 8, 2023

Forgot to mention, but make sure you set the environment variable FORCE_CMAKE to 1 before running the install.

On Linux, the command would be
export FORCE_CMAKE=1

If you are on Windows, it should be set:
set FORCE_CMAKE=1
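Putting those pieces together on Linux, a minimal sketch (the --force-reinstall and --no-cache-dir flags matter because pip may otherwise reuse a previously built CPU-only wheel):

    export FORCE_CMAKE=1
    CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir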

@gjmulder added the build and hardware (Hardware specific issue) labels Aug 9, 2023
@radames commented Aug 20, 2023

Thanks for the information here; the Dockerfile example was also very helpful.
I have a fully functional demo running with a Gradio UI and GPU here, in case it is helpful for others:
https://huggingface.co/spaces/SpacesExamples/llama-cpp-python-cuda-gradio

@bash-bandicoot

@radames, can you share the docker run command?

@radames commented Aug 28, 2023

Hi @kirkog86, you can try this:

docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all \
	-e HF_HOME="/data/.huggingface" \
	-e REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
	-e MODEL_FILE="llama-2-7b-chat.ggmlv3.q5_0.bin" \
	registry.hf.space/spacesexamples-llama-cpp-python-cuda-gradio:latest

@bash-bandicoot

Thanks, @radames! It works very well, including the API.
By the way, any suggestions on a faster model, provided I have enough hardware?

@radames commented Aug 29, 2023

Hi @kirkog86, you'll have to play around; you can change the llama-cpp params to adapt to your specific hardware. In my Docker example I haven't exposed the param, but you could change n_gpu_layers. You can also explore the additional options.

@YogeshTembe

@mzen17 @radames I tried the following commands on Windows but the GPU is not utilised.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

  1. set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
  2. set FORCE_CMAKE=1
  3. pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Can you please let me know if anything is missing in these steps?
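A hedged guess at the cause, since it is a common Windows pitfall: with cmd.exe, set CMAKE_ARGS="-DLLAMA_CUBLAS=on" stores the surrounding quotes as part of the value, so the flag may never reach CMake as intended. A sketch of the usual workarounds (cmd.exe and PowerShell respectively):

    REM cmd.exe - no quotes around the value
    set CMAKE_ARGS=-DLLAMA_CUBLAS=on
    set FORCE_CMAKE=1
    pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

    # PowerShell
    $env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
    $env:FORCE_CMAKE = "1"
    pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir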

@radames commented Oct 16, 2023

@YogeshTembe

@radames Yes, I have followed the same.
We just need to set one variable, right? => CMAKE_ARGS="-DLLAMA_OPENBLAS=on"

@imwide (Author) commented Oct 17, 2023

@radames
Don't just run CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Instead, try CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
It worked for me with the same issue...

@streetycat

thanks for the information here, also the Dockerfile example was very helpful. I have a fully functional demo running with Gradio UI and GPU here if this is helpful for others https://huggingface.co/spaces/SpacesExamples/llama-cpp-python-cuda-gradio

How is the performance? I started a server with this Docker image, but I didn't find it faster than the CPU, and the CPU is also heavily loaded.

I start the Docker container as follows:

git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
docker build -t llama-cpp-python-cuda docker/cuda_simple/
docker run --gpus all --rm -it -p 8000:8000 -v /models/llama:/models -e MODEL=/models/llama-2-7b-chat.Q4_0.gguf llama-cpp-python-cuda

And the performance with the GPU container:

llama_print_timings:        load time =  6922.67 ms
llama_print_timings:      sample time =    33.68 ms /    83 runs   (    0.41 ms per token,  2464.44 tokens per second)
llama_print_timings: prompt eval time =  6922.56 ms /   185 tokens (   37.42 ms per token,    26.72 tokens per second)
llama_print_timings:        eval time = 10499.28 ms /    82 runs   (  128.04 ms per token,     7.81 tokens per second)
llama_print_timings:       total time = 17853.78 ms

And the performance with CPU only:

llama_print_timings:        load time =  6582.30 ms
llama_print_timings:      sample time =    22.01 ms /    56 runs   (    0.39 ms per token,  2544.30 tokens per second)
llama_print_timings: prompt eval time =  6582.18 ms /   175 tokens (   37.61 ms per token,    26.59 tokens per second)
llama_print_timings:        eval time =  7019.08 ms /    55 runs   (  127.62 ms per token,     7.84 tokens per second)
llama_print_timings:       total time = 13941.88 ms
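The GPU and CPU timings above are almost identical, which usually means no layers were actually offloaded (the server defaults to CPU-only unless n_gpu_layers is set). A sketch of how that could be changed in this setup; note that the N_GPU_LAYERS environment variable is an assumption based on how MODEL is passed here, while the --n_gpu_layers flag of llama_cpp.server is the documented route:

    # assumption: the server picks up settings from env vars the same way it picks up MODEL
    docker run --gpus all --rm -it -p 8000:8000 -v /models/llama:/models \
        -e MODEL=/models/llama-2-7b-chat.Q4_0.gguf \
        -e N_GPU_LAYERS=35 \
        llama-cpp-python-cuda

    # or, outside Docker, pass the flag to the server directly
    python3 -m llama_cpp.server --model /models/llama-2-7b-chat.Q4_0.gguf --n_gpu_layers 35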

@streetycat

[quotes the previous comment above]

Ok, I have finished it.

#828

antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this issue Oct 30, 2023
Fixes building for x86 processors missing F16C featureset
MSVC not included, as in MSVC F16C is implied with AVX2/AVX512
@JimmyJIA-02

The problem I met here is that I can successfully install it and run it, but once I have BLAS equal to 1, the LLM no longer generates any response to my prompt, which is weird.

@ankshith commented Dec 11, 2023

1> I was facing a similar issue, so what I did was install CUDA v11.8 and cuDNN v8.9.6. You need to check the TensorFlow version you are currently using; for me 2.10.0 worked, newer versions failed, and I used Python 3.11.0.

2> You need to create a folder on the C drive and name it cuda or cuDNN as you wish, then extract the files from the downloaded cuDNN zip into that folder, then go to environment variables and edit PATH.

  1. C:\cuDNN\bin
  2. C:\cuDNN\include
  3. C:\cuDNN\lib\x64

These are the paths that you need to set.

3> Also, after installing CUDA, you have to set these paths in the environment variables:

  1. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\extras\CUPTI\lib64
  2. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include

4> After doing the above steps you need to install PyTorch for CUDA 11.8:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

5> Then install llama-cpp-python:

set CMAKE_ARGS="-DLLAMA_CUBLAS=on" && set FORCE_CMAKE=1 && pip install --verbose --force-reinstall --no-cache-dir llama-cpp-python==0.1.77

You need to run the above complete line if you want the GPU to work.

The above steps worked for me, and I was able to get good results with an increase in performance.

@hjxy2012 commented Jan 13, 2024

[quotes the Docker setup and timing comparison from @streetycat above]

I ran the same Docker command on Windows 11. The llama-cpp-python-cuda image was created successfully.
But after I started the Docker container and typed http://localhost:8000 in my browser, I got {"detail": "Not Found"}.
Is there anything wrong?
The log in the Docker container is as follows:
INFO: 172.17.0.1:55544 - "GET / HTTP/1.1" 404 Not Found

I got the point: the requested URL is not right. The right URL is "http://localhost:8000/docs". Thank you.

@tomasruizt commented Jun 18, 2024

For me the GPU was only recognized after passing a lot more parameters to pip install:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.5 -DCUDAToolkit_ROOT=/usr/local/cuda-12.5 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.5/lib64" FORCE_CMAKE=1 pip install llama-cpp-python - no-cache-dir

Note that I'm using CUDA 12.5

Source: Medium Post
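An alternative that sometimes suffices (my own assumption, not something verified in this thread): point CMake at the right nvcc via the standard CUDACXX environment variable instead of spelling out every CUDAToolkit_* path:

    # assumes CUDA 12.5 lives under /usr/local/cuda-12.5
    CUDACXX=/usr/local/cuda-12.5/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
        pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir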

@BinhPQ2 commented Sep 27, 2024

I got it to work just by following the instructions; I'm using CUDA 12.3:

set CMAKE_ARGS="-DLLAMA_CUBLAS=on" && set FORCE_CMAKE=1 && pip install --no-cache-dir llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123
