
How to use GPU? #576

Open
imwide opened this issue Aug 5, 2023 · 21 comments
Labels: build, hardware (Hardware specific issue)

Comments

@imwide commented Aug 5, 2023

I run llama-cpp-python on my new PC, which has a built-in RTX 3060 with 12 GB of VRAM.
This is my code:

from llama_cpp import Llama

llm = Llama(model_path="./wizard-mega-13B.ggmlv3.q4_0.bin", n_ctx=2048)

def generate(params):
    print(params["prompt"])
    output = llm(params["prompt"], max_tokens=params["max_tokens"], stop=params["stop"], echo=params["echo"])
    return output  # return the completion dict so the caller can read output["choices"][0]["text"]

This code works and I get the results that I want, but the inference is terribly slow: for a few tokens it takes up to 10 seconds. How do I minimize this time? I don't think my GPU is doing the heavy lifting here...

@mzen17 (Contributor) commented Aug 8, 2023

You need to use n_gpu_layers in the initialization of Llama(), which offloads some of the work to the GPU. If you have enough VRAM, just put an arbitrarily high number, or decrease it until you no longer get out-of-VRAM errors.

For example:
llm = Llama(model_path="./wizard-mega-13B.ggmlv3.q4_0.bin", n_ctx=2048, n_gpu_layers=30)
See the API Reference for the full parameter list.

Also, to get GPU support, you need to pip install it from source (you might need the CUDA Toolkit):
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python [copied from the README]

@imwide (Author) commented Aug 8, 2023

Thank you @mzen17. When I run the command for installing it from source, I get an error (btw I have the CUDA toolkit installed):

-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Found CUDAToolkit: /usr/local/cuda/include (found version "9.0.176")
      -- cuBLAS found
      -- The CUDA compiler identification is unknown
      CMake Error at /tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
        Failed to detect a default CUDA architecture.
      
      
      
        Compiler output:
      
      Call Stack (most recent call first):
        vendor/llama.cpp/CMakeLists.txt:249 (enable_language)
      
      
      -- Configuring incomplete, errors occurred!
      Traceback (most recent call last):
        File "/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/setuptools_wrap.py", line 666, in setup
          env = cmkr.configure(
        File "/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/cmaker.py", line 357, in configure
          raise SKBuildError(msg)
      
      An error occurred while configuring with CMake.
        Command:
          /tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/cmake/data/bin/cmake /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45 -G Ninja -DCMAKE_MAKE_PROGRAM:FILEPATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/ninja/data/bin/ninja --no-warn-unused-cli -DCMAKE_INSTALL_PREFIX:PATH=/tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45/_skbuild/linux-x86_64-3.10/cmake-install -DPYTHON_VERSION_STRING:STRING=3.10.12 -DSKBUILD:INTERNAL=TRUE -DCMAKE_MODULE_PATH:PATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/skbuild/resources/cmake -DPYTHON_EXECUTABLE:PATH=/usr/bin/python3 -DPYTHON_INCLUDE_DIR:PATH=/usr/include/python3.10 -DPYTHON_LIBRARY:PATH=/usr/lib/x86_64-linux-gnu/libpython3.10.so -DPython_EXECUTABLE:PATH=/usr/bin/python3 -DPython_ROOT_DIR:PATH=/usr -DPython_FIND_REGISTRY:STRING=NEVER -DPython_INCLUDE_DIR:PATH=/usr/include/python3.10 -DPython3_EXECUTABLE:PATH=/usr/bin/python3 -DPython3_ROOT_DIR:PATH=/usr -DPython3_FIND_REGISTRY:STRING=NEVER -DPython3_INCLUDE_DIR:PATH=/usr/include/python3.10 -DCMAKE_MAKE_PROGRAM:FILEPATH=/tmp/pip-build-env-4u5rg2oq/overlay/local/lib/python3.10/dist-packages/ninja/data/bin/ninja -DLLAMA_CUBLAS=on -DCMAKE_BUILD_TYPE:STRING=Release -DLLAMA_CUBLAS=on
        Source directory:
          /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45
        Working directory:
          /tmp/pip-install-eeziyff4/llama-cpp-python_7c9fe262c5904a37b508cec72f0b7d45/_skbuild/linux-x86_64-3.10/cmake-build
      Please see CMake's output for more information.
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Any idea how to fix this?
I saw it says "Failed to detect a default CUDA architecture", even though I have CUDA installed. When doing torch.cuda.is_available() it returns True...

@mzen17 (Contributor) commented Aug 8, 2023

PyTorch ships with its own CUDA runtime, so it is likely an issue with your system CUDA install.

What version of the CUDA toolkit do you use?

@imwide (Author) commented Aug 8, 2023

Using nvcc --version, this is the output:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

Also, for some reason the installation just worked now... but it still says BLAS = 0 and the work is not done on my GPU, even though I have set n_gpu_layers=40...

@mzen17 (Contributor) commented Aug 8, 2023

Forgot to mention, but make sure you set the environment variable FORCE_CMAKE to 1 before running the install.

On Linux, the command would be
export FORCE_CMAKE=1

If you are on Windows, it should be set:
set FORCE_CMAKE=1
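Putting those pieces together on Linux, a minimal sketch (the --force-reinstall and --no-cache-dir flags matter because pip may otherwise reuse a previously built CPU-only wheel):

    export FORCE_CMAKE=1
    CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir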

@gjmulder added the build and hardware (Hardware specific issue) labels Aug 9, 2023
@radames commented Aug 20, 2023

Thanks for the information here; the Dockerfile example was also very helpful.
I have a fully functional demo running with a Gradio UI and GPU here, in case it is helpful for others:
https://huggingface.co/spaces/SpacesExamples/llama-cpp-python-cuda-gradio

@bash-bandicoot

@radames, can you share the docker run command?

@radames commented Aug 28, 2023

Hi @kirkog86, you can try this:

docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all \
	-e HF_HOME="/data/.huggingface" \
	-e REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
	-e MODEL_FILE="llama-2-7b-chat.ggmlv3.q5_0.bin" \
	registry.hf.space/spacesexamples-llama-cpp-python-cuda-gradio:latest

@bash-bandicoot

Thanks, @radames! It works very well, including the API.
By the way, any suggestions on a faster model, provided I have enough hardware?

@radames commented Aug 29, 2023

Hi @kirkog86, you'll have to play around; you can change the llama-cpp params to adapt to your specific hardware. In my Docker example I haven't exposed the param, but you could change n_gpu_layers. You can also explore the additional options.

@YogeshTembe

@mzen17 @radames I tried the following commands on Windows but the GPU is not utilised.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

  1. set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
  2. set FORCE_CMAKE=1
  3. pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Can you please let me know if anything is missing in these steps?
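A hedged guess at the cause, since it is a common Windows pitfall: with cmd.exe, set CMAKE_ARGS="-DLLAMA_CUBLAS=on" stores the surrounding quotes as part of the value, so the flag may never reach CMake as intended. A sketch of the usual workarounds (cmd.exe and PowerShell respectively):

    REM cmd.exe - no quotes around the value
    set CMAKE_ARGS=-DLLAMA_CUBLAS=on
    set FORCE_CMAKE=1
    pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

    # PowerShell
    $env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
    $env:FORCE_CMAKE = "1"
    pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir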

@radames commented Oct 16, 2023

@YogeshTembe

@radames Yes, I have followed the same.
We just need to set one variable, right? => CMAKE_ARGS="-DLLAMA_OPENBLAS=on"

@imwide (Author) commented Oct 17, 2023

@radames
Don't just run CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Instead, try CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
It worked for me with the same issue...

@streetycat

thanks for the information here, also the Dockerfile example was very helpful. I have a fully functional demo running with Gradio UI and GPU here if this is helpful for others https://huggingface.co/spaces/SpacesExamples/llama-cpp-python-cuda-gradio

How is the performance? I started a server with this Docker image, but I didn't find it faster than the CPU, and the CPU is also heavily loaded.

I start the Docker container as follows:

git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
docker build -t llama-cpp-python-cuda docker/cuda_simple/
docker run --gpus all --rm -it -p 8000:8000 -v /models/llama:/models -e MODEL=/models/llama-2-7b-chat.Q4_0.gguf llama-cpp-python-cuda

And the performance with the GPU container:

llama_print_timings:        load time =  6922.67 ms
llama_print_timings:      sample time =    33.68 ms /    83 runs   (    0.41 ms per token,  2464.44 tokens per second)
llama_print_timings: prompt eval time =  6922.56 ms /   185 tokens (   37.42 ms per token,    26.72 tokens per second)
llama_print_timings:        eval time = 10499.28 ms /    82 runs   (  128.04 ms per token,     7.81 tokens per second)
llama_print_timings:       total time = 17853.78 ms

And the performance with CPU only:

llama_print_timings:        load time =  6582.30 ms
llama_print_timings:      sample time =    22.01 ms /    56 runs   (    0.39 ms per token,  2544.30 tokens per second)
llama_print_timings: prompt eval time =  6582.18 ms /   175 tokens (   37.61 ms per token,    26.59 tokens per second)
llama_print_timings:        eval time =  7019.08 ms /    55 runs   (  127.62 ms per token,     7.84 tokens per second)
llama_print_timings:       total time = 13941.88 ms
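The GPU and CPU timings above are almost identical, which usually means no layers were actually offloaded (the server defaults to CPU-only unless n_gpu_layers is set). A sketch of how that could be changed in this setup; note that the N_GPU_LAYERS environment variable is an assumption based on how MODEL is passed here, while the --n_gpu_layers flag of llama_cpp.server is the documented route:

    # assumption: the server picks up settings from env vars the same way it picks up MODEL
    docker run --gpus all --rm -it -p 8000:8000 -v /models/llama:/models \
        -e MODEL=/models/llama-2-7b-chat.Q4_0.gguf \
        -e N_GPU_LAYERS=35 \
        llama-cpp-python-cuda

    # or, outside Docker, pass the flag to the server directly
    python3 -m llama_cpp.server --model /models/llama-2-7b-chat.Q4_0.gguf --n_gpu_layers 35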

@streetycat

[quotes the previous comment above]

Ok, I have finished it.

#828

antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this issue Oct 30, 2023
Fixes building for x86 processors missing F16C featureset
MSVC not included, as in MSVC F16C is implied with AVX2/AVX512
@JimmyJIA-02

The problem I met here is that I can successfully install it and run it, but once I have BLAS equal to 1, the LLM no longer generates any response to my prompt, which is weird.

@ankshith commented Dec 11, 2023

1> I was facing a similar issue, so what I did was install CUDA v11.8 and cuDNN v8.9.6. You need to check the TensorFlow version you are currently using; for me 2.10.0 worked, newer versions failed, and I used Python 3.11.0.

2> You need to create a folder on the C drive and name it cuda or cuDNN as you wish, then extract the files from the downloaded cuDNN zip into that folder, then go to environment variables and edit PATH.

  1. C:\cuDNN\bin
  2. C:\cuDNN\include
  3. C:\cuDNN\lib\x64

These are the paths that you need to set.

3> Also, after installing CUDA, you have to set these paths in the environment variables:

  1. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\extras\CUPTI\lib64
  2. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include

4> After doing the above steps you need to install PyTorch for CUDA 11.8:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

5> Then install llama-cpp-python:

set CMAKE_ARGS="-DLLAMA_CUBLAS=on" && set FORCE_CMAKE=1 && pip install --verbose --force-reinstall --no-cache-dir llama-cpp-python==0.1.77

You need to run the above complete line if you want the GPU to work.

The above steps worked for me, and I was able to get good results with an increase in performance.

@hjxy2012 commented Jan 13, 2024

[quotes the Docker setup and timing comparison from @streetycat above]

I ran the same Docker command on Windows 11. The llama-cpp-python-cuda image was created successfully.
But after I started the Docker container and typed http://localhost:8000 in my browser, I got {"detail": "Not Found"}.
Is there anything wrong?
The log in the Docker container is as follows:
INFO: 172.17.0.1:55544 - "GET / HTTP/1.1" 404 Not Found

I got the point: the requested URL is not right. The right URL is "http://localhost:8000/docs". Thank you.

@tomasruizt commented Jun 18, 2024

For me the GPU was only recognized after passing a lot more parameters to pip install:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.5 -DCUDAToolkit_ROOT=/usr/local/cuda-12.5 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.5/lib64" FORCE_CMAKE=1 pip install llama-cpp-python - no-cache-dir

Note that I'm using CUDA 12.5

Source: Medium Post
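An alternative that sometimes suffices (my own assumption, not something verified in this thread): point CMake at the right nvcc via the standard CUDACXX environment variable instead of spelling out every CUDAToolkit_* path:

    # assumes CUDA 12.5 lives under /usr/local/cuda-12.5
    CUDACXX=/usr/local/cuda-12.5/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
        pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir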

@BinhPQ2 commented Sep 27, 2024

I got it to work just by following the instructions; I'm using CUDA 12.3:

set CMAKE_ARGS="-DLLAMA_CUBLAS=on" && set FORCE_CMAKE=1 && pip install --no-cache-dir llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123
