How to use GPU? #576
You need to use n_gpu_layers in the initialization of Llama(), which offloads some of the work to the GPU. If you have enough VRAM, just put an arbitrarily high number, or decrease it until you don't get out-of-VRAM errors. For example, see the sketch below. Also, to get GPU support, you need to pip install it from source (you might need the CUDA Toolkit).
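A minimal sketch of what that initialization might look like (the model path and layer count are placeholders, not values from this thread):

```python
from llama_cpp import Llama

# Offload part of the model to the GPU; lower n_gpu_layers if you hit
# out-of-VRAM errors.
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,
)

output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])
```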
Thank you mzen. When I run the command for installing it from source, I get an error. (By the way, I have the CUDA Toolkit installed.)
Any idea how to fix this?
PyTorch comes with its own CUDA, so it is likely something with your CUDA installation. What version of the CUDA Toolkit do you use?
Here is the output of nvcc --version. Also, for some reason the installation did just work... but it still says BLAS=0 and the work is not done on my GPU, even though I have set 40 GPU layers...
Forgot to mention, but make sure you set the env variable FORCE_CMAKE to 1 before running the install. On Linux, the command would be the one sketched below; if you are on Windows, you would use SET instead.
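Something like this one-liner (same cuBLAS flag that appears later in the thread; adjust the flags to your llama-cpp-python version):

```bash
# Inline environment variables apply only to this pip invocation
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python
```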
Thanks for the information here; the Dockerfile example was also very helpful.
@radames, can you share the docker run command?
Hi @kirkog86, you can try the docker run command sketched below.
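The command itself was lost in this copy of the thread; a hedged reconstruction, assuming a locally built image named llama-cpp-python-cuda (the name mentioned later in the thread), a host models directory, and the server on its default port 8000, could look like:

```bash
# --gpus all exposes the NVIDIA GPU (requires the NVIDIA Container Toolkit);
# the image name, mount path and MODEL value are assumptions, not the
# exact values from the original comment.
docker run --rm -it --gpus all -p 8000:8000 \
  -v "$PWD/models:/models" \
  -e MODEL=/models/llama-2-7b-chat.Q4_K_M.gguf \
  llama-cpp-python-cuda
```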
Thanks, @radames! Works very well, including the API.
Hi @kirkog86, you'll have to play around; you can change the llama-cpp params to adapt them to your specific hardware. In my Docker example I haven't exposed the param, but you could change it in the server invocation.
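For example (an illustration, not the exact line from the Dockerfile), the layer count can be passed straight to the bundled server:

```bash
# --n_gpu_layers controls how many layers are offloaded to the GPU;
# the model path is a placeholder.
python3 -m llama_cpp.server --model /models/llama-2-7b-chat.Q4_K_M.gguf --n_gpu_layers 35
```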
@mzen17 @radames I tried the following command on Windows but the GPU is not utilised: CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
Can you please let me know if anything is missing in the steps?
@YogeshTembe are you following this: https://github.com/abetlen/llama-cpp-python#windows-remarks ?
@radames Yes, I have followed the same steps.
@radames But try the command sketched below.
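The suggested command did not survive the copy; on Windows the SET-based form mentioned earlier in the thread would plausibly be meant (PowerShell users would set $env:CMAKE_ARGS instead), for example:

```bat
:: The bash-style VAR=value prefix from the Linux command does not work in
:: cmd.exe, so set the variables explicitly before installing.
SET CMAKE_ARGS=-DLLAMA_CUBLAS=on
SET FORCE_CMAKE=1
pip install --force-reinstall --no-cache-dir llama-cpp-python
```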
How is the performance? I started a server with this Docker image, but I didn't find it faster than the CPU, and the CPU is also heavily occupied. I start this Docker container as follows:
and the performance:
and the performance with CPU only:
OK, I have finished it.
Fixes building for x86 processors missing the F16C feature set. MSVC is not included, as in MSVC F16C is implied by AVX2/AVX512.
The problem I ran into here is that I can install and run it successfully, but once BLAS equals 1, the LLM no longer generates any response to my prompt, which is weird.
I was facing a similar issue, so here is what I did:
1. I installed CUDA v11.8 and cuDNN v8.9.6. You need to check the TensorFlow version you are currently using; for me 2.10.0 worked (versions ^2.10 failed for me), with Python 3.11.0.
2. You need to create a folder on the C drive and name it cuda or cuDNN as you wish, then extract the cuDNN files into it. These are the paths that you need to set.
3. Also, after installing CUDA, you have to add the paths to the environment variables.
4. Once the above steps are done, you need to install PyTorch for CUDA 11.8.
5. Then install llama-cpp-python; you need to add the complete CMAKE_ARGS/FORCE_CMAKE line if you want the GPU to work (see the sketch after this list).

The above steps worked for me, and I was able to get good results with an increase in performance.
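A sketch of what steps 4 and 5 might look like on the command line (the cu118 index URL and the cuBLAS flag are the standard ones, but treat the exact values as assumptions for your setup):

```bash
# Step 4: PyTorch built against CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Step 5: llama-cpp-python rebuilt with cuBLAS support
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python
```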
I ran the same Docker command on Windows 11, and the llama-cpp-python-cuda image was created successfully. I got the point: the URL I was requesting was not right. The right one is "http://localhost:8000/docs". Thank you.
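Besides the interactive docs page, the server also answers OpenAI-style requests; a quick smoke test could look like this (endpoint and payload follow the standard llama_cpp.server API rather than anything shown in the thread):

```bash
# Ask the running server for a short completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Q: Name the planets in the solar system. A:", "max_tokens": 32}'
```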
For me, the GPU was only recognized after passing a lot more parameters to the build command (see the sketch below).
Note that I'm using CUDA 12.5. Source: Medium post.
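The exact parameter list was not captured here; the following is only an illustrative guess at the kind of build invocation meant, pointing the build explicitly at a CUDA 12.5 toolkit (all paths and flags are assumptions; adjust them to your system and llama-cpp-python version):

```bash
# Newer llama.cpp builds use -DGGML_CUDA=on, older ones -DLLAMA_CUBLAS=on;
# CUDACXX and CMAKE_CUDA_COMPILER point CMake at the CUDA 12.5 nvcc.
CUDACXX=/usr/local/cuda-12.5/bin/nvcc \
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.5/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=native" \
FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python
```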
I got it to work just by following the instructions; I'm using CUDA 12.3.
I run llama-cpp-python on my new PC, which has a built-in RTX 3060 with 12 GB of VRAM.
This is my code:
This code works and I get the results that I want, but the inference is terribly slow: for a few tokens it takes up to 10 seconds. How do I minimize this time? I don't think my GPU is doing the heavy lifting here...
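The poster's code was not captured. As a hedged checklist, passing n_gpu_layers and watching the verbose load log for the offload/BLAS lines is the usual way to confirm the GPU is actually doing the work, roughly like this (the model path is a placeholder):

```python
from llama_cpp import Llama

# n_gpu_layers=-1 requests that all layers be offloaded; verbose=True prints
# the load log, where you can see how many layers landed on the GPU and
# whether the build was compiled with CUDA/BLAS support.
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=True,
)

print(llm("Hello, world:", max_tokens=16)["choices"][0]["text"])
```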