Support cublas computation without requiring CUDA installed #350

Open

AsakusaRinne opened this issue Dec 9, 2023 · 15 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@AsakusaRinne (Collaborator)

Related issue: #345

@martindevans (Member)

llama.cpp copies cudart, cublas and cublasLt64 into the release package.

See here: https://github.com/ggerganov/llama.cpp/blob/master/.github/workflows/build.yml#L497

@AsakusaRinne (Collaborator, Author)

What about splitting the packages into three kinds in the future?

  • LLamaSharp.Core: the main LLamaSharp package only, with no native libraries.
  • LLamaSharp: the main package plus all native libraries.
  • LLamaSharp.Cuda: the main package, native libraries, and cublas dependencies, so that users don't need CUDA installed.
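
For illustration, a consumer would then pick exactly one of these (hypothetical package names as proposed above; none of them exist in this form yet):

```sh
dotnet add package LLamaSharp.Core   # bindings only; bring your own native library
dotnet add package LLamaSharp        # bindings plus all bundled native libraries
dotnet add package LLamaSharp.Cuda   # bindings, native libraries and cublas dependencies
```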

@SignalRT (Collaborator)

SignalRT commented Dec 10, 2023

@AsakusaRinne We should also think about LLaVA (or, more generally, any multimodal model). Right now this is not handled by the llama.cpp libraries but by an additional library. So far there is no GPU support, so it's simple to go with only one package or to integrate that DLL into the native libraries, but I suppose that will change in the near future.

@AsakusaRinne (Collaborator, Author)

@SignalRT I noticed that LLaVA requires another model named CLIP, and that llama.cpp doesn't provide C APIs for LLaVA, only an example. Do you have any ideas about how to support it?

@SignalRT (Collaborator)

@AsakusaRinne Yes, I have a working version, but it's not yet integrated cleanly into the current Executor architecture.

@martindevans (Member)

but it's not yet integrated cleanly into the current Executor architecture.

Just a note on that: the current executors will probably be replaced at some point when we swap over to batch decoding. A heads up so you don't spend a tonne of effort on that!

what about splitting the packages into three kinds in the future?

If we're going to split it, I'd say there should be an option to install CPU-only inference without any CUDA backend, simply because the CUDA binaries are huge.

@Onkitova

Hello again.
I was tinkering a little bit with https://github.com/sandrohanea/whisper.net, which is also a C# binding, but for another ggerganov project -- whisper.cpp -- not llama.cpp.
The thing is: it also utilizes CUDA through cuBLAS. At first, it looked like it wouldn't work without the CUDA Toolkit installed either (our current state), but then I tried a method from an issue discussion and... it accidentally worked.

(screenshot: whisper.net running on the GPU without the CUDA Toolkit installed)

What I want to say is that we can do the same thing and possibly avoid bloating LLamaSharp with another ~400 MB NuGet package of NVIDIA libraries. Instead, we could:

  1. tweak the library-loading code a little bit to allow loading the necessary NVIDIA libraries directly from the application folder (see the sketch after this list)

  2. redistribute those libraries as an archive, like ggerganov does with every new release

  3. add a little how-to note about CUDA utilization to README.md
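
For illustration, point 1 could be as small as preloading the DLLs before the llama library is first touched. A minimal, Windows-centric sketch, assuming the cu12 file names from the archives mentioned below (class and method names are hypothetical, not LLamaSharp's actual loader):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Rough sketch of point 1 above -- illustrative only. Preloading the CUDA
// runtime DLLs by absolute path means the OS loader finds them already in
// the process when the llama library's imports are later resolved by name,
// so neither a CUDA Toolkit install nor a PATH entry is needed.
static class CudaPreloader
{
    private static readonly string[] CudaLibraries =
    {
        "cudart64_12.dll", "cublas64_12.dll", "cublasLt64_12.dll"
    };

    public static void PreloadFromAppFolder()
    {
        foreach (string library in CudaLibraries)
        {
            string path = Path.Combine(AppContext.BaseDirectory, library);

            // Missing or unloadable files are simply skipped: CPU-only fallback.
            if (File.Exists(path) && NativeLibrary.TryLoad(path, out IntPtr handle))
                Console.WriteLine($"Preloaded {library}");
        }
    }
}
```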

@AsakusaRinne (Collaborator, Author)

tweak the library-loading code a little bit to allow loading the necessary NVIDIA libraries directly from the application folder

I'm not sure if it will work to specify a path to load them. Generally, if these files are not added to the PATH environment variable, they need to be put in the same folder as the library that depends on them. I'll give it a try :)

redistribute those libraries as an archive, like ggerganov does with every new release, and add a little how-to note about CUDA utilization to README.md

It's a good idea. As you mentioned, yes, these files are very large, which I had overlooked before.

@Onkitova

I'm not sure if it will work to specify a path to load them.

It might work, because -- as I mentioned above -- this is how it works for the original llama.cpp and whisper.cpp as well (and it works the same way even in the C# binding). Possibly we don't even need to specify a path, just avoid interfering with how ggerganov's library (the bin-win-cublas-cu12.2.0-x64.zip build, for example) looks for the NVIDIA libraries next to it on launch.

Another example:

  • llama.cpp original repo, "bin-win-cublas-cu12.2.0-x64.zip" version
  • CUDA Toolkit is not installed, and the system PATH variables (regarding CUDA) are not present at all

(screenshots: llama.cpp cuBLAS build running successfully without the CUDA Toolkit installed)
Don't get me wrong: I'm not rushing you in any way or direction -- I just want to point out the findings.

Onkitova added a commit to Onkitova/LLamaSharp that referenced this issue Dec 14, 2023
Using CUDA while decoupling from the CUDA Toolkit as a hard-dependency

Possible solution for SciSharp#350

Adds an alternative, fallback method of detecting the system-supported CUDA version, making the CUDA Toolkit installation optional. Technically, it parses the output of the command-line tool "nvidia-smi" (preinstalled with the NVIDIA drivers), which includes the CUDA version supported by the system.

I can confirm it works on Windows only, but I suppose a similar approach can be used for Linux and macOS as well. I didn't touch the code for those two platforms, though.

After that, CUDA can be used simply by putting the NVIDIA libraries from the llama.cpp repo's "bin-win-cublas-cu12.2.0-x64.zip" asset into the root folder of the built program -- for example, "\LLama.Examples\bin\Debug\net8.0\".
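
To illustrate the idea (not the commit's actual code -- the names here are hypothetical), a minimal nvidia-smi probe might look like this:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Minimal sketch of the nvidia-smi fallback described in this commit.
// nvidia-smi ships with the NVIDIA drivers, so it is available even
// when the CUDA Toolkit is not installed.
public static class CudaVersionProbe
{
    // Returns the major CUDA version reported by the driver,
    // or -1 if nvidia-smi is unavailable (no NVIDIA driver installed).
    public static int GetCudaMajorVersion()
    {
        try
        {
            var psi = new ProcessStartInfo("nvidia-smi")
            {
                RedirectStandardOutput = true,
                UseShellExecute = false,
                CreateNoWindow = true
            };

            using var process = Process.Start(psi);
            string output = process!.StandardOutput.ReadToEnd();
            process.WaitForExit();

            // The nvidia-smi header contains e.g. "CUDA Version: 12.2".
            var match = Regex.Match(output, @"CUDA Version:\s*(\d+)\.\d+");
            return match.Success ? int.Parse(match.Groups[1].Value) : -1;
        }
        catch
        {
            return -1;
        }
    }
}
```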

@martindevans (Member)

I've started #361 with an update for the new llama.cpp binaries. I'm not sure I'll get time to investigate packaging up the cudart binaries myself, but if anyone wants to work on a PR to modify the Compile Binaries action (this one) to package whatever is necessary, I'll be happy to modify my PR to use the new binaries.

@Onkitova

I've started #361 with an update for the new llama.cpp binaries. I'm not sure I'll get time to investigate packaging up the cudart binaries myself, but if anyone wants to work on a PR to modify the Compile Binaries action (this one) to package whatever is necessary, I'll be happy to modify my PR to use the new binaries.

I'd be happy to help with this, but I have no experience at all with GitHub's CI/CD stuff, so I won't attempt that part, to avoid making a mess with potentially incorrect code.

All I can say is that we just need to redistribute the same archives with the CUDA libraries that ggerganov ships with each release of llama.cpp. Right now he publishes two archives: CUDA 11 binaries (cudart-llama-bin-win-cu11.7.1-x64.zip) and CUDA 12 binaries (cudart-llama-bin-win-cu12.2.0-x64.zip). Perhaps you can even see how it is implemented in his counterpart of the "Compile Binaries" action. The same archives ship with each release, without changes. For the record, the binaries in these archives are taken from the CUDA Toolkit installation's bin folder.

After that, we just need to add instructions to README.md regarding CUDA utilization. Something like: "If you want to utilize the power of NVIDIA graphics cards, there are two ways:
a) install the CUDA Toolkit on the developers' and end users' computers as a dependency,
OR
b) download the archives with the libraries and put them into the project files, with the "Copy to output" file property set to 'if newer' or 'always'."
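
For illustration, option (b) in a consumer's .csproj could look something like this (file names taken from the cu12 archive; a sketch, not an official recipe):

```xml
<!-- Copy the CUDA runtime DLLs (downloaded from the llama.cpp release
     archive) next to the built executable. -->
<ItemGroup>
  <None Include="cudart64_12.dll" CopyToOutputDirectory="PreserveNewest" />
  <None Include="cublas64_12.dll" CopyToOutputDirectory="PreserveNewest" />
  <None Include="cublasLt64_12.dll" CopyToOutputDirectory="PreserveNewest" />
</ItemGroup>
```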

@martindevans (Member)

Is it possible to redistribute both sets of runtimes in the nuget packages? That way there are no extra manual steps required.

I'd be happy to help with this, but I have no experience at all with GitHub's CI/CD stuff, so I won't attempt that part, to avoid making a mess with potentially incorrect code.

That's not a problem; I'm happy to make the actual changes to the GitHub action. The main thing I need help with is knowing what changes to make!

For example, you could try putting the cudart files into the runtimes/deps folder and modifying the targets file (to specify what goes where at build time). If that then runs without the CUDA Toolkit installed, that'll confirm everything works and we can distribute it.
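
As a rough sketch of what that targets entry might look like (the folder layout and condition are assumptions, not the repo's actual targets file):

```xml
<!-- Copy the bundled cudart binaries to the output directory at build time.
     Paths and the platform condition here are illustrative assumptions. -->
<Project>
  <ItemGroup Condition="$([MSBuild]::IsOSPlatform('Windows'))">
    <None Include="$(MSBuildThisFileDirectory)..\runtimes\deps\cu12.1.0\*.dll">
      <Link>%(Filename)%(Extension)</Link>
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
  </ItemGroup>
</Project>
```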

Once we know what files need to go where I can modify the build action to put those files in the right place automatically.

Onkitova added a commit to Onkitova/LLamaSharp that referenced this issue Dec 16, 2023
Related to: SciSharp#350

Nvidia CUDA binaries are taken from archives:
- CUDA 11 (cudart-llama-bin-win-cu11.7.1-x64.zip)
- CUDA 12 (cudart-llama-bin-win-cu12.2.0-x64.zip)
from the latest (at the moment of writing this) build of ggerganov's [llama.cpp](https://github.com/ggerganov/llama.cpp/releases/tag/b1643).

Editing the .nuspec at this point is debatable, however.

@Onkitova

Is it possible to redistribute both sets of runtimes in the nuget packages?

I guess so, yes.
There is one problem, however; please check out the recent pull request #371 for details.

@martindevans (Member)

Thanks for looking into that

@martindevans (Member)

martindevans commented Jan 7, 2024

Over in #371 Onkitova investigated using cudart, which seems to work. However, the files are huge, so we don't want to include them in this repo or in our CUDA nuget packages. Instead, we decided on releasing separate nuget packages with the cudart binaries in them, which LLamaSharp can then depend on:

Cuda11.7.1.runtime:

  • cu11.7.1/cublas64_11(.dll/.so)
  • cu11.7.1/cublasLt64_11(.dll/.so)
  • cu11.7.1/cudart64_110(.dll/.so)

Cuda12.1.0.runtime:

  • cu12.1.0/cublas64_12(.dll/.so)
  • cu12.1.0/cublasLt64_12(.dll/.so)
  • cu12.1.0/cudart64_12(.dll/.so)

Someone will need to take on this work:

  1. Create a GH action which will download the cuda runtimes (it can just be manually triggered)
  2. Package them into nuget packages
  3. Push to nuget

This existing action should serve as an example of how to do most of this.
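
A rough sketch of such an action (the archive names and release tag come from earlier in this thread; the packaging projects and secret name are assumptions):

```yaml
name: Package CUDA runtimes
on: workflow_dispatch   # manually triggered, as described above

jobs:
  package:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # 1. Download the cudart archives from a llama.cpp release.
      - name: Download CUDA runtimes
        run: |
          curl -LO https://github.com/ggerganov/llama.cpp/releases/download/b1643/cudart-llama-bin-win-cu11.7.1-x64.zip
          curl -LO https://github.com/ggerganov/llama.cpp/releases/download/b1643/cudart-llama-bin-win-cu12.2.0-x64.zip
          unzip cudart-llama-bin-win-cu11.7.1-x64.zip -d cu11.7.1
          unzip cudart-llama-bin-win-cu12.2.0-x64.zip -d cu12.2.0

      # 2. Package them into nuget packages (hypothetical packaging projects,
      #    one per CUDA version, each including the unzipped binaries).
      - name: Pack
        run: |
          dotnet pack Cuda11.runtime.csproj -c Release -o out
          dotnet pack Cuda12.runtime.csproj -c Release -o out

      # 3. Push to nuget.
      - name: Push
        run: dotnet nuget push "out/*.nupkg" --api-key ${{ secrets.NUGET_API_KEY }} --source https://api.nuget.org/v3/index.json
```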
