Support cuBLAS computation without requiring CUDA installed #350
llama.cpp copies cudart, cublas and cublasLt64 into the release package. See here: https://github.com/ggerganov/llama.cpp/blob/master/.github/workflows/build.yml#L497
What about splitting the packages into three kinds in the future?
@AsakusaRinne We should also think about LLaVA (or, more generally, any multimodal model). Right now this is not managed by the llama.cpp libraries but by an additional library. Until now there has been no GPU support involved, so it's simple to go with only one package or to integrate that DLL into the native libraries, but I suppose that will change in the near future.
@SignalRT I noticed that LLaVA requires another model named CLIP, and that llama.cpp doesn't provide C APIs for LLaVA, only an example. Do you have any idea about how to support it?
@AsakusaRinne Yes, I have a working version, but it is not yet integrated cleanly into the current Executor architecture.
Just a note on that: the current executors will probably be replaced at some point when we swap over to batch decoding. Just a heads up so you don't spend a tonne of effort on that!
If we're going to split it, I'd say there should be an option to install CPU-only inference without any CUDA backend, simply because the CUDA binaries are huge.
Hello again. What I want to say is that we can do the same thing and possibly avoid bloating LLamaSharp with another ~400 MB NuGet package of NVIDIA libraries. Instead, we could possibly:
I'm not sure whether specifying a path to load them will work. Generally, if these files are not added to the PATH environment variable, we should put them in the same folder as the library that depends on them. I'll give it a try :)
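For reference, a minimal sketch of what probing for the DLLs next to the application could look like in C# (the class and the exact file names are illustrative assumptions, not code from LLamaSharp):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

public static class CudaDependencyProbe
{
    // Illustrative CUDA 12 runtime DLL names; the real set is whatever the
    // llama.cpp cudart release archive ships (cudart, cublas, cublasLt).
    private static readonly string[] CudaDlls =
    {
        "cudart64_12.dll",
        "cublas64_12.dll",
        "cublasLt64_12.dll",
    };

    // Returns true if every CUDA runtime DLL can be loaded from the
    // application's base directory, i.e. without a CUDA Toolkit install.
    public static bool TryLoadFromAppDirectory()
    {
        foreach (var dll in CudaDlls)
        {
            var path = Path.Combine(AppContext.BaseDirectory, dll);
            if (!File.Exists(path) || !NativeLibrary.TryLoad(path, out _))
                return false;
        }
        return true;
    }
}
```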
It's a good idea. As you mentioned, the sizes of these files are very large, which I had overlooked before.
It might work, because -- as I mentioned above -- this is how the original llama.cpp and whisper.cpp work as well (and it works the same way even in the C# binding). Possibly we don't even need to specify a path; we just shouldn't interfere with how ggerganov's library (the bin-win-cublas-cu12.2.0-x64.zip version, for example) looks for the NVIDIA libraries nearby on launch. Another example:
Using CUDA while decoupling from the CUDA Toolkit as a hard dependency. Possible solution for SciSharp#350. This adds an alternative, fallback method of detecting the system-supported CUDA version, making a CUDA Toolkit installation optional. Technically, it uses the output of the command-line tool "nvidia-smi" (preinstalled with the NVIDIA drivers), which also reports the CUDA version supported on the system. I can confirm it works only on Windows, but I suppose a similar approach could be used for Linux and macOS as well; I didn't touch the code for those two platforms, though. After that, CUDA can be used simply by putting the NVIDIA libraries from the original llama.cpp repo's "bin-win-cublas-cu12.2.0-x64.zip" asset into the root folder of the built program, for example: "\LLama.Examples\bin\Debug\net8.0\".
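For illustration, a rough C# sketch of the nvidia-smi approach described above (class and method names are hypothetical, not the actual code from that PR):

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class NvidiaSmiCudaDetector
{
    // Runs "nvidia-smi" and extracts the "CUDA Version: X.Y" field printed in
    // its header. Returns null if the tool is missing or the output is unexpected.
    public static Version TryGetCudaVersion()
    {
        try
        {
            var psi = new ProcessStartInfo("nvidia-smi")
            {
                RedirectStandardOutput = true,
                UseShellExecute = false,
                CreateNoWindow = true,
            };

            using (var process = Process.Start(psi))
            {
                string output = process.StandardOutput.ReadToEnd();
                process.WaitForExit();

                var match = Regex.Match(output, @"CUDA Version:\s*(\d+)\.(\d+)");
                return match.Success
                    ? new Version(int.Parse(match.Groups[1].Value), int.Parse(match.Groups[2].Value))
                    : null;
            }
        }
        catch
        {
            // nvidia-smi not on PATH, or no NVIDIA driver installed.
            return null;
        }
    }
}
```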
I've started #361 with an update for the new llama.cpp binaries. I'm not sure I'll get time to investigate packaging up the cudart binaries myself, but if anyone wants to work on a PR to modify the
I'd be happy to help with this, but I have no experience at all with GitHub's CI/CD, so I'm not going to attempt that part myself and risk causing a fuss with potentially incorrect code. All I can say is that we just need to redistribute the same archives with CUDA libraries that ggerganov ships with each release of llama.cpp. Right now he publishes two archives: CUDA 11 binaries (cudart-llama-bin-win-cu11.7.1-x64.zip) and CUDA 12 binaries (cudart-llama-bin-win-cu12.2.0-x64.zip). Perhaps you can even see how it is implemented in his equivalent of the "Compile Binaries" action: the same archives with each release, without changes. The binaries in these archives are taken from the CUDA Toolkit installation's bin folder, it should be said. After that, we just need to add an instruction to readme.md regarding CUDA usage. Something like: "if you want to utilize the power of NVIDIA graphics cards, there are 2 ways:
Is it possible to redistribute both sets of runtimes in the nuget packages? That way there are no extra manual steps required.
That's not a problem; I'm happy to make the actual changes to the GitHub action. The main thing I need help with is knowing what changes to make! For example, you could try putting the cudart files into the runtimes/deps folder and modifying the targets file (which specifies what goes where at build time). If that then runs without the CUDA Toolkit installed, that will confirm everything works and we can distribute it. Once we know which files need to go where, I can modify the build action to put those files in the right place automatically.
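As a rough illustration of that idea, a .targets entry along these lines could copy the cudart files to the consuming application's output directory (the paths and DLL names here are assumptions based on a CUDA 12 layout, not the actual LLamaSharp targets file):

```xml
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <!-- Copy the CUDA runtime binaries next to the built application so they can
       be found at load time without a CUDA Toolkit installation. -->
  <ItemGroup Condition="'$(OS)' == 'Windows_NT'">
    <None Include="$(MSBuildThisFileDirectory)..\runtimes\deps\cu12.2.0\cudart64_12.dll">
      <Link>cudart64_12.dll</Link>
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <Visible>false</Visible>
    </None>
    <None Include="$(MSBuildThisFileDirectory)..\runtimes\deps\cu12.2.0\cublas64_12.dll">
      <Link>cublas64_12.dll</Link>
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <Visible>false</Visible>
    </None>
    <None Include="$(MSBuildThisFileDirectory)..\runtimes\deps\cu12.2.0\cublasLt64_12.dll">
      <Link>cublasLt64_12.dll</Link>
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <Visible>false</Visible>
    </None>
  </ItemGroup>
</Project>
```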
Related to: SciSharp#350. The NVIDIA CUDA binaries are taken from two archives, CUDA 11 (cudart-llama-bin-win-cu11.7.1-x64.zip) and CUDA 12 (cudart-llama-bin-win-cu12.2.0-x64.zip), from the latest (at the time of writing) build of ggerganov's [llama.cpp](https://github.com/ggerganov/llama.cpp/releases/tag/b1643). Editing the .nuspec at this point is debatable, however.
I guess: yes.
Thanks for looking into that.
Over in #371 Onkitova investigated using cudart, which seems to work. However, the files are huge, so we don't want to include them in this repo or in our CUDA NuGet packages. Instead, we decided on releasing separate NuGet packages with the cudart binaries in them, which LLamaSharp can then depend on:
Someone will need to take on this work:
This existing action should serve as an example of how to do most of this.
related issue: #345