Support cuBLAS computation without requiring CUDA installed #350
llama.cpp copies cudart, cublas and cublasLt64 into the release package. See here: https://github.com/ggerganov/llama.cpp/blob/master/.github/workflows/build.yml#L497
What about splitting the packages into three kinds in the future?
@AsakusaRinne We should also think about LLaVA (or, more generally, any multimodal model). Right now this is not managed by the llama.cpp libraries but by an additional library. Until now there has been no GPU support involved, so it's simple to go with only one package or to integrate that DLL into the native libraries, but I suppose that will change in the near future.
@SignalRT I noticed that LLaVA requires another model named CLIP, and that llama.cpp doesn't provide C APIs for LLaVA, only an example. Do you have any idea about how to support it?
@AsakusaRinne Yes, I have a working version, but it is not yet integrated cleanly into the current Executor architecture.
Just a note on that: the current executors will probably be replaced at some point when we swap over to batch decoding. Just a heads up so you don't spend a tonne of effort on that!
If we're going to split it, I'd say there should be an option to install CPU-only inference without any CUDA backend, simply because the CUDA binaries are huge.
Hello again. What I want to say is that we can do the same thing and possibly avoid bloating LLamaSharp with another ~400 MB NuGet package of NVIDIA libraries. Instead, we could possibly:
I'm not sure whether specifying a path to load them will work. Generally, if these files are not added to the PATH environment variable, we should put them in the same folder as the library that depends on them. I'll give it a try :)
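For reference, a minimal sketch of what probing for the DLLs next to the application could look like in C# (the class and the exact file names are illustrative assumptions, not code from LLamaSharp):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

public static class CudaDependencyProbe
{
    // Illustrative CUDA 12 runtime DLL names; the real set is whatever the
    // llama.cpp cudart release archive ships (cudart, cublas, cublasLt).
    private static readonly string[] CudaDlls =
    {
        "cudart64_12.dll",
        "cublas64_12.dll",
        "cublasLt64_12.dll",
    };

    // Returns true if every CUDA runtime DLL can be loaded from the
    // application's base directory, i.e. without a CUDA Toolkit install.
    public static bool TryLoadFromAppDirectory()
    {
        foreach (var dll in CudaDlls)
        {
            var path = Path.Combine(AppContext.BaseDirectory, dll);
            if (!File.Exists(path) || !NativeLibrary.TryLoad(path, out _))
                return false;
        }
        return true;
    }
}
```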
It's a good idea. As you mentioned, the sizes of these files are very large, which I had overlooked before.
It might work, because -- as I mentioned above -- this is how the original llama.cpp and whisper.cpp work as well (and it works the same way even in the C# binding). Possibly we don't even need to specify a path; we just shouldn't interfere with how ggerganov's library (the bin-win-cublas-cu12.2.0-x64.zip version, for example) looks for the NVIDIA libraries nearby on launch. Another example:
Using CUDA while decoupling from the CUDA Toolkit as a hard dependency. Possible solution for SciSharp#350. This adds an alternative, fallback method of detecting the system-supported CUDA version, making a CUDA Toolkit installation optional. Technically, it uses the output of the command-line tool "nvidia-smi" (preinstalled with the NVIDIA drivers), which also reports the CUDA version supported on the system. I can confirm it works only on Windows, but I suppose a similar approach could be used for Linux and macOS as well; I didn't touch the code for those two platforms, though. After that, CUDA can be used simply by putting the NVIDIA libraries from the original llama.cpp repo's "bin-win-cublas-cu12.2.0-x64.zip" asset into the root folder of the built program, for example: "\LLama.Examples\bin\Debug\net8.0\".
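For illustration, a rough C# sketch of the nvidia-smi approach described above (class and method names are hypothetical, not the actual code from that PR):

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class NvidiaSmiCudaDetector
{
    // Runs "nvidia-smi" and extracts the "CUDA Version: X.Y" field printed in
    // its header. Returns null if the tool is missing or the output is unexpected.
    public static Version TryGetCudaVersion()
    {
        try
        {
            var psi = new ProcessStartInfo("nvidia-smi")
            {
                RedirectStandardOutput = true,
                UseShellExecute = false,
                CreateNoWindow = true,
            };

            using (var process = Process.Start(psi))
            {
                string output = process.StandardOutput.ReadToEnd();
                process.WaitForExit();

                var match = Regex.Match(output, @"CUDA Version:\s*(\d+)\.(\d+)");
                return match.Success
                    ? new Version(int.Parse(match.Groups[1].Value), int.Parse(match.Groups[2].Value))
                    : null;
            }
        }
        catch
        {
            // nvidia-smi not on PATH, or no NVIDIA driver installed.
            return null;
        }
    }
}
```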
I've started #361 with an update for the new llama.cpp binaries. I'm not sure I'll get time to investigate packaging up the cudart binaries myself, but if anyone wants to work on a PR to modify the
I'd be happy to help with this, but I have no experience at all with GitHub's CI/CD, so I'm not going to attempt that part myself and risk causing a fuss with potentially incorrect code. All I can say is that we just need to redistribute the same archives with CUDA libraries that ggerganov ships with each release of llama.cpp. Right now he publishes two archives: CUDA 11 binaries (cudart-llama-bin-win-cu11.7.1-x64.zip) and CUDA 12 binaries (cudart-llama-bin-win-cu12.2.0-x64.zip). Perhaps you can even see how it is implemented in his equivalent of the "Compile Binaries" action: the same archives with each release, without changes. The binaries in these archives are taken from the CUDA Toolkit installation's bin folder, it should be said. After that, we just need to add an instruction to readme.md regarding CUDA usage. Something like: "if you want to utilize the power of NVIDIA graphics cards, there are 2 ways:
Is it possible to redistribute both sets of runtimes in the nuget packages? That way there are no extra manual steps required.
That's not a problem; I'm happy to make the actual changes to the GitHub action. The main thing I need help with is knowing what changes to make! For example, you could try putting the cudart files into the runtimes/deps folder and modifying the targets file (which specifies what goes where at build time). If that then runs without the CUDA Toolkit installed, that will confirm everything works and we can distribute it. Once we know which files need to go where, I can modify the build action to put those files in the right place automatically.
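As a rough illustration of that idea, a .targets entry along these lines could copy the cudart files to the consuming application's output directory (the paths and DLL names here are assumptions based on a CUDA 12 layout, not the actual LLamaSharp targets file):

```xml
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <!-- Copy the CUDA runtime binaries next to the built application so they can
       be found at load time without a CUDA Toolkit installation. -->
  <ItemGroup Condition="'$(OS)' == 'Windows_NT'">
    <None Include="$(MSBuildThisFileDirectory)..\runtimes\deps\cu12.2.0\cudart64_12.dll">
      <Link>cudart64_12.dll</Link>
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <Visible>false</Visible>
    </None>
    <None Include="$(MSBuildThisFileDirectory)..\runtimes\deps\cu12.2.0\cublas64_12.dll">
      <Link>cublas64_12.dll</Link>
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <Visible>false</Visible>
    </None>
    <None Include="$(MSBuildThisFileDirectory)..\runtimes\deps\cu12.2.0\cublasLt64_12.dll">
      <Link>cublasLt64_12.dll</Link>
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <Visible>false</Visible>
    </None>
  </ItemGroup>
</Project>
```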
Related to: SciSharp#350. The NVIDIA CUDA binaries are taken from two archives, CUDA 11 (cudart-llama-bin-win-cu11.7.1-x64.zip) and CUDA 12 (cudart-llama-bin-win-cu12.2.0-x64.zip), from the latest (at the time of writing) build of ggerganov's [llama.cpp](https://github.com/ggerganov/llama.cpp/releases/tag/b1643). Editing the .nuspec at this point is debatable, however.
I guess: yes.
Thanks for looking into that.
Over in #371 Onkitova investigated using cudart, which seems to work. However, the files are huge, so we don't want to include them in this repo or in our CUDA NuGet packages. Instead, we decided on releasing separate NuGet packages with the cudart binaries in them, which LLamaSharp can then depend on:
Someone will need to take on this work:
This existing action should serve as an example of how to do most of this.
related issue: #345