Add NVIDIA cuBLAS support #1044
Conversation
I would bring up CLBlast, as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but in my experience the speed-ups are minor, or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is simply better. The speed-up here with cuBLAS seems much more pronounced.
Great - and I guess ppl results are similar between non-cuBLAS and cuBLAS?
I haven't completed a full run yet, but with 7B q4_0 the perplexity of the first iterations is identical to OpenBLAS. It will probably be higher for f16xf32 mat muls, because instead of converting to f32xf32, I convert to f16xf16.
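For context, a minimal sketch (not the PR's actual code) of what an f16 x f16 matrix multiply through cuBLAS with an f32 result can look like; the function name, buffer names, and dimensions are placeholders, and it assumes column-major data already resident on the device:

// Hypothetical sketch: C (m x n) = A (m x k) * B (k x n), with f16 inputs
// and an f32 result, accumulated in f32. Error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_f16_f16_to_f32(cublasHandle_t handle,
                         const __half *d_A, const __half *d_B, float *d_C,
                         int m, int n, int k) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // Mixed-precision GEMM: the inputs stay in f16, so only half as much data
    // has to be converted and copied compared to promoting everything to f32.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,
                 d_B, CUDA_R_16F, k,
                 &beta,
                 d_C, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F,   // cuBLAS 11+ compute type
                 CUBLAS_GEMM_DEFAULT);
}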
Perplexity with 7B q4_0 is 6.2838
./perplexity -m models/7B/ggml-model-q4_0.bin -f wikitext-2-raw/wiki.test.raw -t 8
main: seed = 1681837585
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 59.11 KB
llama_model_load_internal: mem required = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0
llama_print_timings: load time = 11045.83 ms
Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?
This is the expected value
Yes
Tested successfully under Windows (built with cuBLAS enabled). Though I would appreciate a review on the CMake changes, I have no idea how any of that works.
hmm, cmake on ubuntu 20.04 ships 3.16 by default, but even the gh action runner uses 3.26
Is it possible to make the CMake version requirement depend on whether cuBLAS is enabled?
@@ -97,6 +97,10 @@ ifdef LLAMA_OPENBLAS
	CFLAGS  += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
	LDFLAGS += -lopenblas
endif
ifdef LLAMA_CUBLAS
	CFLAGS  += -DGGML_USE_CUBLAS -I/usr/local/cuda/include
	LDFLAGS += -lcublas_static -lculibos -lcudart_static -lcublasLt_static -lpthread -ldl -L/usr/local/cuda/lib64
pthread is added above, depending on the OS.
wait, do we actually ever link against pthread? why is it only a compile flag?
From what I understand, it is a dependency of CUDA, so it is required to build with cuBLAS.
That seems to work, updated.
yup, perfect
Very exciting. Can't wait to try it out 🤩
Just wondering, for all those who have tried: how much speedup do you get in the batched prompt eval timings vs OpenBLAS (not the perplexity calculations)? It would be good to benchmark against a fixed context size, say 1024 tokens.
@rabidcopy our newest CLBlast implementation does the dequantization on the GPU as well, which actually provides much better speeds, since a major bottleneck was transferring the data on and off the GPU after the mat mul. That's why I am curious how this might compare.
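For anyone unfamiliar with the approach, here is a rough sketch of what "dequantization on the GPU" can look like for ggml's q4_0 layout (32 weights per block: one f32 scale plus 16 bytes of packed nibbles), so only the small quantized blocks (20 bytes per 32 weights instead of 128) have to cross the bus before the mat mul. The kernel name and launch setup are illustrative, not koboldcpp's or this PR's actual code:

#include <stdint.h>

#define QK 32

typedef struct {
    float   d;            // block scale
    uint8_t qs[QK / 2];   // 4-bit quants, two per byte
} block_q4_0;

// One thread per q4_0 block: unpack the nibbles, recenter around zero, scale.
__global__ void dequantize_q4_0(const block_q4_0 *x, float *y, int nb) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nb) return;

    const float d = x[i].d;
    for (int j = 0; j < QK / 2; ++j) {
        const uint8_t b = x[i].qs[j];
        y[i * QK + 2 * j + 0] = ((int)(b & 0x0F) - 8) * d;
        y[i * QK + 2 * j + 1] = ((int)(b >> 4)   - 8) * d;
    }
}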
Found a comparison someone did between llama.cpp with cuBLAS and koboldcpp with CLBlast. Maybe it would be worth implementing CLBlast over here as well? (Sorry, I wasn't aware there were further improvements to CLBlast in koboldcpp since I last compared on my own hardware.)
@LostRuins I have a thread going in the discussions where people are trying out the Kobold CLBlast implementation. On my integrated Intel HD 530, CLBlast prompt ingestion was twice as slow as OpenBLAS, but someone with an Nvidia 3060 reported a 50% improvement on his end.
Here are benchmarks for my system. Note: this is with the non-quantized 13B 16-bit model.

With cuBLAS:
make clean && LLAMA_CUBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt
llama_print_timings: load time = 20691.75 ms
llama_print_timings: sample time = 16.89 ms / 50 runs ( 0.34 ms per run)
llama_print_timings: prompt eval time = 18748.63 ms / 373 tokens ( 50.26 ms per token)
llama_print_timings: eval time = 24565.83 ms / 49 runs ( 501.34 ms per run)
llama_print_timings: total time = 45275.08 ms

With OpenBLAS:
make clean && LLAMA_OPENBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt
llama_print_timings: load time = 43043.43 ms
llama_print_timings: sample time = 17.31 ms / 50 runs ( 0.35 ms per run)
llama_print_timings: prompt eval time = 27472.01 ms / 373 tokens ( 73.65 ms per token)
llama_print_timings: eval time = 24480.05 ms / 49 runs ( 499.59 ms per run)
llama_print_timings: total time = 67541.45 ms

So that's a ~48% total time speedup, super nice!
cc @ravenscroftj https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L107-L115 Will be available in the
oh that is awesome, thanks for the tag @ggerganov - will definitely be looking at adding this, as making suggestions much faster will make turbopilot much more usable!
CUDA_CHECK(cudaFree(d_X));
CUDA_CHECK(cudaFree(d_Y));
CUDA_CHECK(cudaFree(d_D));
#endif
Why not add cuda quantize row below as well?
It's not used in cuBLAS.
Yes, my bad, we do not need to quantize the output tensor nor the weight matrix.
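For readers following the diff, here is a rough sketch of the host-side flow that those frees close out, assuming d_X and d_Y hold the already converted f32 inputs and d_D receives the f32 result; since the result comes back as f32 and feeds the next ggml op directly, no quantize-row step is needed afterwards. This is an illustration of the general pattern, not the PR's exact code, and error checks are omitted.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// d = x (m x k) * y (k x n), all f32, via cuBLAS.
static void mul_mat_f32_cublas(cublasHandle_t handle,
                               const float *x, const float *y, float *d,
                               int m, int n, int k) {
    float *d_X, *d_Y, *d_D;
    const float alpha = 1.0f, beta = 0.0f;

    // Allocate device buffers and upload the inputs.
    cudaMalloc((void **) &d_X, sizeof(float) * m * k);
    cudaMalloc((void **) &d_Y, sizeof(float) * k * n);
    cudaMalloc((void **) &d_D, sizeof(float) * m * n);
    cudaMemcpy(d_X, x, sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Y, y, sizeof(float) * k * n, cudaMemcpyHostToDevice);

    // Single-precision mat mul on the GPU.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, d_X, m, d_Y, k, &beta, d_D, m);

    // Copy the f32 result back and release the buffers (as in the diff above).
    cudaMemcpy(d, d_D, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(d_X);
    cudaFree(d_Y);
    cudaFree(d_D);
}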
Adds support for NVIDIA cuBLAS for batched operations. On my system this is significantly faster than OpenBLAS.

Build with LLAMA_CUBLAS.

Perplexity seconds per pass (i9 9900k, RTX 3080 10GB):