Add NVIDIA cuBLAS support #1044

Merged
slaren merged 3 commits into ggerganov:master from the cublas branch on Apr 19, 2023

Conversation

@slaren (Collaborator) commented Apr 18, 2023

Adds support for NVIDIA cuBLAS for batched operations. On my system this is significantly faster than OpenBLAS.

Build with LLAMA_CUBLAS:

make clean && LLAMA_CUBLAS=1 make

Perplexity seconds per pass (i9 9900k, RTX 3080 10GB)

          7B q4_0   7B f16   7B f32
cuBLAS       8.92     5.24     7.70
OpenBLAS    22.64    24.85    18.18
No BLAS     26.39    30.35    54.33
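
For context, the cuBLAS path offloads these batched matrix multiplications with the usual copy-in / GEMM / copy-out pattern. Below is a minimal sketch of that pattern, not the PR's exact code: the device buffer names d_X, d_Y, d_D mirror the ones freed in the review snippet further down, while the function name and the m/n/k parameters are made up for illustration.

#ifdef GGML_USE_CUBLAS
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Minimal sketch (illustrative only): multiply an m x k matrix X by a
// k x n matrix Y on the GPU in single precision and copy the result back.
static void mul_mat_cublas_f32_sketch(cublasHandle_t handle,
                                      const float * x, const float * y, float * d,
                                      int m, int n, int k) {
    float * d_X, * d_Y, * d_D;
    const float alpha = 1.0f, beta = 0.0f;

    // allocate device buffers and copy the inputs over
    cudaMalloc((void **) &d_X, sizeof(float) * m * k);
    cudaMalloc((void **) &d_Y, sizeof(float) * k * n);
    cudaMalloc((void **) &d_D, sizeof(float) * m * n);
    cudaMemcpy(d_X, x, sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Y, y, sizeof(float) * k * n, cudaMemcpyHostToDevice);

    // single-precision GEMM on the GPU (column-major convention)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_X, m, d_Y, k,
                &beta,  d_D, m);

    // copy the result back and release the buffers
    cudaMemcpy(d, d_D, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(d_X);
    cudaFree(d_Y);
    cudaFree(d_D);
}
#endif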

@rabidcopy (Contributor)

I would bring up CLBlast as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but from my experience, speed-ups are minor or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is just better. The speed-up here with cuBLAS seems much more pronounced.

@ggerganov (Owner) left a comment

Great - and I guess ppl results are similar between non-cuBLAS and cuBLAS?

@slaren (Collaborator, Author) commented Apr 18, 2023

I haven't completed a full run yet, but with 7B q4_0, the perplexity of the first iterations is identical to OpenBLAS. It will probably be higher in f16xf32 because instead of converting to f32xf32, I convert to f16xf16.
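
For reference, a mixed-precision product like this typically goes through cublasGemmEx, which takes the data type of each operand separately. The fragment below is a rough sketch of an f16 x f16 multiply accumulated in f32 using the CUDA 11 API; it reuses the buffer and size names from the sketch above and is not necessarily the exact call this PR makes.

// Illustrative only: A and B supplied as f16, result and accumulation in f32.
// d_X and d_Y here would hold half-precision data converted beforehand.
const float alpha = 1.0f, beta = 0.0f;
cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
             m, n, k,
             &alpha,
             d_X, CUDA_R_16F, m,    // A: half precision
             d_Y, CUDA_R_16F, k,    // B: half precision
             &beta,
             d_D, CUDA_R_32F, m,    // C: single precision
             CUBLAS_COMPUTE_32F,
             CUBLAS_GEMM_DEFAULT);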

@slaren (Collaborator, Author) commented Apr 18, 2023

Perplexity with 7B q4_0 is 6.2838

./perplexity -m models/7B/ggml-model-q4_0.bin -f wikitext-2-raw/wiki.test.raw -t 8
main: seed = 1681837585
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
9.13 seconds per pass - ETA 1.66 hours
[1]4.3798,[2]4.9554,[3]5.8269,[4]6.4695,[5]6.5438,[6]6.5414,[7]6.7175,[8]6.8070,[9]7.1756,[10]7.4121,[11]7.6567,[12]7.6957,[13]7.6057,[14]7.6821,[15]7.9367,[16]7.5419,[17]7.4189,[18]7.3798,[19]7.0077,[20]6.9947,[21]6.8969,[22]6.7124,[23]6.6743,[24]6.5868,[25]6.5871,[26]6.4149,[27]6.2349,[28]6.1341,[29]6.0499,[30]5.8939,[31]5.8660,[32]5.8840,[33]5.8190,[34]5.8538,[35]5.8796,[36]5.9233,[37]5.9272,[38]5.9444,[39]5.9825,[40]6.0413,[41]6.0483,[42]6.0827,[43]6.0398,[44]6.0945,[45]6.0989,[46]6.0730,[47]6.0968,[48]6.0675,[49]6.0746,[50]6.0352,[51]6.0311,[52]6.0201,[53]6.0642,[54]6.0477,[55]6.0251,[56]6.0595,[57]6.0826,[58]6.1044,[59]6.1183,[60]6.1648,[61]6.1537,[62]6.2167,[63]6.2503,[64]6.2654,[65]6.3119,[66]6.3221,[67]6.3402,[68]6.3542,[69]6.3791,[70]6.4114,[71]6.4328,[72]6.4626,[73]6.5278,[74]6.5331,[75]6.5475,[76]6.5638,[77]6.5771,[78]6.5619,[79]6.5915,[80]6.5840,[81]6.5968,[82]6.6005,[83]6.5468,[84]6.5323,[85]6.5209,[86]6.4998,[87]6.4344,[88]6.4060,[89]6.3854,[90]6.3688,[91]6.3949,[92]6.3910,[93]6.3936,[94]6.3911,[95]6.4199,[96]6.4178,[97]6.4106,[98]6.4036,[99]6.3896,[100]6.3896,[101]6.4155,[102]6.4092,[103]6.4309,[104]6.4377,[105]6.4362,[106]6.4539,[107]6.4526,[108]6.4649,[109]6.4596,[110]6.4551,[111]6.4780,[112]6.4970,[113]6.4984,[114]6.4950,[115]6.5033,[116]6.4959,[117]6.5014,[118]6.5299,[119]6.5508,[120]6.5872,[121]6.6035,[122]6.6283,[123]6.6673,[124]6.6850,[125]6.6763,[126]6.7154,[127]6.7524,[128]6.7799,[129]6.7630,[130]6.7725,[131]6.7673,[132]6.7585,[133]6.7457,[134]6.7569,[135]6.7534,[136]6.7402,[137]6.7322,[138]6.7151,[139]6.7035,[140]6.7005,[141]6.6707,[142]6.6659,[143]6.6380,[144]6.6179,[145]6.6092,[146]6.5957,[147]6.6032,[148]6.6055,[149]6.5994,[150]6.5953,[151]6.5965,[152]6.5870,[153]6.5703,[154]6.5613,[155]6.5681,[156]6.5630,[157]6.5814,[158]6.5849,[159]6.5891,[160]6.5917,[161]6.6041,[162]6.5739,[163]6.5619,[164]6.5357,[165]6.5039,[166]6.4751,[167]6.4378,[168]6.4051,[169]6.3916,[170]6.3791,[171]6.3503,[172]6.3322,[173]6.3136,[174]6.2829,[175]6.2608,[176]6.2505,[177]6.2295,[178]6.2059,[179]6.1888,[180]6.1798,[181]6.1574,[182]6.1382,[183]6.1240,[184]6.1238,[185]6.1165,[186]6.1183,[187]6.1237,[188]6.1200,[189]6.1384,[190]6.1393,[191]6.1597,[192]6.1761,[193]6.1938,[194]6.2055,[195]6.2264,[196]6.2434,[197]6.2655,[198]6.2811,[199]6.2840,[200]6.2886,[201]6.2844,[202]6.3049,[203]6.3116,[204]6.3115,[205]6.3224,[206]6.3302,[207]6.3262,[208]6.3347,[209]6.3399,[210]6.3450,[211]6.3547,[212]6.3621,[213]6.3727,[214]6.3763,[215]6.3803,[216]6.3951,[217]6.4130,[218]6.4265,[219]6.4267,[220]6.4231,[221]6.4169,[222]6.4133,[223]6.4025,[224]6.3958,[225]6.3911,[226]6.4126,[227]6.4213,[228]6.4271,[229]6.4338,[230]6.4294,[231]6.4463,[232]6.4332,[233]6.4161,[234]6.4004,[235]6.3846,[236]6.3768,[237]6.3664,[238]6.3698,[239]6.3536,[240]6.3433,[241]6.3466,[242]6.3504,[243]6.3488,[244]6.3369,[245]6.3343,[246]6.3221,[247]6.3098,[248]6.3030,[249]6.3010,[250]6.3057,[251]6.2981,[252]6.2947,[253]6.2845,[254]6.2804,[255]6.2688,[256]6.2497,[257]6.2386,[258]6.2299,[259]6.2279,[260]6.2197,[261]6.2154,[262]6.2095,[263]6.2050,[264]6.1858,[265]6.1850,[266]6.1835,[267]6.1766,[268]6.1863,[269]6.1843,[270]6.1850,[271]6.1928,[272]6.1974,[273]6.1969,[274]6.1984,[275]6.2073,[276]6.2128,[277]6.2289,[278]6.2397,[279]6.2483,[280]6.2519,[281]6.2617,[282]6.2678,[283]6.2825,[284]6.2903,[285]6.2997,[286]6.3144,[287]6.3138,[288]6.3199,[289]6.3107,[290]6.2956,[291]6.2802,[292]6.2644,[293]6.2505,[294]6.2530,[295]6.2524,[296]6.2567,[297]6.2554,[298]6.2579,[299]6.2551,[300]6.2439,[301]6.2440,[302]6.2360,[303]6.2283,[304]6.2204,[305]6.2180,[30
6]6.2048,[307]6.2072,[308]6.2104,[309]6.1941,[310]6.1880,[311]6.1816,[312]6.1839,[313]6.1782,[314]6.1770,[315]6.1604,[316]6.1562,[317]6.1395,[318]6.1179,[319]6.1298,[320]6.1429,[321]6.1466,[322]6.1422,[323]6.1356,[324]6.1331,[325]6.1431,[326]6.1430,[327]6.1451,[328]6.1494,[329]6.1554,[330]6.1579,[331]6.1703,[332]6.1672,[333]6.1741,[334]6.1682,[335]6.1618,[336]6.1655,[337]6.1625,[338]6.1612,[339]6.1555,[340]6.1512,[341]6.1589,[342]6.1614,[343]6.1669,[344]6.1668,[345]6.1667,[346]6.1638,[347]6.1686,[348]6.1728,[349]6.1746,[350]6.1712,[351]6.1717,[352]6.1717,[353]6.1665,[354]6.1664,[355]6.1719,[356]6.1749,[357]6.1712,[358]6.1802,[359]6.1833,[360]6.1795,[361]6.1791,[362]6.1858,[363]6.1970,[364]6.2035,[365]6.2093,[366]6.2100,[367]6.2188,[368]6.2166,[369]6.2175,[370]6.2185,[371]6.2125,[372]6.2178,[373]6.2234,[374]6.2221,[375]6.2217,[376]6.2301,[377]6.2252,[378]6.2278,[379]6.2338,[380]6.2254,[381]6.2211,[382]6.2154,[383]6.2144,[384]6.2137,[385]6.2124,[386]6.2119,[387]6.2111,[388]6.2066,[389]6.2012,[390]6.1943,[391]6.1862,[392]6.1822,[393]6.1803,[394]6.1828,[395]6.1812,[396]6.1738,[397]6.1814,[398]6.1852,[399]6.1935,[400]6.1931,[401]6.1945,[402]6.1950,[403]6.1969,[404]6.2032,[405]6.1937,[406]6.1903,[407]6.1895,[408]6.1905,[409]6.2029,[410]6.2139,[411]6.2264,[412]6.2427,[413]6.2542,[414]6.2618,[415]6.2670,[416]6.2750,[417]6.2881,[418]6.2916,[419]6.2990,[420]6.3077,[421]6.3197,[422]6.3255,[423]6.3326,[424]6.3446,[425]6.3537,[426]6.3602,[427]6.3647,[428]6.3730,[429]6.3775,[430]6.3865,[431]6.4011,[432]6.4054,[433]6.4041,[434]6.3995,[435]6.4002,[436]6.4027,[437]6.4121,[438]6.4200,[439]6.4164,[440]6.4159,[441]6.4108,[442]6.4099,[443]6.4112,[444]6.4115,[445]6.4095,[446]6.4118,[447]6.4147,[448]6.4191,[449]6.4164,[450]6.4167,[451]6.4124,[452]6.4006,[453]6.3922,[454]6.3862,[455]6.3869,[456]6.3917,[457]6.3934,[458]6.3912,[459]6.3922,[460]6.4009,[461]6.3981,[462]6.3965,[463]6.4016,[464]6.4007,[465]6.3976,[466]6.3895,[467]6.3898,[468]6.3897,[469]6.3919,[470]6.3924,[471]6.3876,[472]6.3923,[473]6.3866,[474]6.3880,[475]6.3821,[476]6.3844,[477]6.3773,[478]6.3764,[479]6.3827,[480]6.3879,[481]6.3899,[482]6.3854,[483]6.3813,[484]6.3835,[485]6.3818,[486]6.3763,[487]6.3763,[488]6.3744,[489]6.3694,[490]6.3667,[491]6.3637,[492]6.3579,[493]6.3549,[494]6.3531,[495]6.3528,[496]6.3493,[497]6.3440,[498]6.3422,[499]6.3372,[500]6.3275,[501]6.3206,[502]6.3204,[503]6.3202,[504]6.3109,[505]6.3134,[506]6.3143,[507]6.3081,[508]6.3038,[509]6.3027,[510]6.3067,[511]6.3113,[512]6.3148,[513]6.3166,[514]6.3233,[515]6.3177,[516]6.3169,[517]6.3180,[518]6.3181,[519]6.3211,[520]6.3238,[521]6.3255,[522]6.3284,[523]6.3294,[524]6.3357,[525]6.3394,[526]6.3406,[527]6.3426,[528]6.3372,[529]6.3377,[530]6.3329,[531]6.3319,[532]6.3368,[533]6.3391,[534]6.3372,[535]6.3395,[536]6.3341,[537]6.3318,[538]6.3366,[539]6.3378,[540]6.3418,[541]6.3426,[542]6.3433,[543]6.3447,[544]6.3459,[545]6.3437,[546]6.3444,[547]6.3399,[548]6.3344,[549]6.3345,[550]6.3318,[551]6.3280,[552]6.3260,[553]6.3217,[554]6.3195,[555]6.3166,[556]6.3163,[557]6.3186,[558]6.3147,[559]6.3142,[560]6.3137,[561]6.3139,[562]6.3120,[563]6.3120,[564]6.3164,[565]6.3181,[566]6.3178,[567]6.3155,[568]6.3161,[569]6.3144,[570]6.3170,[571]6.3176,[572]6.3186,[573]6.3188,[574]6.3151,[575]6.3147,[576]6.3146,[577]6.3135,[578]6.3114,[579]6.3122,[580]6.3056,[581]6.3018,[582]6.3009,[583]6.3016,[584]6.3020,[585]6.2943,[586]6.2875,[587]6.2878,[588]6.2928,[589]6.2985,[590]6.3016,[591]6.3037,[592]6.3022,[593]6.2985,[594]6.2996,[595]6.2973,[596]6.3011,[597]6.2987,[598]6.2949,[599]6.2971,[600]6.2969,[601]6.2954,[602]6
.2972,[603]6.3001,[604]6.3012,[605]6.3044,[606]6.3065,[607]6.3048,[608]6.3013,[609]6.3019,[610]6.3056,[611]6.3038,[612]6.3063,[613]6.3026,[614]6.2975,[615]6.2898,[616]6.2928,[617]6.2865,[618]6.2814,[619]6.2757,[620]6.2615,[621]6.2543,[622]6.2525,[623]6.2540,[624]6.2545,[625]6.2544,[626]6.2529,[627]6.2550,[628]6.2555,[629]6.2553,[630]6.2587,[631]6.2650,[632]6.2704,[633]6.2687,[634]6.2721,[635]6.2726,[636]6.2694,[637]6.2659,[638]6.2686,[639]6.2657,[640]6.2667,[641]6.2669,[642]6.2738,[643]6.2760,[644]6.2772,[645]6.2751,[646]6.2793,[647]6.2755,[648]6.2762,[649]6.2763,[650]6.2801,[651]6.2858,[652]6.2865,[653]6.2908,[654]6.2844,[655]6.2838,

llama_print_timings: load time = 11045.83 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 5755570.69 ms / 335360 tokens ( 17.16 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 5793144.42 ms

@slaren (Collaborator, Author) commented Apr 18, 2023

Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?

@ggerganov (Owner)

> Perplexity with 7B q4_0 is 6.2838

This is the expected value

> Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?

Yes

@slaren (Collaborator, Author) commented Apr 18, 2023

Tested successfully under Windows. Build with cmake .. -DLLAMA_CUBLAS=ON. The CUDA Toolkit is available from https://developer.nvidia.com/cuda-downloads.

Though I would appreciate a review of the CMake changes; I have no idea how any of that works.

@slaren marked this pull request as ready for review April 18, 2023 21:40
@Green-Sky (Collaborator)

> Perplexity with 7B q4_0 is 6.2838
>
> This is the expected value
>
> Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?
>
> Yes

Hmm, CMake on Ubuntu 20.04 ships 3.16 by default, but even the GH Actions runner uses 3.26.

@ggerganov (Owner)

Is it possible to make the CMake version depend on LLAMA_CUBLAS?

@@ -97,6 +97,10 @@ ifdef LLAMA_OPENBLAS
CFLAGS += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
LDFLAGS += -lopenblas
endif
ifdef LLAMA_CUBLAS
CFLAGS += -DGGML_USE_CUBLAS -I/usr/local/cuda/include
LDFLAGS += -lcublas_static -lculibos -lcudart_static -lcublasLt_static -lpthread -ldl -L/usr/local/cuda/lib64

Review comment (Collaborator): pthread is added above depending on the OS.

Review comment (Collaborator): Wait, do we actually ever link against pthread? Why is it only a compile flag?

Review comment (Collaborator, Author): From what I understand it is a dependency of CUDA, so it is required to build with cuBLAS.

@Green-Sky (Collaborator)

> Is it possible to make the CMake version depend on LLAMA_CUBLAS?

The cmake_minimum_required() call looks like a function you could call anywhere. @slaren, can you try just calling it again with a higher version inside the conditional?

@slaren (Collaborator, Author) commented Apr 18, 2023

That seems to work, updated.

@Green-Sky (Collaborator)

> That seems to work, updated.

$ cmake .
CMake Error at CMakeLists.txt:147 (cmake_minimum_required):
  CMake 3.17 or higher is required.  You are running version 3.16.3

yup, perfect

@KyTiXo commented Apr 19, 2023

Very exciting. Can't wait to try it out 🤩

@slaren merged commit 8944a13 into ggerganov:master Apr 19, 2023
@slaren deleted the cublas branch April 19, 2023 09:22
@LostRuins (Collaborator) commented Apr 19, 2023

Just wondering, for all those who have tried: how much speedup do you get in the batched prompt eval timings vs OpenBLAS (not perplexity calculations)? It would be good to benchmark against a fixed context size, say 1024 tokens.

@LostRuins (Collaborator)

> I would bring up CLBlast as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but from my experience, speed-ups are minor or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is just better. The speed-up here with cuBLAS seems much more pronounced.

@rabidcopy our newest CLBlast implementation does the dequantization on the GPU as well, which provides much better speeds, since a major bottleneck was transferring the data on and off the GPU after the matmul. That's why I am curious about how this compares.

@rabidcopy (Contributor) commented Apr 19, 2023

> > I would bring up CLBlast as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but from my experience, speed-ups are minor or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is just better. The speed-up here with cuBLAS seems much more pronounced.
>
> @rabidcopy our newest CLBlast implementation does the dequantization on the GPU as well, which provides much better speeds, since a major bottleneck was transferring the data on and off the GPU after the matmul. That's why I am curious about how this compares.

Found a comparison someone did between llama.cpp with cuBLAS and koboldcpp with CLBlast. Maybe it would be worth implementing CLBlast over here as well? (Sorry, I wasn't aware there were further improvements to CLBlast in koboldcpp since I last compared on my own hardware.)

make clean && LLAMA_OPENBLAS=1 make -j && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt


llama_print_timings:        load time = 27152.17 ms
llama_print_timings:      sample time =    23.20 ms /    50 runs   (    0.46 ms per run)
llama_print_timings: prompt eval time = 25333.24 ms /   399 tokens (   63.49 ms per token)
llama_print_timings:        eval time = 10619.50 ms /    49 runs   (  216.72 ms per run)
llama_print_timings:       total time = 37795.51 ms
make clean && LLAMA_CUBLAS=1 make -j && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt


llama_print_timings:        load time = 12408.19 ms
llama_print_timings:      sample time =    22.31 ms /    50 runs   (    0.45 ms per run)
llama_print_timings: prompt eval time = 10300.15 ms /   399 tokens (   25.81 ms per token)
llama_print_timings:        eval time = 10533.55 ms /    49 runs   (  214.97 ms per run)
llama_print_timings:       total time = 22964.58 ms
make clean && LLAMA_CLBLAST=1 make -j main && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt


llama_print_timings:        load time = 13699.05 ms
llama_print_timings:      sample time =    22.91 ms /    50 runs   (    0.46 ms per run)
llama_print_timings: prompt eval time = 11899.14 ms /   399 tokens (   29.82 ms per token)
llama_print_timings:        eval time = 10496.48 ms /    49 runs   (  214.21 ms per run)
llama_print_timings:       total time = 24218.98 ms

@ghost commented Apr 19, 2023

@LostRuins I have a thread going in the discussions where people are trying out the Kobold CLBlast implementation. On my integrated Intel HD 530, CLBlast prompt ingestion was twice as slow as OpenBLAS, but someone with an Nvidia 3060 reported a 50% improvement on his end.

@Azeirah (Contributor) commented Apr 19, 2023

Here are benchmarks for my system

Note: This is with the non-quantized 13B-16bit model

  • CPU: Ryzen 7900X
  • GPU: GTX 1080 Ti
  • RAM: 64 GiB @ 5200

With cuBLAS

make clean && LLAMA_CUBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 20691.75 ms
llama_print_timings:      sample time =    16.89 ms /    50 runs   (    0.34 ms per run)
llama_print_timings: prompt eval time = 18748.63 ms /   373 tokens (   50.26 ms per token)
llama_print_timings:        eval time = 24565.83 ms /    49 runs   (  501.34 ms per run)
llama_print_timings:       total time = 45275.08 ms

With OpenBLAS

make clean && LLAMA_OPENBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 43043.43 ms
llama_print_timings:      sample time =    17.31 ms /    50 runs   (    0.35 ms per run)
llama_print_timings: prompt eval time = 27472.01 ms /   373 tokens (   73.65 ms per token)
llama_print_timings:        eval time = 24480.05 ms /    49 runs   (  499.59 ms per run)
llama_print_timings:       total time = 67541.45 ms

So that's a ~48% total time speedup, super nice!

@ggerganov (Owner) commented Apr 22, 2023

cc @ravenscroftj
Might be interested in adding cuBLAS support to turbopilot to speed up prompt processing. This change works with low-VRAM cards even for big models and is optionally enabled with the GGML_USE_CUBLAS compile flag:

https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L107-L115

Will be available in the ggml repo soon as well

@ravenscroftj

Oh, that is awesome, thanks for the tag @ggerganov. I will definitely be looking at adding this, as making suggestions much faster will make turbopilot much more usable!

CUDA_CHECK(cudaFree(d_X));
CUDA_CHECK(cudaFree(d_Y));
CUDA_CHECK(cudaFree(d_D));
#endif

Review comment (Contributor): Why not add the CUDA quantize row below as well?

Review comment (Collaborator, Author): It's not used in cuBLAS.

Review comment (Contributor): Yes, my bad, we do not need to quantize either the out tensor or the weight matrix.
