Variable bit rate quantization #1256

ikawrakow · 2023-04-30T16:46:25Z

Variable bit rate is commonly used in audio and video compression, so why not try on LLMs?

My guess is that a locally adaptive variable bit rate would require a major change to ggml. So, then, the least one can try is to see if using different number of bits in the different network layers would be beneficial.

As a first step, I simply changed llama.cpp to not quantize one of the tensor types in addition to output.weight (which is already known to have a significant impact on generation quality) and calculated perplexity for Q2_4 quantization (see issue #1240). Picked 2-bit quantization because there the difference between a quantized and not quantized tensor will be largest, so it would be easiest to see the effect. The following table summarizes the results (PPL improvement is perplexity with fp16 output.weight - perplexity with fp16 output weight + indicated tensor, table is sorted in decreasing order of impact)

Tensor type	PPL improvement
feed_forward.w2	1.0832
attention.wv	0.7819
feed_forward.w3	0.5658
feed_forward.w1	0.3917
attention.wo	0.3902
attention.wq	0.1250
attention.wk	0.1090
tok_embeddings	< 0.05

Interesting to note that the tok_embeddings tensor, which has been considered worthy of a dedicated quantization type LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16 where it is kept as fp16, has basically no influence on generation quality even when quantized with 2 bits.

Based on these findings, I ran 2- and 3-bit perplexity calculations where the top 2 tensors feed_forward.w2 and attention.wv are quantized using Q5_1 instead of Q2_4 or Q3_4. Here are the results:

Model	Quantization	file size	Perplexity
7B	Q2_4 + Q5_1	3.16G	6.8160
7B	Q3_4 + Q5_1	3.7G	6.0996
13B	Q2_4 + Q5_1	6.1G	5.7880
13B	Q3_4 + Q5_1	7.1G	5.3715

Interesting to note that the mixed Q3_4 + Q5_1 quantization has a lower perplexity than any 4-bit quantization listed on the main page for the 7B model despite the smaller quantized model size.

I have not explored only using Q5_1 for a subset of the feed_forward.w2 and attention.wv tensors. The quantization rmse for these two tensor types increases with layer depth, so perhaps it would be sufficient to use a higher bit rate for only the last few layers, thus reducing quantized model size compared to what is given in the above table.

Here are the complete runs for the above table. There is no new quantization type, just a quick hack where I added

            if (tensor.name == "output.weight" ||
                tensor.name.find("attention.wv.weight") != std::string::npos ||
                tensor.name.find("feed_forward.w2.weight") != std::string::npos) {
                new_type = GGML_TYPE_Q5_1;
            }

just after this line in llama.cpp.

7B, Q2_4 + Q5_1

main: seed = 1682863345 llama.cpp: loading model from ../build/junk.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 15 (mostly Q2_4) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 59.11 KB llama_model_load_internal: mem required = 5026.65 MB (+ 1026.00 MB per state) llama_init_from_file: kv self size = 256.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
1.61 seconds per pass - ETA 17 minutes
[1]4.8689,[2]5.5380,[3]6.3512,[4]7.0241,[5]7.0748,[6]7.0472,[7]7.2764,[8]7.3672,[9]7.7450,[10]8.0356,[11]8.2671,[12]8.3551,[13]8.2873,[14]8.4518,[15]8.7368,[16]8.2773,[17]8.1054,[18]8.0547,[19]7.6308,[20]7.5978,[21]7.4992,[22]7.3066,[23]7.2867,[24]7.1873,[25]7.1983,[26]7.0220,[27]6.8122,[28]6.7195,[29]6.6220,[30]6.4602,[31]6.4277,[32]6.4450,[33]6.3845,[34]6.4217,[35]6.4431,[36]6.4962,[37]6.4978,[38]6.5093,[39]6.5542,[40]6.6065,[41]6.6159,[42]6.6562,[43]6.6069,[44]6.6684,[45]6.6746,[46]6.6483,[47]6.6718,[48]6.6386,[49]6.6381,[50]6.5934,[51]6.5892,[52]6.5701,[53]6.6233,[54]6.6083,[55]6.5750,[56]6.6126,[57]6.6321,[58]6.6554,[59]6.6693,[60]6.7166,[61]6.7055,[62]6.7665,[63]6.8012,[64]6.8124,[65]6.8597,[66]6.8646,[67]6.8819,[68]6.8975,[69]6.9258,[70]6.9585,[71]6.9829,[72]7.0148,[73]7.0743,[74]7.0740,[75]7.0859,[76]7.0966,[77]7.1070,[78]7.0913,[79]7.1232,[80]7.1149,[81]7.1345,[82]7.1442,[83]7.0841,[84]7.0665,[85]7.0551,[86]7.0270,[87]6.9683,[88]6.9379,[89]6.9188,[90]6.9039,[91]6.9317,[92]6.9235,[93]6.9269,[94]6.9251,[95]6.9596,[96]6.9609,[97]6.9612,[98]6.9529,[99]6.9338,[100]6.9302,[101]6.9593,[102]6.9523,[103]6.9737,[104]6.9851,[105]6.9858,[106]7.0021,[107]6.9985,[108]7.0140,[109]7.0090,[110]7.0046,[111]7.0250,[112]7.0456,[113]7.0524,[114]7.0490,[115]7.0560,[116]7.0459,[117]7.0517,[118]7.0811,[119]7.1028,[120]7.1413,[121]7.1622,[122]7.1874,[123]7.2290,[124]7.2475,[125]7.2360,[126]7.2746,[127]7.3097,[128]7.3423,[129]7.3207,[130]7.3302,[131]7.3224,[132]7.3129,[133]7.3003,[134]7.3145,[135]7.3089,[136]7.2968,[137]7.2890,[138]7.2756,[139]7.2651,[140]7.2619,[141]7.2379,[142]7.2336,[143]7.2057,[144]7.1847,[145]7.1773,[146]7.1654,[147]7.1718,[148]7.1704,[149]7.1673,[150]7.1636,[151]7.1664,[152]7.1540,[153]7.1358,[154]7.1244,[155]7.1304,[156]7.1259,[157]7.1421,[158]7.1439,[159]7.1515,[160]7.1542,[161]7.1698,[162]7.1374,[163]7.1246,[164]7.0971,[165]7.0609,[166]7.0288,[167]6.9864,[168]6.9518,[169]6.9381,[170]6.9242,[171]6.8931,[172]6.8715,[173]6.8532,[174]6.8210,[175]6.7957,[176]6.7799,[177]6.7599,[178]6.7353,[179]6.7163,[180]6.7042,[181]6.6793,[182]6.6581,[183]6.6411,[184]6.6389,[185]6.6304,[186]6.6321,[187]6.6376,[188]6.6348,[189]6.6532,[190]6.6555,[191]6.6786,[192]6.6938,[193]6.7137,[194]6.7257,[195]6.7503,[196]6.7668,[197]6.7887,[198]6.8067,[199]6.8101,[200]6.8141,[201]6.8096,[202]6.8342,[203]6.8424,[204]6.8470,[205]6.8581,[206]6.8657,[207]6.8619,[208]6.8726,[209]6.8780,[210]6.8816,[211]6.8934,[212]6.9011,[213]6.9121,[214]6.9185,[215]6.9231,[216]6.9367,[217]6.9563,[218]6.9708,[219]6.9702,[220]6.9650,[221]6.9601,[222]6.9569,[223]6.9457,[224]6.9389,[225]6.9346,[226]6.9554,[227]6.9661,[228]6.9727,[229]6.9772,[230]6.9739,[231]6.9931,[232]6.9831,[233]6.9635,[234]6.9467,[235]6.9311,[236]6.9226,[237]6.9113,[238]6.9156,[239]6.8994,[240]6.8880,[241]6.8921,[242]6.8965,[243]6.8932,[244]6.8800,[245]6.8776,[246]6.8658,[247]6.8522,[248]6.8445,[249]6.8409,[250]6.8471,[251]6.8399,[252]6.8348,[253]6.8244,[254]6.8182,[255]6.8047,[256]6.7833,[257]6.7706,[258]6.7620,[259]6.7591,[260]6.7503,[261]6.7464,[262]6.7397,[263]6.7344,[264]6.7181,[265]6.7174,[266]6.7143,[267]6.7061,[268]6.7162,[269]6.7142,[270]6.7137,[271]6.7213,[272]6.7260,[273]6.7251,[274]6.7284,[275]6.7389,[276]6.7449,[277]6.7640,[278]6.7748,[279]6.7843,[280]6.7867,[281]6.7962,[282]6.8019,[283]6.8184,[284]6.8264,[285]6.8361,[286]6.8503,[287]6.8506,[288]6.8577,[289]6.8476,[290]6.8306,[291]6.8143,[292]6.7983,[293]6.7837,[294]6.7858,[295]6.7836,[296]6.7889,[297]6.7877,[298]6.7916,[299]6.7881,[300]6.7765,[301]6.7751,[302]6.7663,[303]6.7569,[304]6.7478,[305]6.7439,[306]6.7292,[307]6.7309,[308]6.7356,[309]6.7174,[310]6.7112,[311]6.7044,[312]6.7065,[313]6.6999,[314]6.6979,[315]6.6804,[316]6.6768,[317]6.6583,[318]6.6363,[319]6.6513,[320]6.6646,[321]6.6703,[322]6.6661,[323]6.6598,[324]6.6586,[325]6.6711,[326]6.6704,[327]6.6734,[328]6.6757,[329]6.6834,[330]6.6872,[331]6.7003,[332]6.6962,[333]6.7041,[334]6.6979,[335]6.6901,[336]6.6922,[337]6.6888,[338]6.6896,[339]6.6829,[340]6.6784,[341]6.6871,[342]6.6901,[343]6.6962,[344]6.6964,[345]6.6953,[346]6.6908,[347]6.6946,[348]6.6986,[349]6.6998,[350]6.6957,[351]6.6962,[352]6.6968,[353]6.6902,[354]6.6924,[355]6.6989,[356]6.7021,[357]6.6984,[358]6.7090,[359]6.7119,[360]6.7066,[361]6.7052,[362]6.7144,[363]6.7251,[364]6.7318,[365]6.7372,[366]6.7380,[367]6.7469,[368]6.7429,[369]6.7430,[370]6.7444,[371]6.7381,[372]6.7424,[373]6.7475,[374]6.7446,[375]6.7432,[376]6.7509,[377]6.7460,[378]6.7483,[379]6.7546,[380]6.7455,[381]6.7411,[382]6.7370,[383]6.7360,[384]6.7360,[385]6.7342,[386]6.7340,[387]6.7332,[388]6.7288,[389]6.7223,[390]6.7152,[391]6.7065,[392]6.7026,[393]6.7020,[394]6.7049,[395]6.7036,[396]6.6953,[397]6.7020,[398]6.7058,[399]6.7139,[400]6.7138,[401]6.7148,[402]6.7157,[403]6.7171,[404]6.7237,[405]6.7152,[406]6.7121,[407]6.7122,[408]6.7134,[409]6.7250,[410]6.7371,[411]6.7497,[412]6.7668,[413]6.7790,[414]6.7873,[415]6.7931,[416]6.8026,[417]6.8166,[418]6.8216,[419]6.8294,[420]6.8394,[421]6.8512,[422]6.8556,[423]6.8647,[424]6.8765,[425]6.8857,[426]6.8922,[427]6.8970,[428]6.9061,[429]6.9121,[430]6.9217,[431]6.9381,[432]6.9418,[433]6.9406,[434]6.9354,[435]6.9354,[436]6.9371,[437]6.9470,[438]6.9553,[439]6.9516,[440]6.9503,[441]6.9448,[442]6.9418,[443]6.9432,[444]6.9441,[445]6.9416,[446]6.9440,[447]6.9471,[448]6.9513,[449]6.9487,[450]6.9482,[451]6.9434,[452]6.9351,[453]6.9262,[454]6.9219,[455]6.9226,[456]6.9276,[457]6.9302,[458]6.9276,[459]6.9285,[460]6.9373,[461]6.9347,[462]6.9338,[463]6.9396,[464]6.9390,[465]6.9360,[466]6.9289,[467]6.9300,[468]6.9310,[469]6.9335,[470]6.9342,[471]6.9290,[472]6.9336,[473]6.9272,[474]6.9292,[475]6.9246,[476]6.9274,[477]6.9200,[478]6.9199,[479]6.9293,[480]6.9351,[481]6.9373,[482]6.9325,[483]6.9273,[484]6.9295,[485]6.9287,[486]6.9227,[487]6.9233,[488]6.9219,[489]6.9161,[490]6.9132,[491]6.9109,[492]6.9048,[493]6.9019,[494]6.9008,[495]6.9013,[496]6.8970,[497]6.8923,[498]6.8908,[499]6.8846,[500]6.8749,[501]6.8686,[502]6.8689,[503]6.8673,[504]6.8577,[505]6.8599,[506]6.8605,[507]6.8559,[508]6.8514,[509]6.8506,[510]6.8548,[511]6.8601,[512]6.8632,[513]6.8650,[514]6.8722,[515]6.8664,[516]6.8648,[517]6.8661,[518]6.8655,[519]6.8687,[520]6.8713,[521]6.8724,[522]6.8757,[523]6.8762,[524]6.8826,[525]6.8860,[526]6.8879,[527]6.8899,[528]6.8851,[529]6.8861,[530]6.8799,[531]6.8775,[532]6.8826,[533]6.8846,[534]6.8821,[535]6.8854,[536]6.8789,[537]6.8761,[538]6.8816,[539]6.8823,[540]6.8880,[541]6.8893,[542]6.8899,[543]6.8916,[544]6.8923,[545]6.8906,[546]6.8919,[547]6.8868,[548]6.8803,[549]6.8806,[550]6.8770,[551]6.8732,[552]6.8708,[553]6.8665,[554]6.8638,[555]6.8593,[556]6.8589,[557]6.8627,[558]6.8587,[559]6.8583,[560]6.8578,[561]6.8580,[562]6.8555,[563]6.8559,[564]6.8605,[565]6.8629,[566]6.8624,[567]6.8599,[568]6.8593,[569]6.8572,[570]6.8601,[571]6.8603,[572]6.8609,[573]6.8598,[574]6.8565,[575]6.8570,[576]6.8572,[577]6.8550,[578]6.8537,[579]6.8547,[580]6.8475,[581]6.8430,[582]6.8412,[583]6.8415,[584]6.8412,[585]6.8333,[586]6.8264,[587]6.8268,[588]6.8321,[589]6.8388,[590]6.8420,[591]6.8438,[592]6.8424,[593]6.8377,[594]6.8379,[595]6.8352,[596]6.8387,[597]6.8353,[598]6.8323,[599]6.8340,[600]6.8332,[601]6.8313,[602]6.8341,[603]6.8370,[604]6.8381,[605]6.8409,[606]6.8429,[607]6.8417,[608]6.8373,[609]6.8375,[610]6.8410,[611]6.8390,[612]6.8421,[613]6.8380,[614]6.8332,[615]6.8241,[616]6.8281,[617]6.8211,[618]6.8153,[619]6.8084,[620]6.7923,[621]6.7843,[622]6.7822,[623]6.7841,[624]6.7839,[625]6.7838,[626]6.7822,[627]6.7851,[628]6.7848,[629]6.7844,[630]6.7879,[631]6.7937,[632]6.7994,[633]6.7975,[634]6.8018,[635]6.8029,[636]6.8007,[637]6.7977,[638]6.8010,[639]6.7979,[640]6.7991,[641]6.7990,[642]6.8064,[643]6.8086,[644]6.8094,[645]6.8069,[646]6.8121,[647]6.8090,[648]6.8099,[649]6.8094,[650]6.8144,[651]6.8197,[652]6.8211,[653]6.8252,[654]6.8175,[655]6.8160,

llama_print_timings: load time = 2658.59 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 976009.34 ms / 335360 tokens ( 2.91 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 1004906.69 ms

7B, Q3_4 + Q5_1

main: seed = 1682864367 llama.cpp: loading model from ../build/junk1.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 10 (mostly Q3_4) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 59.11 KB llama_model_load_internal: mem required = 5578.28 MB (+ 1026.00 MB per state) llama_init_from_file: kv self size = 256.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
1.69 seconds per pass - ETA 18 minutes
[1]4.3772,[2]4.7902,[3]5.6644,[4]6.2926,[5]6.4094,[6]6.3682,[7]6.5670,[8]6.6689,[9]7.0093,[10]7.2537,[11]7.4667,[12]7.5133,[13]7.4469,[14]7.5018,[15]7.7581,[16]7.3727,[17]7.2507,[18]7.1917,[19]6.8302,[20]6.8186,[21]6.7251,[22]6.5571,[23]6.5252,[24]6.4336,[25]6.4301,[26]6.2650,[27]6.0943,[28]5.9983,[29]5.9072,[30]5.7545,[31]5.7284,[32]5.7503,[33]5.6977,[34]5.7261,[35]5.7495,[36]5.7812,[37]5.7814,[38]5.7930,[39]5.8255,[40]5.8782,[41]5.8900,[42]5.9274,[43]5.8861,[44]5.9426,[45]5.9455,[46]5.9212,[47]5.9424,[48]5.9169,[49]5.9180,[50]5.8766,[51]5.8735,[52]5.8628,[53]5.9067,[54]5.8907,[55]5.8671,[56]5.8970,[57]5.9166,[58]5.9371,[59]5.9534,[60]5.9951,[61]5.9885,[62]6.0481,[63]6.0787,[64]6.0918,[65]6.1345,[66]6.1431,[67]6.1623,[68]6.1757,[69]6.1978,[70]6.2297,[71]6.2521,[72]6.2830,[73]6.3437,[74]6.3470,[75]6.3602,[76]6.3750,[77]6.3875,[78]6.3715,[79]6.4006,[80]6.3955,[81]6.4061,[82]6.4121,[83]6.3619,[84]6.3428,[85]6.3290,[86]6.3068,[87]6.2470,[88]6.2224,[89]6.2023,[90]6.1887,[91]6.2120,[92]6.2071,[93]6.2058,[94]6.2024,[95]6.2297,[96]6.2300,[97]6.2245,[98]6.2201,[99]6.2056,[100]6.2044,[101]6.2303,[102]6.2271,[103]6.2465,[104]6.2540,[105]6.2524,[106]6.2693,[107]6.2680,[108]6.2817,[109]6.2775,[110]6.2736,[111]6.2939,[112]6.3155,[113]6.3180,[114]6.3130,[115]6.3189,[116]6.3090,[117]6.3133,[118]6.3412,[119]6.3632,[120]6.3973,[121]6.4124,[122]6.4367,[123]6.4732,[124]6.4919,[125]6.4823,[126]6.5207,[127]6.5572,[128]6.5871,[129]6.5713,[130]6.5796,[131]6.5760,[132]6.5691,[133]6.5555,[134]6.5644,[135]6.5596,[136]6.5491,[137]6.5408,[138]6.5234,[139]6.5126,[140]6.5101,[141]6.4830,[142]6.4794,[143]6.4498,[144]6.4283,[145]6.4197,[146]6.4070,[147]6.4104,[148]6.4101,[149]6.4047,[150]6.4007,[151]6.4037,[152]6.3940,[153]6.3789,[154]6.3707,[155]6.3774,[156]6.3727,[157]6.3896,[158]6.3943,[159]6.3981,[160]6.4011,[161]6.4143,[162]6.3863,[163]6.3744,[164]6.3508,[165]6.3203,[166]6.2935,[167]6.2558,[168]6.2253,[169]6.2117,[170]6.2011,[171]6.1750,[172]6.1585,[173]6.1421,[174]6.1119,[175]6.0898,[176]6.0779,[177]6.0581,[178]6.0349,[179]6.0179,[180]6.0087,[181]5.9869,[182]5.9695,[183]5.9546,[184]5.9524,[185]5.9448,[186]5.9454,[187]5.9516,[188]5.9485,[189]5.9654,[190]5.9665,[191]5.9878,[192]6.0035,[193]6.0195,[194]6.0306,[195]6.0520,[196]6.0674,[197]6.0883,[198]6.1032,[199]6.1060,[200]6.1102,[201]6.1052,[202]6.1251,[203]6.1329,[204]6.1330,[205]6.1432,[206]6.1498,[207]6.1461,[208]6.1546,[209]6.1593,[210]6.1638,[211]6.1749,[212]6.1826,[213]6.1933,[214]6.1958,[215]6.1980,[216]6.2120,[217]6.2295,[218]6.2431,[219]6.2435,[220]6.2400,[221]6.2343,[222]6.2324,[223]6.2227,[224]6.2156,[225]6.2119,[226]6.2319,[227]6.2393,[228]6.2448,[229]6.2501,[230]6.2464,[231]6.2637,[232]6.2520,[233]6.2350,[234]6.2204,[235]6.2017,[236]6.1944,[237]6.1850,[238]6.1878,[239]6.1730,[240]6.1623,[241]6.1645,[242]6.1681,[243]6.1662,[244]6.1551,[245]6.1519,[246]6.1415,[247]6.1301,[248]6.1230,[249]6.1208,[250]6.1253,[251]6.1191,[252]6.1150,[253]6.1047,[254]6.0999,[255]6.0884,[256]6.0706,[257]6.0587,[258]6.0507,[259]6.0489,[260]6.0406,[261]6.0367,[262]6.0311,[263]6.0259,[264]6.0068,[265]6.0064,[266]6.0045,[267]5.9985,[268]6.0072,[269]6.0053,[270]6.0054,[271]6.0129,[272]6.0167,[273]6.0167,[274]6.0187,[275]6.0269,[276]6.0326,[277]6.0485,[278]6.0587,[279]6.0675,[280]6.0697,[281]6.0794,[282]6.0850,[283]6.0996,[284]6.1077,[285]6.1164,[286]6.1293,[287]6.1286,[288]6.1348,[289]6.1263,[290]6.1109,[291]6.0960,[292]6.0814,[293]6.0681,[294]6.0702,[295]6.0689,[296]6.0739,[297]6.0731,[298]6.0757,[299]6.0734,[300]6.0626,[301]6.0623,[302]6.0550,[303]6.0461,[304]6.0376,[305]6.0340,[306]6.0213,[307]6.0235,[308]6.0259,[309]6.0104,[310]6.0047,[311]5.9983,[312]6.0006,[313]5.9954,[314]5.9937,[315]5.9781,[316]5.9735,[317]5.9570,[318]5.9369,[319]5.9493,[320]5.9613,[321]5.9657,[322]5.9617,[323]5.9553,[324]5.9524,[325]5.9627,[326]5.9630,[327]5.9652,[328]5.9687,[329]5.9742,[330]5.9772,[331]5.9890,[332]5.9866,[333]5.9933,[334]5.9878,[335]5.9821,[336]5.9857,[337]5.9832,[338]5.9830,[339]5.9774,[340]5.9734,[341]5.9818,[342]5.9841,[343]5.9892,[344]5.9893,[345]5.9891,[346]5.9866,[347]5.9903,[348]5.9941,[349]5.9966,[350]5.9933,[351]5.9943,[352]5.9944,[353]5.9882,[354]5.9888,[355]5.9939,[356]5.9971,[357]5.9933,[358]6.0026,[359]6.0051,[360]6.0016,[361]6.0011,[362]6.0076,[363]6.0186,[364]6.0244,[365]6.0288,[366]6.0300,[367]6.0383,[368]6.0357,[369]6.0366,[370]6.0381,[371]6.0325,[372]6.0374,[373]6.0423,[374]6.0407,[375]6.0407,[376]6.0479,[377]6.0431,[378]6.0457,[379]6.0517,[380]6.0436,[381]6.0401,[382]6.0350,[383]6.0341,[384]6.0336,[385]6.0324,[386]6.0319,[387]6.0321,[388]6.0285,[389]6.0233,[390]6.0163,[391]6.0085,[392]6.0043,[393]6.0027,[394]6.0054,[395]6.0045,[396]5.9971,[397]6.0038,[398]6.0070,[399]6.0144,[400]6.0143,[401]6.0162,[402]6.0175,[403]6.0195,[404]6.0260,[405]6.0171,[406]6.0138,[407]6.0137,[408]6.0151,[409]6.0260,[410]6.0373,[411]6.0488,[412]6.0648,[413]6.0758,[414]6.0837,[415]6.0893,[416]6.0976,[417]6.1097,[418]6.1132,[419]6.1196,[420]6.1286,[421]6.1404,[422]6.1437,[423]6.1507,[424]6.1612,[425]6.1700,[426]6.1762,[427]6.1805,[428]6.1889,[429]6.1939,[430]6.2021,[431]6.2158,[432]6.2192,[433]6.2186,[434]6.2143,[435]6.2147,[436]6.2173,[437]6.2269,[438]6.2344,[439]6.2311,[440]6.2303,[441]6.2254,[442]6.2239,[443]6.2248,[444]6.2251,[445]6.2235,[446]6.2256,[447]6.2287,[448]6.2331,[449]6.2309,[450]6.2317,[451]6.2277,[452]6.2160,[453]6.2077,[454]6.2022,[455]6.2032,[456]6.2077,[457]6.2097,[458]6.2074,[459]6.2078,[460]6.2164,[461]6.2138,[462]6.2124,[463]6.2159,[464]6.2147,[465]6.2119,[466]6.2042,[467]6.2045,[468]6.2042,[469]6.2063,[470]6.2065,[471]6.2018,[472]6.2062,[473]6.2008,[474]6.2022,[475]6.1968,[476]6.1979,[477]6.1909,[478]6.1900,[479]6.1963,[480]6.2009,[481]6.2029,[482]6.1985,[483]6.1947,[484]6.1965,[485]6.1950,[486]6.1894,[487]6.1893,[488]6.1870,[489]6.1822,[490]6.1798,[491]6.1769,[492]6.1713,[493]6.1685,[494]6.1666,[495]6.1657,[496]6.1621,[497]6.1566,[498]6.1552,[499]6.1510,[500]6.1418,[501]6.1353,[502]6.1357,[503]6.1348,[504]6.1260,[505]6.1281,[506]6.1291,[507]6.1237,[508]6.1196,[509]6.1191,[510]6.1228,[511]6.1273,[512]6.1307,[513]6.1326,[514]6.1387,[515]6.1333,[516]6.1324,[517]6.1335,[518]6.1329,[519]6.1359,[520]6.1383,[521]6.1395,[522]6.1422,[523]6.1429,[524]6.1484,[525]6.1514,[526]6.1522,[527]6.1539,[528]6.1487,[529]6.1497,[530]6.1448,[531]6.1435,[532]6.1480,[533]6.1501,[534]6.1479,[535]6.1502,[536]6.1447,[537]6.1427,[538]6.1480,[539]6.1493,[540]6.1532,[541]6.1536,[542]6.1547,[543]6.1562,[544]6.1572,[545]6.1552,[546]6.1558,[547]6.1516,[548]6.1467,[549]6.1467,[550]6.1435,[551]6.1400,[552]6.1379,[553]6.1341,[554]6.1320,[555]6.1288,[556]6.1284,[557]6.1311,[558]6.1276,[559]6.1275,[560]6.1274,[561]6.1276,[562]6.1254,[563]6.1250,[564]6.1294,[565]6.1314,[566]6.1311,[567]6.1289,[568]6.1295,[569]6.1282,[570]6.1314,[571]6.1318,[572]6.1329,[573]6.1328,[574]6.1292,[575]6.1285,[576]6.1285,[577]6.1270,[578]6.1253,[579]6.1258,[580]6.1193,[581]6.1157,[582]6.1148,[583]6.1157,[584]6.1159,[585]6.1085,[586]6.1019,[587]6.1027,[588]6.1075,[589]6.1127,[590]6.1157,[591]6.1177,[592]6.1163,[593]6.1129,[594]6.1139,[595]6.1116,[596]6.1149,[597]6.1126,[598]6.1096,[599]6.1119,[600]6.1112,[601]6.1097,[602]6.1113,[603]6.1138,[604]6.1148,[605]6.1181,[606]6.1204,[607]6.1190,[608]6.1156,[609]6.1163,[610]6.1198,[611]6.1182,[612]6.1210,[613]6.1173,[614]6.1121,[615]6.1047,[616]6.1072,[617]6.1011,[618]6.0962,[619]6.0906,[620]6.0769,[621]6.0702,[622]6.0685,[623]6.0700,[624]6.0706,[625]6.0708,[626]6.0697,[627]6.0722,[628]6.0721,[629]6.0717,[630]6.0749,[631]6.0804,[632]6.0860,[633]6.0846,[634]6.0879,[635]6.0883,[636]6.0857,[637]6.0821,[638]6.0845,[639]6.0814,[640]6.0823,[641]6.0823,[642]6.0888,[643]6.0912,[644]6.0926,[645]6.0911,[646]6.0953,[647]6.0917,[648]6.0926,[649]6.0928,[650]6.0968,[651]6.1019,[652]6.1030,[653]6.1068,[654]6.1003,[655]6.0996,

llama_print_timings: load time = 2722.23 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 1024717.17 ms / 335360 tokens ( 3.06 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 1052144.49 ms

13B, Q2_4 + Q5_1

main: seed = 1682869318 llama.cpp: loading model from ../build/junk3.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 5120 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 40 llama_model_load_internal: n_layer = 40 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 15 (mostly Q2_4) llama_model_load_internal: n_ff = 13824 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 13B llama_model_load_internal: ggml ctx size = 73.73 KB llama_model_load_internal: mem required = 8284.13 MB (+ 1608.00 MB per state) llama_init_from_file: kv self size = 400.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.66 seconds per pass - ETA 29 minutes
[1]3.9721,[2]4.4565,[3]5.2597,[4]5.9033,[5]6.0338,[6]5.9622,[7]6.1601,[8]6.2656,[9]6.5236,[10]6.7829,[11]7.0013,[12]7.0990,[13]7.0476,[14]7.1859,[15]7.4172,[16]7.0334,[17]6.9135,[18]6.8786,[19]6.5400,[20]6.5070,[21]6.4154,[22]6.2361,[23]6.1969,[24]6.0978,[25]6.1050,[26]5.9363,[27]5.7476,[28]5.6483,[29]5.5673,[30]5.4279,[31]5.3936,[32]5.4050,[33]5.3676,[34]5.4138,[35]5.4320,[36]5.4581,[37]5.4515,[38]5.4411,[39]5.4719,[40]5.5259,[41]5.5538,[42]5.5976,[43]5.5570,[44]5.6013,[45]5.6034,[46]5.5709,[47]5.5914,[48]5.5665,[49]5.5732,[50]5.5425,[51]5.5525,[52]5.5445,[53]5.5919,[54]5.5826,[55]5.5603,[56]5.5833,[57]5.6051,[58]5.6318,[59]5.6509,[60]5.6875,[61]5.6795,[62]5.7390,[63]5.7665,[64]5.7749,[65]5.8128,[66]5.8121,[67]5.8288,[68]5.8395,[69]5.8698,[70]5.9008,[71]5.9260,[72]5.9637,[73]6.0136,[74]6.0186,[75]6.0287,[76]6.0452,[77]6.0630,[78]6.0533,[79]6.0811,[80]6.0773,[81]6.0940,[82]6.0916,[83]6.0430,[84]6.0330,[85]6.0263,[86]6.0075,[87]5.9484,[88]5.9136,[89]5.8907,[90]5.8817,[91]5.9066,[92]5.9045,[93]5.9073,[94]5.9044,[95]5.9318,[96]5.9287,[97]5.9266,[98]5.9230,[99]5.9158,[100]5.9112,[101]5.9382,[102]5.9320,[103]5.9477,[104]5.9492,[105]5.9504,[106]5.9664,[107]5.9629,[108]5.9788,[109]5.9759,[110]5.9701,[111]5.9873,[112]6.0048,[113]6.0053,[114]6.0034,[115]6.0066,[116]5.9956,[117]5.9974,[118]6.0221,[119]6.0409,[120]6.0689,[121]6.0859,[122]6.1071,[123]6.1462,[124]6.1624,[125]6.1543,[126]6.1905,[127]6.2246,[128]6.2549,[129]6.2404,[130]6.2488,[131]6.2446,[132]6.2396,[133]6.2282,[134]6.2393,[135]6.2381,[136]6.2292,[137]6.2257,[138]6.2107,[139]6.2011,[140]6.2007,[141]6.1759,[142]6.1720,[143]6.1485,[144]6.1310,[145]6.1232,[146]6.1096,[147]6.1134,[148]6.1171,[149]6.1134,[150]6.1137,[151]6.1188,[152]6.1085,[153]6.0971,[154]6.0912,[155]6.0971,[156]6.0959,[157]6.1130,[158]6.1163,[159]6.1179,[160]6.1222,[161]6.1343,[162]6.1058,[163]6.0948,[164]6.0712,[165]6.0433,[166]6.0157,[167]5.9800,[168]5.9506,[169]5.9372,[170]5.9281,[171]5.9059,[172]5.8906,[173]5.8760,[174]5.8466,[175]5.8240,[176]5.8097,[177]5.7904,[178]5.7676,[179]5.7527,[180]5.7448,[181]5.7262,[182]5.7087,[183]5.6951,[184]5.6934,[185]5.6880,[186]5.6896,[187]5.6955,[188]5.6950,[189]5.7125,[190]5.7137,[191]5.7318,[192]5.7454,[193]5.7611,[194]5.7725,[195]5.7933,[196]5.8061,[197]5.8252,[198]5.8379,[199]5.8412,[200]5.8429,[201]5.8373,[202]5.8553,[203]5.8624,[204]5.8614,[205]5.8712,[206]5.8764,[207]5.8730,[208]5.8796,[209]5.8831,[210]5.8888,[211]5.8998,[212]5.9059,[213]5.9151,[214]5.9187,[215]5.9211,[216]5.9327,[217]5.9480,[218]5.9628,[219]5.9616,[220]5.9575,[221]5.9532,[222]5.9522,[223]5.9451,[224]5.9380,[225]5.9342,[226]5.9546,[227]5.9632,[228]5.9704,[229]5.9768,[230]5.9741,[231]5.9897,[232]5.9793,[233]5.9619,[234]5.9460,[235]5.9269,[236]5.9199,[237]5.9101,[238]5.9136,[239]5.9005,[240]5.8905,[241]5.8933,[242]5.8951,[243]5.8940,[244]5.8841,[245]5.8807,[246]5.8702,[247]5.8600,[248]5.8524,[249]5.8490,[250]5.8530,[251]5.8444,[252]5.8398,[253]5.8301,[254]5.8258,[255]5.8145,[256]5.7966,[257]5.7849,[258]5.7773,[259]5.7751,[260]5.7663,[261]5.7617,[262]5.7569,[263]5.7506,[264]5.7336,[265]5.7324,[266]5.7288,[267]5.7218,[268]5.7296,[269]5.7292,[270]5.7288,[271]5.7352,[272]5.7392,[273]5.7404,[274]5.7412,[275]5.7480,[276]5.7552,[277]5.7693,[278]5.7775,[279]5.7851,[280]5.7886,[281]5.7987,[282]5.8037,[283]5.8164,[284]5.8253,[285]5.8337,[286]5.8472,[287]5.8436,[288]5.8492,[289]5.8424,[290]5.8276,[291]5.8144,[292]5.8000,[293]5.7865,[294]5.7876,[295]5.7878,[296]5.7926,[297]5.7915,[298]5.7945,[299]5.7927,[300]5.7833,[301]5.7830,[302]5.7765,[303]5.7683,[304]5.7593,[305]5.7559,[306]5.7445,[307]5.7467,[308]5.7477,[309]5.7330,[310]5.7289,[311]5.7237,[312]5.7254,[313]5.7193,[314]5.7175,[315]5.7028,[316]5.7002,[317]5.6864,[318]5.6684,[319]5.6795,[320]5.6913,[321]5.6957,[322]5.6918,[323]5.6866,[324]5.6844,[325]5.6958,[326]5.6973,[327]5.6980,[328]5.7006,[329]5.7052,[330]5.7076,[331]5.7186,[332]5.7145,[333]5.7218,[334]5.7162,[335]5.7112,[336]5.7145,[337]5.7124,[338]5.7122,[339]5.7075,[340]5.7049,[341]5.7114,[342]5.7142,[343]5.7184,[344]5.7186,[345]5.7190,[346]5.7163,[347]5.7199,[348]5.7239,[349]5.7266,[350]5.7247,[351]5.7257,[352]5.7258,[353]5.7197,[354]5.7213,[355]5.7262,[356]5.7298,[357]5.7267,[358]5.7352,[359]5.7368,[360]5.7326,[361]5.7319,[362]5.7393,[363]5.7502,[364]5.7557,[365]5.7603,[366]5.7622,[367]5.7718,[368]5.7690,[369]5.7704,[370]5.7726,[371]5.7684,[372]5.7738,[373]5.7777,[374]5.7762,[375]5.7754,[376]5.7821,[377]5.7776,[378]5.7799,[379]5.7845,[380]5.7770,[381]5.7744,[382]5.7703,[383]5.7691,[384]5.7695,[385]5.7677,[386]5.7658,[387]5.7652,[388]5.7614,[389]5.7571,[390]5.7516,[391]5.7449,[392]5.7416,[393]5.7411,[394]5.7436,[395]5.7422,[396]5.7362,[397]5.7439,[398]5.7481,[399]5.7549,[400]5.7535,[401]5.7535,[402]5.7548,[403]5.7576,[404]5.7632,[405]5.7513,[406]5.7473,[407]5.7472,[408]5.7478,[409]5.7592,[410]5.7692,[411]5.7790,[412]5.7942,[413]5.8061,[414]5.8126,[415]5.8184,[416]5.8260,[417]5.8361,[418]5.8395,[419]5.8448,[420]5.8530,[421]5.8638,[422]5.8671,[423]5.8735,[424]5.8834,[425]5.8915,[426]5.8979,[427]5.9019,[428]5.9094,[429]5.9138,[430]5.9201,[431]5.9333,[432]5.9357,[433]5.9345,[434]5.9307,[435]5.9313,[436]5.9338,[437]5.9427,[438]5.9507,[439]5.9471,[440]5.9469,[441]5.9426,[442]5.9404,[443]5.9406,[444]5.9423,[445]5.9409,[446]5.9426,[447]5.9445,[448]5.9481,[449]5.9461,[450]5.9467,[451]5.9429,[452]5.9315,[453]5.9228,[454]5.9169,[455]5.9165,[456]5.9207,[457]5.9221,[458]5.9201,[459]5.9199,[460]5.9272,[461]5.9232,[462]5.9207,[463]5.9215,[464]5.9202,[465]5.9184,[466]5.9109,[467]5.9116,[468]5.9107,[469]5.9121,[470]5.9116,[471]5.9063,[472]5.9090,[473]5.9043,[474]5.9051,[475]5.8991,[476]5.8992,[477]5.8916,[478]5.8895,[479]5.8934,[480]5.8974,[481]5.8984,[482]5.8939,[483]5.8903,[484]5.8919,[485]5.8881,[486]5.8823,[487]5.8820,[488]5.8798,[489]5.8753,[490]5.8725,[491]5.8696,[492]5.8638,[493]5.8606,[494]5.8583,[495]5.8567,[496]5.8530,[497]5.8476,[498]5.8459,[499]5.8416,[500]5.8336,[501]5.8267,[502]5.8268,[503]5.8254,[504]5.8169,[505]5.8170,[506]5.8175,[507]5.8126,[508]5.8087,[509]5.8087,[510]5.8112,[511]5.8161,[512]5.8199,[513]5.8225,[514]5.8285,[515]5.8238,[516]5.8229,[517]5.8235,[518]5.8235,[519]5.8256,[520]5.8277,[521]5.8293,[522]5.8309,[523]5.8315,[524]5.8376,[525]5.8404,[526]5.8412,[527]5.8427,[528]5.8372,[529]5.8381,[530]5.8332,[531]5.8322,[532]5.8378,[533]5.8404,[534]5.8386,[535]5.8415,[536]5.8367,[537]5.8347,[538]5.8397,[539]5.8405,[540]5.8437,[541]5.8446,[542]5.8459,[543]5.8479,[544]5.8487,[545]5.8477,[546]5.8480,[547]5.8440,[548]5.8391,[549]5.8387,[550]5.8373,[551]5.8341,[552]5.8323,[553]5.8286,[554]5.8264,[555]5.8236,[556]5.8227,[557]5.8249,[558]5.8214,[559]5.8214,[560]5.8201,[561]5.8207,[562]5.8182,[563]5.8181,[564]5.8226,[565]5.8238,[566]5.8239,[567]5.8217,[568]5.8222,[569]5.8203,[570]5.8232,[571]5.8238,[572]5.8245,[573]5.8244,[574]5.8218,[575]5.8204,[576]5.8199,[577]5.8177,[578]5.8157,[579]5.8151,[580]5.8094,[581]5.8061,[582]5.8056,[583]5.8061,[584]5.8063,[585]5.8000,[586]5.7940,[587]5.7944,[588]5.7986,[589]5.8039,[590]5.8066,[591]5.8080,[592]5.8069,[593]5.8037,[594]5.8046,[595]5.8026,[596]5.8064,[597]5.8039,[598]5.8005,[599]5.8030,[600]5.8020,[601]5.8005,[602]5.8020,[603]5.8043,[604]5.8051,[605]5.8076,[606]5.8089,[607]5.8077,[608]5.8045,[609]5.8050,[610]5.8095,[611]5.8080,[612]5.8100,[613]5.8066,[614]5.8025,[615]5.7953,[616]5.7980,[617]5.7922,[618]5.7870,[619]5.7819,[620]5.7695,[621]5.7639,[622]5.7617,[623]5.7633,[624]5.7634,[625]5.7641,[626]5.7634,[627]5.7661,[628]5.7669,[629]5.7676,[630]5.7704,[631]5.7748,[632]5.7803,[633]5.7787,[634]5.7821,[635]5.7818,[636]5.7784,[637]5.7751,[638]5.7773,[639]5.7740,[640]5.7749,[641]5.7755,[642]5.7812,[643]5.7830,[644]5.7850,[645]5.7839,[646]5.7875,[647]5.7830,[648]5.7841,[649]5.7844,[650]5.7874,[651]5.7912,[652]5.7914,[653]5.7952,[654]5.7891,[655]5.7880,

llama_print_timings: load time = 3757.20 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 1660958.51 ms / 335360 tokens ( 4.95 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 1690237.58 ms

13B, Q3_4 + Q5_1

main: seed = 1682871034 llama.cpp: loading model from ../build/junk2.bin llama_model_load_internal: format = ggjt v1 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 5120 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 40 llama_model_load_internal: n_layer = 40 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 10 (mostly Q3_4) llama_model_load_internal: n_ff = 13824 llama_model_load_internal: n_parts = 1 llama_model_load_internal: model size = 13B llama_model_load_internal: ggml ctx size = 73.73 KB llama_model_load_internal: mem required = 9353.66 MB (+ 1608.00 MB per state) llama_init_from_file: kv self size = 400.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.78 seconds per pass - ETA 30 minutes
[1]3.8638,[2]4.2438,[3]5.0285,[4]5.4691,[5]5.6500,[6]5.5749,[7]5.7279,[8]5.8483,[9]6.0969,[10]6.3309,[11]6.5275,[12]6.5851,[13]6.5289,[14]6.6307,[15]6.8287,[16]6.5065,[17]6.4160,[18]6.3973,[19]6.1030,[20]6.0757,[21]6.0022,[22]5.8315,[23]5.8015,[24]5.7102,[25]5.7213,[26]5.5727,[27]5.3968,[28]5.2983,[29]5.2181,[30]5.0845,[31]5.0477,[32]5.0637,[33]5.0127,[34]5.0546,[35]5.0779,[36]5.0973,[37]5.0901,[38]5.0859,[39]5.1136,[40]5.1553,[41]5.1784,[42]5.2145,[43]5.1772,[44]5.2185,[45]5.2210,[46]5.1948,[47]5.2217,[48]5.2018,[49]5.2068,[50]5.1752,[51]5.1815,[52]5.1747,[53]5.2203,[54]5.2098,[55]5.1910,[56]5.2134,[57]5.2298,[58]5.2522,[59]5.2691,[60]5.3031,[61]5.2955,[62]5.3497,[63]5.3745,[64]5.3859,[65]5.4235,[66]5.4215,[67]5.4387,[68]5.4507,[69]5.4795,[70]5.5091,[71]5.5318,[72]5.5665,[73]5.6155,[74]5.6227,[75]5.6330,[76]5.6482,[77]5.6602,[78]5.6456,[79]5.6718,[80]5.6661,[81]5.6766,[82]5.6737,[83]5.6292,[84]5.6184,[85]5.6138,[86]5.5979,[87]5.5348,[88]5.4920,[89]5.4704,[90]5.4603,[91]5.4818,[92]5.4778,[93]5.4791,[94]5.4792,[95]5.5064,[96]5.5025,[97]5.4993,[98]5.4960,[99]5.4891,[100]5.4861,[101]5.5101,[102]5.5053,[103]5.5205,[104]5.5248,[105]5.5265,[106]5.5405,[107]5.5396,[108]5.5541,[109]5.5528,[110]5.5473,[111]5.5658,[112]5.5820,[113]5.5806,[114]5.5798,[115]5.5831,[116]5.5710,[117]5.5707,[118]5.5948,[119]5.6120,[120]5.6409,[121]5.6560,[122]5.6767,[123]5.7139,[124]5.7325,[125]5.7269,[126]5.7621,[127]5.7945,[128]5.8224,[129]5.8098,[130]5.8178,[131]5.8138,[132]5.8098,[133]5.7973,[134]5.8061,[135]5.8063,[136]5.7966,[137]5.7923,[138]5.7782,[139]5.7695,[140]5.7679,[141]5.7406,[142]5.7374,[143]5.7121,[144]5.6965,[145]5.6883,[146]5.6761,[147]5.6809,[148]5.6838,[149]5.6807,[150]5.6801,[151]5.6846,[152]5.6780,[153]5.6682,[154]5.6629,[155]5.6696,[156]5.6674,[157]5.6831,[158]5.6844,[159]5.6853,[160]5.6893,[161]5.7012,[162]5.6757,[163]5.6658,[164]5.6443,[165]5.6185,[166]5.5949,[167]5.5631,[168]5.5362,[169]5.5232,[170]5.5140,[171]5.4926,[172]5.4804,[173]5.4673,[174]5.4398,[175]5.4193,[176]5.4058,[177]5.3890,[178]5.3683,[179]5.3554,[180]5.3479,[181]5.3314,[182]5.3149,[183]5.3022,[184]5.3012,[185]5.2933,[186]5.2949,[187]5.3005,[188]5.2977,[189]5.3145,[190]5.3146,[191]5.3317,[192]5.3451,[193]5.3599,[194]5.3711,[195]5.3902,[196]5.4024,[197]5.4218,[198]5.4356,[199]5.4370,[200]5.4385,[201]5.4321,[202]5.4455,[203]5.4518,[204]5.4465,[205]5.4555,[206]5.4606,[207]5.4563,[208]5.4615,[209]5.4655,[210]5.4714,[211]5.4821,[212]5.4887,[213]5.4977,[214]5.5005,[215]5.5040,[216]5.5156,[217]5.5323,[218]5.5463,[219]5.5465,[220]5.5432,[221]5.5386,[222]5.5383,[223]5.5318,[224]5.5251,[225]5.5218,[226]5.5412,[227]5.5471,[228]5.5546,[229]5.5616,[230]5.5581,[231]5.5732,[232]5.5628,[233]5.5476,[234]5.5337,[235]5.5123,[236]5.5070,[237]5.4979,[238]5.5009,[239]5.4895,[240]5.4804,[241]5.4834,[242]5.4855,[243]5.4848,[244]5.4748,[245]5.4715,[246]5.4611,[247]5.4513,[248]5.4448,[249]5.4415,[250]5.4449,[251]5.4366,[252]5.4321,[253]5.4228,[254]5.4190,[255]5.4096,[256]5.3932,[257]5.3831,[258]5.3762,[259]5.3759,[260]5.3675,[261]5.3624,[262]5.3579,[263]5.3528,[264]5.3296,[265]5.3294,[266]5.3264,[267]5.3203,[268]5.3272,[269]5.3268,[270]5.3278,[271]5.3341,[272]5.3373,[273]5.3387,[274]5.3400,[275]5.3461,[276]5.3526,[277]5.3652,[278]5.3739,[279]5.3822,[280]5.3861,[281]5.3953,[282]5.4006,[283]5.4136,[284]5.4226,[285]5.4297,[286]5.4423,[287]5.4388,[288]5.4440,[289]5.4380,[290]5.4237,[291]5.4105,[292]5.3974,[293]5.3855,[294]5.3862,[295]5.3865,[296]5.3913,[297]5.3903,[298]5.3920,[299]5.3896,[300]5.3805,[301]5.3810,[302]5.3746,[303]5.3659,[304]5.3587,[305]5.3564,[306]5.3456,[307]5.3488,[308]5.3494,[309]5.3358,[310]5.3325,[311]5.3280,[312]5.3295,[313]5.3237,[314]5.3223,[315]5.3092,[316]5.3057,[317]5.2930,[318]5.2766,[319]5.2871,[320]5.2986,[321]5.3035,[322]5.3005,[323]5.2960,[324]5.2941,[325]5.3036,[326]5.3052,[327]5.3061,[328]5.3092,[329]5.3144,[330]5.3167,[331]5.3270,[332]5.3231,[333]5.3307,[334]5.3259,[335]5.3205,[336]5.3228,[337]5.3216,[338]5.3212,[339]5.3170,[340]5.3142,[341]5.3207,[342]5.3238,[343]5.3283,[344]5.3288,[345]5.3303,[346]5.3287,[347]5.3323,[348]5.3360,[349]5.3380,[350]5.3362,[351]5.3372,[352]5.3372,[353]5.3320,[354]5.3332,[355]5.3378,[356]5.3408,[357]5.3376,[358]5.3456,[359]5.3477,[360]5.3443,[361]5.3442,[362]5.3513,[363]5.3619,[364]5.3670,[365]5.3709,[366]5.3726,[367]5.3815,[368]5.3790,[369]5.3802,[370]5.3821,[371]5.3781,[372]5.3829,[373]5.3872,[374]5.3851,[375]5.3848,[376]5.3909,[377]5.3874,[378]5.3900,[379]5.3941,[380]5.3870,[381]5.3839,[382]5.3800,[383]5.3783,[384]5.3780,[385]5.3769,[386]5.3759,[387]5.3759,[388]5.3730,[389]5.3694,[390]5.3640,[391]5.3582,[392]5.3547,[393]5.3541,[394]5.3571,[395]5.3565,[396]5.3512,[397]5.3575,[398]5.3618,[399]5.3687,[400]5.3677,[401]5.3684,[402]5.3694,[403]5.3719,[404]5.3775,[405]5.3626,[406]5.3584,[407]5.3576,[408]5.3585,[409]5.3694,[410]5.3786,[411]5.3883,[412]5.4022,[413]5.4125,[414]5.4185,[415]5.4245,[416]5.4314,[417]5.4411,[418]5.4435,[419]5.4482,[420]5.4558,[421]5.4656,[422]5.4689,[423]5.4741,[424]5.4831,[425]5.4906,[426]5.4966,[427]5.5008,[428]5.5078,[429]5.5114,[430]5.5176,[431]5.5305,[432]5.5337,[433]5.5328,[434]5.5292,[435]5.5305,[436]5.5332,[437]5.5416,[438]5.5489,[439]5.5459,[440]5.5453,[441]5.5407,[442]5.5392,[443]5.5403,[444]5.5420,[445]5.5411,[446]5.5430,[447]5.5452,[448]5.5483,[449]5.5468,[450]5.5479,[451]5.5449,[452]5.5295,[453]5.5197,[454]5.5145,[455]5.5148,[456]5.5189,[457]5.5200,[458]5.5180,[459]5.5180,[460]5.5252,[461]5.5215,[462]5.5181,[463]5.5165,[464]5.5162,[465]5.5143,[466]5.5069,[467]5.5058,[468]5.5037,[469]5.5049,[470]5.5041,[471]5.4993,[472]5.5002,[473]5.4956,[474]5.4943,[475]5.4876,[476]5.4859,[477]5.4775,[478]5.4748,[479]5.4757,[480]5.4784,[481]5.4786,[482]5.4740,[483]5.4698,[484]5.4706,[485]5.4645,[486]5.4580,[487]5.4569,[488]5.4541,[489]5.4487,[490]5.4458,[491]5.4424,[492]5.4359,[493]5.4332,[494]5.4314,[495]5.4290,[496]5.4250,[497]5.4188,[498]5.4162,[499]5.4126,[500]5.4045,[501]5.3978,[502]5.3970,[503]5.3960,[504]5.3887,[505]5.3887,[506]5.3894,[507]5.3838,[508]5.3802,[509]5.3806,[510]5.3827,[511]5.3868,[512]5.3907,[513]5.3932,[514]5.3987,[515]5.3948,[516]5.3937,[517]5.3936,[518]5.3936,[519]5.3960,[520]5.3972,[521]5.3983,[522]5.4000,[523]5.4006,[524]5.4058,[525]5.4084,[526]5.4089,[527]5.4106,[528]5.4052,[529]5.4058,[530]5.4021,[531]5.4018,[532]5.4066,[533]5.4091,[534]5.4073,[535]5.4099,[536]5.4055,[537]5.4036,[538]5.4085,[539]5.4093,[540]5.4112,[541]5.4111,[542]5.4125,[543]5.4145,[544]5.4157,[545]5.4145,[546]5.4147,[547]5.4114,[548]5.4073,[549]5.4074,[550]5.4054,[551]5.4028,[552]5.4010,[553]5.3980,[554]5.3957,[555]5.3938,[556]5.3928,[557]5.3943,[558]5.3909,[559]5.3915,[560]5.3903,[561]5.3906,[562]5.3881,[563]5.3879,[564]5.3921,[565]5.3932,[566]5.3936,[567]5.3919,[568]5.3927,[569]5.3912,[570]5.3937,[571]5.3951,[572]5.3961,[573]5.3968,[574]5.3937,[575]5.3920,[576]5.3915,[577]5.3898,[578]5.3881,[579]5.3883,[580]5.3830,[581]5.3801,[582]5.3801,[583]5.3810,[584]5.3814,[585]5.3756,[586]5.3698,[587]5.3703,[588]5.3746,[589]5.3796,[590]5.3827,[591]5.3843,[592]5.3831,[593]5.3792,[594]5.3805,[595]5.3790,[596]5.3829,[597]5.3809,[598]5.3778,[599]5.3806,[600]5.3795,[601]5.3784,[602]5.3789,[603]5.3818,[604]5.3826,[605]5.3855,[606]5.3871,[607]5.3854,[608]5.3826,[609]5.3834,[610]5.3875,[611]5.3863,[612]5.3887,[613]5.3859,[614]5.3819,[615]5.3760,[616]5.3784,[617]5.3734,[618]5.3688,[619]5.3644,[620]5.3533,[621]5.3482,[622]5.3463,[623]5.3477,[624]5.3481,[625]5.3489,[626]5.3484,[627]5.3511,[628]5.3519,[629]5.3525,[630]5.3553,[631]5.3598,[632]5.3644,[633]5.3633,[634]5.3663,[635]5.3660,[636]5.3623,[637]5.3585,[638]5.3606,[639]5.3574,[640]5.3579,[641]5.3584,[642]5.3637,[643]5.3654,[644]5.3676,[645]5.3662,[646]5.3699,[647]5.3651,[648]5.3664,[649]5.3667,[650]5.3695,[651]5.3737,[652]5.3742,[653]5.3779,[654]5.3724,[655]5.3715,

llama_print_timings: load time = 3902.02 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 1737619.18 ms / 335360 tokens ( 5.18 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 1764928.09 ms

The text was updated successfully, but these errors were encountered:

ggerganov · 2023-04-30T18:03:02Z

Great idea!
Do you think the RMSE metric can be somehow used to determine which quantization mode is best for a given layer? (in order to avoid recomputing perplexity)
Or maybe some other metric / heuristic? Entropy?

cmp-nct · 2023-05-01T15:06:09Z

I'd not compare it to audio, that variable bitrate is something totally different.
But I also considered trying that, what would be needed is an automated benchmarking tool that can permutate the quantization per layer on the fly (by using the full model on a box with enough RAM) and return quality score and performance, vram for each variant.
That could be kept running and in the end we'd have a nice table that shows which combinations are most successful.

There would also be merit in testing to keep the encoder block higher res and the decoder lowres also the early layers high res and the late layers lower resolution.
My reasoning is that the error accumulates with each layer, so a small error introduced in layer 1 might multiply into a big error in layer 40 but if we start with a higher quality calculation that gradually lowers we might get better results.
(That's why encoder and decoder should also be treated differently)

unbounded · 2023-05-04T21:23:06Z

My guess is that a locally adaptive variable bit rate would require a major change to ggml.

The easiest way I can think of would be to vary allocation within a row - since we always quantize/dequantize/dot vector by row. You would have to allocate a "worst case" length since the row size is fixed, but you could still save memory accesses on more compressible rows.
I have yet to come up with an encoding that would be close to performant, though.

ikawrakow · 2023-05-05T06:09:54Z

Suppose we want to quantize only half of the feed_forward.w2 and attention.wv tensors with a higher number of bits to reduce the model size compared to what I posted above. What would be the best strategy to do that? My initial intuition was that we pick the layers with the highest rmse to quantize with more bits. That strategy turned out to be the worst. @cmp-nct suggests that it should be the 1st half of the layers as errors made there would magnify down the network. This is better, but still not the best. Here is a list of strategies given in decreasing order of performance:

Quantize first 1/4, then every 3rd layer with more bits
Quantize every second layer with more bits
Quantize the first half with more bits
Quantize the highest-rmse half with more bits (which basically turns out to be the second half of the layers).

ikawrakow · 2023-05-05T06:44:55Z

I have added a bunch of new quantizations to play with in a quest to find the smallest model that gives a "reasonable" performance:

Q6_0: Same as Q4_0 and Q5_0, but uses 6 bits. This results in a perplexity of 5.9662 (7B) or 5.2540 (13B)
Q5_K: "Super-blocks" of 256 weights. Quants are 5 bits. There is a single fp16 scale per super-block. Each set (block) of 16 weights within a "super-block" has its own scale quantized to 8 bits. This results in 5.5625 bits per weight in total, so almost the same as Q5_0. It gives a perplexity of 5.9881 (7B) or 5.2719 (13B), so better than Q5_0 at the same model size.
Q3_K: As Q5_K, but using 3 bits per quant, so 3.5625 bits per weight.
Q4_K: As Q5_K, but using 4 bits per quant, so 4.5625 bits per weight

Here are some model sizes and perplexities where

output.weight is always quantized with Q6_0
feed_forward.w2 and attention.wv are quantized with a mix of Q3_K and Q5_K as follows
- "Tiny": quantize first quarter, then every 3rd of feed_forward.w2 and attention.wv with Q5_K
- "Small": quantize first quarter, then every 3rd of attention.wv with Q5_K, all layers of feed_forward.w2 with Q5_K
- "Normal" quantize all layers of feed_forward.w2 and attention.wv with Q5_K
All other tensors are quantized with Q3_K
A "Large" model, which uses the same strategy as "Tiny", but Q3_K is replaced with Q4_K quantization.

Model	File size	7B Perplexity
Tiny	3.07G	6.3063
Small	3.24G	6.2322
Normal	3.30G	6.1688
Large	3.72G	6.0347

They all satisfy the requirement to be usable on a phone or within a web browser by a comfortable margin. Considering that the llama.cpp 7B quantized perplexity given on the main page was 6.59 just a few weeks ago, I think their performance should be seen as "reasonable". The "Large" model is ~10% smaller than any of the 4-bit quantizations currently listed on the main page, while outperforming them by a comfortable margin.

If I arbitrarily define a perplexity of 6.59 as the threshold for "reasonable" performance, I'm not able to find a 2-bit quantization scheme that has a "reasonable" performance and is smaller than the 3-bit models in the above table.

cmp-nct · 2023-05-18T16:14:59Z

One thing that came to my mind during my latest tests:
when running multiplications on Nvidia GPUs we are forced into 16 bit multiplications. Only the next generation of GPUs will likely support the native FP8 format (currently only on Hopper GPUs).
Also running lower precision on custom cuda kernels does not seem to work well (extreme performance loss), I suppose a high effort project could close the performance gap (whatever magic clBlast is doing to get so close to cuBLAS)

When thinking about mixing precision of layers, we should keep in mind that caluclations on GPU are already 16 bit. Any lower precision does not gain performance for those cases, it loses performance from conversions.

Update: that's not relevant anymore

EwoutH · 2023-06-04T21:27:56Z

I encountered some other interesting methods and benchmarks here, just as background https://github.com/megvii-research/Sparsebit/blob/main/large_language_models/llama/quantization/README.md

ikawrakow · 2023-06-07T08:02:32Z

Closed via #1684

NightMachinery · 2023-09-19T04:06:46Z

One thing that came to my mind during my latest tests: when running multiplications on Nvidia GPUs we are forced into 16 bit multiplications. Only the next generation of GPUs will likely support the native FP8 format (currently only on Hopper GPUs). Also running lower precision on custom cuda kernels does not seem to work well (extreme performance loss), I suppose a high effort project could close the performance gap (whatever magic clBlast is doing to get so close to cuBLAS)

When thinking about mixing precision of layers, we should keep in mind that caluclations on GPU are already 16 bit. Any lower precision does not gain performance for those cases, it loses performance from conversions.

Update: that's not relevant anymore

How is it not relevant anymore?

What does clBlast do? Is it better than cuBLAS when using quantizied models?

mrtpk · 2024-10-15T08:25:26Z

After quantizing the weights, are they dequantized to float during computation (e.g., while computing dot product)? Or is the computation also done in the integer scale? Thanks.

ikawrakow added enhancement New feature or request generation quality Quality of model output labels Apr 30, 2023

ikawrakow mentioned this issue Apr 30, 2023

QX_4 quantization #1240

Closed

ikawrakow added the Less than 4 bits Efforts related to viable quantized models using <4 bits label Apr 30, 2023

ggerganov added this to ggml : sub-4-bit integer quantization Apr 30, 2023

ggerganov moved this to Todo in ggml : sub-4-bit integer quantization Apr 30, 2023

unbounded mentioned this issue May 8, 2023

Continuous layouts for quantization q4_0c #1073

Closed

4 tasks

ggerganov mentioned this issue May 19, 2023

ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 #1508

Merged

3 tasks

ziwang-com mentioned this issue May 26, 2023

QLoRA 4 位量化 ziwang-com/zero-lora#58

Open

ikawrakow mentioned this issue Jun 3, 2023

k-quants #1684

Merged

ikawrakow closed this as completed Jun 7, 2023

github-project-automation bot moved this from Todo to Done in ggml : sub-4-bit integer quantization Jun 7, 2023

cmp-nct mentioned this issue Jun 12, 2023

[Feature Request] SpQR quantisation #1802

Closed

BarfingLemurs mentioned this issue Nov 22, 2023

GPTQ / ExLlamaV2 (EXL2) quantisation #4165

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Variable bit rate quantization #1256

Variable bit rate quantization #1256

ikawrakow commented Apr 30, 2023 •

edited

Loading

ggerganov commented Apr 30, 2023

cmp-nct commented May 1, 2023 •

edited

Loading

unbounded commented May 4, 2023

ikawrakow commented May 5, 2023

ikawrakow commented May 5, 2023 •

edited

Loading

cmp-nct commented May 18, 2023 •

edited

Loading

EwoutH commented Jun 4, 2023

ikawrakow commented Jun 7, 2023

NightMachinery commented Sep 19, 2023

mrtpk commented Oct 15, 2024

Variable bit rate quantization #1256

Variable bit rate quantization #1256

Comments

ikawrakow commented Apr 30, 2023 • edited Loading

ggerganov commented Apr 30, 2023

cmp-nct commented May 1, 2023 • edited Loading

unbounded commented May 4, 2023

ikawrakow commented May 5, 2023

ikawrakow commented May 5, 2023 • edited Loading

cmp-nct commented May 18, 2023 • edited Loading

EwoutH commented Jun 4, 2023

ikawrakow commented Jun 7, 2023

NightMachinery commented Sep 19, 2023

mrtpk commented Oct 15, 2024

ikawrakow commented Apr 30, 2023 •

edited

Loading

cmp-nct commented May 1, 2023 •

edited

Loading

ikawrakow commented May 5, 2023 •

edited

Loading

cmp-nct commented May 18, 2023 •

edited

Loading