ggml : alternative Q4_3 format + implementation #1108

Closed · wants to merge 2 commits
Conversation

ggerganov (Owner) commented Apr 21, 2023

#define QK4_3 32
typedef struct {
    ggml_fp16_t d0;        // delta
    ggml_fp16_t d1;        // delta
    ggml_fp16_t m;         // min
    uint8_t qs[QK4_3 / 2]; // nibbles / quants
} block_q4_3;
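The corresponding kernels are not shown in this snippet; below is only a sketch of how these fields might be used during dequantization, assuming the low/high nibbles of each byte encode a consecutive pair of values, that d0 scales the first 16 values while d1 scales the last 16, and that the single min m is shared by the whole block (the function name and nibble mapping are illustrative, not taken from the patch):

// Hypothetical sketch, not from the PR: one way the fields above could be used.
static void dequantize_block_q4_3_sketch(const block_q4_3 * restrict x, float * restrict y) {
    const float d0 = GGML_FP16_TO_FP32(x->d0);
    const float d1 = GGML_FP16_TO_FP32(x->d1);
    const float m  = GGML_FP16_TO_FP32(x->m);

    for (int l = 0; l < QK4_3; l += 2) {
        const uint8_t vi = x->qs[l/2];
        // assumption: first half of the block uses d0, second half uses d1
        const float   d  = l < QK4_3/2 ? d0 : d1;

        y[l + 0] = (vi & 0x0F)*d + m; // low nibble
        y[l + 1] = (vi >>   4)*d + m; // high nibble
    }
}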

Running a perplexity test to see how much we lose from having a single min factor in the structure instead of two
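For reference, the two-min baseline being compared against is the Q4_3 layout on master, which (reproduced from memory, so treat the exact definition as approximate) stores one delta and one min per 16-value block, i.e. two mins for every 32 values:

#define QK4_3 16
typedef struct {
    ggml_fp16_t d;         // delta
    ggml_fp16_t m;         // min
    uint8_t qs[QK4_3 / 2]; // nibbles / quants
} block_q4_3;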

llama_print_timings:      sample time =    56.68 ms /    64 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time =   448.06 ms /     8 tokens (   56.01 ms per token)
llama_print_timings:        eval time =  3177.30 ms /    63 runs   (   50.43 ms per run)
llama_print_timings:       total time =  3691.84 ms

Perplexity results

Final ppl is [655] 6.1000

$  make clean && make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_3.bin -f ./build/wiki.test.raw -t 8 > ppl-q4_3a.txt 
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
main
quantize
quantize-stats
perplexity
embedding
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
ggml.c:1120:13: warning: unused function 'quantize_row_q4_2_reference' [-Wunused-function]
static void quantize_row_q4_2_reference(const float * restrict x, block_q4_2 * restrict y, int k) {
            ^
ggml.c:3243:20: warning: unused function 'ggml_vec_silu_f16' [-Wunused-function]
inline static void ggml_vec_silu_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
                   ^
ggml.c:3693:19: warning: unused function 'ggml_up64' [-Wunused-function]
static inline int ggml_up64(int n) {
                  ^
3 warnings generated.
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity  -framework Accelerate
main: seed = 1682099926
llama.cpp: loading model from ./models/7B/ggml-model-q4_3.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 6 (mostly Q4_3)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 6210.95 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
16.06 seconds per pass - ETA 2 hours 55 minutes
[1]4.3839,[2]4.8155,[3]5.6837,[4]6.3070,[5]6.4175,[6]6.3773,[7]6.5560,[8]6.6590,[9]6.9950,[10]7.2612,[11]7.4741,[12]7.5144,[13]7.4453,[14]7.4987,[15]7.7433,[16]7.3598,[17]7.2445,[18]7.2006,[19]6.8344,[20]6.8218,[21]6.7264,[22]6.5580,[23]6.5299,[24]6.4332,[25]6.4411,[26]6.2826,[27]6.1132,[28]6.0182,[29]5.9295,[30]5.7711,[31]5.7404,[32]5.7573,[33]5.7005,[34]5.7339,[35]5.7540,[36]5.7940,[37]5.7947,[38]5.8015,[39]5.8337,[40]5.8829,[41]5.8941,[42]5.9340,[43]5.8936,[44]5.9497,[45]5.9529,[46]5.9245,[47]5.9448,[48]5.9186,[49]5.9190,[50]5.8771,[51]5.8728,[52]5.8612,[53]5.9091,[54]5.8899,[55]5.8689,[56]5.8954,[57]5.9146,[58]5.9330,[59]5.9531,[60]5.9950,[61]5.9845,[62]6.0411,[63]6.0713,[64]6.0859,[65]6.1292,[66]6.1371,[67]6.1552,[68]6.1700,[69]6.1951,[70]6.2242,[71]6.2461,[72]6.2765,[73]6.3322,[74]6.3376,[75]6.3525,[76]6.3656,[77]6.3777,[78]6.3647,[79]6.3915,[80]6.3858,[81]6.4033,[82]6.4088,[83]6.3563,[84]6.3395,[85]6.3286,[86]6.3075,[87]6.2495,[88]6.2262,[89]6.2066,[90]6.1908,[91]6.2149,[92]6.2090,[93]6.2097,[94]6.2072,[95]6.2342,[96]6.2327,[97]6.2272,[98]6.2205,[99]6.2071,[100]6.2056,[101]6.2292,[102]6.2239,[103]6.2428,[104]6.2500,[105]6.2491,[106]6.2660,[107]6.2664,[108]6.2775,[109]6.2710,[110]6.2665,[111]6.2872,[112]6.3072,[113]6.3101,[114]6.3066,[115]6.3120,[116]6.3029,[117]6.3088,[118]6.3367,[119]6.3586,[120]6.3934,[121]6.4090,[122]6.4324,[123]6.4690,[124]6.4858,[125]6.4767,[126]6.5167,[127]6.5534,[128]6.5835,[129]6.5674,[130]6.5765,[131]6.5711,[132]6.5629,[133]6.5503,[134]6.5595,[135]6.5564,[136]6.5448,[137]6.5380,[138]6.5209,[139]6.5096,[140]6.5057,[141]6.4785,[142]6.4748,[143]6.4463,[144]6.4260,[145]6.4186,[146]6.4070,[147]6.4104,[148]6.4099,[149]6.4049,[150]6.4012,[151]6.4034,[152]6.3927,[153]6.3761,[154]6.3673,[155]6.3736,[156]6.3687,[157]6.3854,[158]6.3896,[159]6.3939,[160]6.3958,[161]6.4078,[162]6.3798,[163]6.3676,[164]6.3447,[165]6.3134,[166]6.2866,[167]6.2487,[168]6.2177,[169]6.2039,[170]6.1920,[171]6.1660,[172]6.1485,[173]6.1315,[174]6.1019,[175]6.0799,[176]6.0673,[177]6.0469,[178]6.0237,[179]6.0072,[180]5.9974,[181]5.9754,[182]5.9578,[183]5.9440,[184]5.9428,[185]5.9353,[186]5.9358,[187]5.9424,[188]5.9387,[189]5.9560,[190]5.9572,[191]5.9784,[192]5.9937,[193]6.0107,[194]6.0227,[195]6.0437,[196]6.0597,[197]6.0800,[198]6.0951,[199]6.0981,[200]6.1034,[201]6.0990,[202]6.1175,[203]6.1243,[204]6.1236,[205]6.1342,[206]6.1410,[207]6.1371,[208]6.1457,[209]6.1502,[210]6.1550,[211]6.1649,[212]6.1731,[213]6.1830,[214]6.1861,[215]6.1885,[216]6.2025,[217]6.2197,[218]6.2329,[219]6.2329,[220]6.2288,[221]6.2233,[222]6.2215,[223]6.2120,[224]6.2054,[225]6.2014,[226]6.2217,[227]6.2310,[228]6.2368,[229]6.2430,[230]6.2398,[231]6.2558,[232]6.2438,[233]6.2272,[234]6.2123,[235]6.1945,[236]6.1882,[237]6.1782,[238]6.1805,[239]6.1658,[240]6.1550,[241]6.1569,[242]6.1603,[243]6.1584,[244]6.1476,[245]6.1445,[246]6.1338,[247]6.1223,[248]6.1152,[249]6.1120,[250]6.1169,[251]6.1104,[252]6.1069,[253]6.0976,[254]6.0929,[255]6.0812,[256]6.0634,[257]6.0514,[258]6.0435,[259]6.0413,[260]6.0331,[261]6.0292,[262]6.0236,[263]6.0177,[264]5.9977,[265]5.9975,[266]5.9955,[267]5.9890,[268]5.9979,[269]5.9965,[270]5.9965,[271]6.0041,[272]6.0077,[273]6.0077,[274]6.0099,[275]6.0183,[276]6.0239,[277]6.0396,[278]6.0493,[279]6.0584,[280]6.0613,[281]6.0714,[282]6.0771,[283]6.0921,[284]6.1004,[285]6.1085,[286]6.1211,[287]6.1209,[288]6.1266,[289]6.1185,[290]6.1028,[291]6.0872,[292]6.0719,[293]6.0590,[294]6.0605,[295]6.0590,[296]6.0637,[297]6.0621,[298]6.0652,[299]6.0626,[300]6.0518,[301]6.0513,[302]6.0435,[303]6.0344,[304]6.0256,[305]6.0220,[30
6]6.0097,[307]6.0115,[308]6.0141,[309]5.9983,[310]5.9933,[311]5.9865,[312]5.9891,[313]5.9836,[314]5.9821,[315]5.9666,[316]5.9616,[317]5.9453,[318]5.9253,[319]5.9365,[320]5.9486,[321]5.9529,[322]5.9489,[323]5.9424,[324]5.9396,[325]5.9502,[326]5.9503,[327]5.9525,[328]5.9562,[329]5.9618,[330]5.9649,[331]5.9771,[332]5.9743,[333]5.9811,[334]5.9758,[335]5.9695,[336]5.9731,[337]5.9707,[338]5.9701,[339]5.9650,[340]5.9611,[341]5.9691,[342]5.9720,[343]5.9765,[344]5.9768,[345]5.9771,[346]5.9744,[347]5.9780,[348]5.9815,[349]5.9837,[350]5.9806,[351]5.9810,[352]5.9812,[353]5.9752,[354]5.9763,[355]5.9817,[356]5.9848,[357]5.9816,[358]5.9908,[359]5.9932,[360]5.9904,[361]5.9902,[362]5.9967,[363]6.0077,[364]6.0140,[365]6.0196,[366]6.0213,[367]6.0296,[368]6.0266,[369]6.0275,[370]6.0291,[371]6.0236,[372]6.0284,[373]6.0328,[374]6.0311,[375]6.0310,[376]6.0376,[377]6.0329,[378]6.0353,[379]6.0414,[380]6.0339,[381]6.0305,[382]6.0255,[383]6.0248,[384]6.0244,[385]6.0234,[386]6.0232,[387]6.0228,[388]6.0195,[389]6.0145,[390]6.0079,[391]6.0002,[392]5.9962,[393]5.9946,[394]5.9976,[395]5.9963,[396]5.9888,[397]5.9953,[398]5.9993,[399]6.0073,[400]6.0070,[401]6.0083,[402]6.0095,[403]6.0115,[404]6.0179,[405]6.0087,[406]6.0059,[407]6.0057,[408]6.0076,[409]6.0189,[410]6.0300,[411]6.0413,[412]6.0570,[413]6.0680,[414]6.0760,[415]6.0814,[416]6.0896,[417]6.1016,[418]6.1051,[419]6.1121,[420]6.1210,[421]6.1323,[422]6.1362,[423]6.1432,[424]6.1538,[425]6.1626,[426]6.1691,[427]6.1738,[428]6.1819,[429]6.1872,[430]6.1953,[431]6.2089,[432]6.2129,[433]6.2120,[434]6.2076,[435]6.2087,[436]6.2112,[437]6.2209,[438]6.2284,[439]6.2252,[440]6.2241,[441]6.2192,[442]6.2177,[443]6.2189,[444]6.2197,[445]6.2178,[446]6.2199,[447]6.2233,[448]6.2273,[449]6.2249,[450]6.2257,[451]6.2215,[452]6.2090,[453]6.2008,[454]6.1952,[455]6.1961,[456]6.2011,[457]6.2031,[458]6.2011,[459]6.2017,[460]6.2103,[461]6.2077,[462]6.2066,[463]6.2104,[464]6.2094,[465]6.2069,[466]6.1995,[467]6.2004,[468]6.2002,[469]6.2022,[470]6.2028,[471]6.1983,[472]6.2034,[473]6.1980,[474]6.1995,[475]6.1935,[476]6.1954,[477]6.1887,[478]6.1879,[479]6.1941,[480]6.1984,[481]6.2003,[482]6.1962,[483]6.1922,[484]6.1943,[485]6.1926,[486]6.1871,[487]6.1873,[488]6.1849,[489]6.1802,[490]6.1781,[491]6.1754,[492]6.1699,[493]6.1671,[494]6.1653,[495]6.1650,[496]6.1612,[497]6.1558,[498]6.1540,[499]6.1495,[500]6.1402,[501]6.1338,[502]6.1341,[503]6.1334,[504]6.1246,[505]6.1267,[506]6.1276,[507]6.1223,[508]6.1184,[509]6.1178,[510]6.1210,[511]6.1258,[512]6.1294,[513]6.1313,[514]6.1376,[515]6.1321,[516]6.1312,[517]6.1321,[518]6.1316,[519]6.1346,[520]6.1369,[521]6.1381,[522]6.1409,[523]6.1416,[524]6.1472,[525]6.1506,[526]6.1515,[527]6.1533,[528]6.1483,[529]6.1489,[530]6.1436,[531]6.1421,[532]6.1471,[533]6.1495,[534]6.1478,[535]6.1498,[536]6.1444,[537]6.1421,[538]6.1472,[539]6.1481,[540]6.1517,[541]6.1522,[542]6.1534,[543]6.1550,[544]6.1559,[545]6.1540,[546]6.1550,[547]6.1509,[548]6.1457,[549]6.1457,[550]6.1431,[551]6.1396,[552]6.1374,[553]6.1336,[554]6.1315,[555]6.1284,[556]6.1279,[557]6.1303,[558]6.1267,[559]6.1264,[560]6.1266,[561]6.1270,[562]6.1246,[563]6.1242,[564]6.1284,[565]6.1305,[566]6.1303,[567]6.1283,[568]6.1290,[569]6.1274,[570]6.1303,[571]6.1306,[572]6.1316,[573]6.1313,[574]6.1279,[575]6.1274,[576]6.1271,[577]6.1255,[578]6.1237,[579]6.1240,[580]6.1178,[581]6.1142,[582]6.1134,[583]6.1143,[584]6.1145,[585]6.1070,[586]6.1001,[587]6.1007,[588]6.1054,[589]6.1108,[590]6.1134,[591]6.1154,[592]6.1144,[593]6.1114,[594]6.1123,[595]6.1101,[596]6.1136,[597]6.1114,[598]6.1085,[599]6.1107,[600]6.1103,[601]6.1091,[602]6
.1105,[603]6.1132,[604]6.1141,[605]6.1175,[606]6.1199,[607]6.1184,[608]6.1152,[609]6.1159,[610]6.1196,[611]6.1178,[612]6.1203,[613]6.1166,[614]6.1118,[615]6.1043,[616]6.1070,[617]6.1010,[618]6.0962,[619]6.0906,[620]6.0768,[621]6.0701,[622]6.0685,[623]6.0701,[624]6.0705,[625]6.0706,[626]6.0695,[627]6.0718,[628]6.0719,[629]6.0716,[630]6.0750,[631]6.0806,[632]6.0864,[633]6.0850,[634]6.0884,[635]6.0890,[636]6.0859,[637]6.0825,[638]6.0850,[639]6.0820,[640]6.0830,[641]6.0831,[642]6.0895,[643]6.0918,[644]6.0928,[645]6.0909,[646]6.0952,[647]6.0913,[648]6.0924,[649]6.0926,[650]6.0963,[651]6.1018,[652]6.1029,[653]6.1068,[654]6.1006,[655]6.1000,

llama_print_timings:        load time = 16626.65 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 5703556.31 ms / 335360 tokens (   17.01 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 5738455.43 ms

real	95m38.709s
user	138m57.455s
sys	4m20.864s

For comparison, with the approach on master, we get: 6.0617

This way we always use the same type of instruction across all quantizations.
This one achieves 50 ms / token on M1 Pro.
ggerganov (Owner, Author) commented:

#1109 looks more promising

@ggerganov ggerganov closed this Apr 21, 2023
@ggerganov ggerganov deleted the q4_3a branch April 24, 2023 19:18