-
F16 was the focus for ML for a long time; both it and BF16 got hardware support in GPUs for exactly this reason. It cuts memory use in half, and computation can also be faster. It still isn't well supported by CPUs; for decades the assumption was that we'd move from F32 to F64 for ever more precision. Even the old Intel 8087 coprocessor from 1980 used 80-bit floating point for its internal registers. The new trend in ML, and LLMs especially, is going even lower than 16 bits: 8-bit quantization from the bitsandbytes library is what first allowed LoRA fine-tuning of LLaMA on consumer GPUs. Then came 4-bit with llama.cpp, GPTQ, and others. Now we're going below 4 bits as well.
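Just to put rough numbers on the memory savings, here's a back-of-the-envelope sketch for a 7B-parameter model (weights only; it ignores activations, the KV cache, and the scale/zero-point metadata that real quantization formats add):

```python
# Rough weights-only memory estimate for a 7B-parameter model at
# different precisions. Real formats carry extra per-block metadata,
# so treat these as lower bounds.
PARAMS = 7_000_000_000

for name, bits in [("F32", 32), ("F16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:>9}: {gib:6.1f} GiB")

# Prints roughly: F32 ~26.1 GiB, F16/BF16 ~13.0 GiB,
# INT8 ~6.5 GiB, 4-bit ~3.3 GiB for the weights alone.
```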
-
@soleblaze - very interesting question! There's a conversation in this repo about benchmarking llama.cpp releases to monitor overall performance of the codebase. But what I haven't yet seen is discussion of how different hardware, and different aspects of that hardware (e.g. memory bandwidth, as you mentioned), affect overall LLM engine inference performance. (It sure would be fun to have a couple of million dollars to set up a lab to test that.) The idea of a "Passmark / Geekbench for LLMs" has crossed my mind... and was quickly filed somewhere between "too hard" and "way outta my league". Someone needs to scratch that itch! Good luck?! For benchmarking LZ77/LZSS/LZMA compressors, [lzbench](https://github.com/inikep/lzbench) compiles ALL the compression algorithms into a single binary and runs/tests each in succession on the same hardware, thus eliminating ALL variables other than the test subjects - the compression algos. Not the question you are asking... but perhaps a way to compare LLM engines/runners.
-
What I was thinking of is a set of standard benchmarks for hardware performance, plus benchmarks run against arbitrary models so performance can be compared between them. Output that as JSON and allow it to be added to a website. I’m not quite sure how to deal with people uploading false results, and I don’t want to maintain a web server, so I’m currently leaning towards benchmark submissions via PR. I currently have a basic PoC in Python using llama-cpp-python. While it looked like I could clear data between model loads, llama kept thinking the old model was still in memory and eventually crashed. There’s also some concern that I’m benchmarking the Python library more than llama.cpp itself. I hate doing popens, but I might switch to that.
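For the popen route, a minimal sketch of what that could look like. The binary path, model filename, and flags below are placeholders for whatever your llama.cpp build actually exposes, and the timing is just wall-clock around the whole run:

```python
import json
import subprocess
import time
from pathlib import Path

def run_benchmark(binary: str, model: str, prompt: str, n_predict: int = 128) -> dict:
    """Run one generation in a fresh process and record wall-clock time.

    Spawning a new process per model sidesteps the "model never really
    unloads" problem and measures llama.cpp itself rather than the
    Python bindings. The flags here are placeholders - adjust to your build.
    """
    cmd = [binary, "-m", model, "-p", prompt, "-n", str(n_predict)]
    start = time.perf_counter()
    subprocess.run(cmd, capture_output=True, text=True, check=True)
    elapsed = time.perf_counter() - start
    return {
        "model": Path(model).name,
        "n_predict": n_predict,
        "wall_seconds": round(elapsed, 3),
        "approx_tokens_per_second": round(n_predict / elapsed, 2),
    }

if __name__ == "__main__":
    results = [run_benchmark("./main", m, "Hello") for m in ["models/7b-q4_0.gguf"]]
    # A JSON file on disk is easy to commit in a PR-based submission workflow.
    Path("results.json").write_text(json.dumps(results, indent=2))
```

Parsing the timing lines llama.cpp prints at the end of a run would be more accurate than wall-clock, but this is enough to produce a JSON artifact people can submit via PR.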
-
I’m currently working on a program to benchmark hardware that’s geared towards local LLMs. Right now I’m working on measuring token throughput for different models. I’d like to add more generic benchmarks as well.
What aspects of hardware performance do we care about when it comes to LLMs? So far I have:
What else do y’all think I should be focusing on?
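Memory bandwidth seems like an obvious candidate for the generic side, since token generation is largely bandwidth-bound. A rough sketch of a STREAM-style copy test with numpy (this measures numpy's copy path, not a tuned STREAM kernel, so treat the result as a lower bound):

```python
import time
import numpy as np

def copy_bandwidth_gbs(n_bytes: int = 1 << 30, repeats: int = 5) -> float:
    """Crude STREAM-style copy test: bytes read + bytes written per second."""
    src = np.ones(n_bytes, dtype=np.uint8)
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.copyto(dst, src)
        best = min(best, time.perf_counter() - start)
    return (2 * n_bytes) / best / 1e9  # count read + write traffic

if __name__ == "__main__":
    print(f"~{copy_bandwidth_gbs():.1f} GB/s sustained copy bandwidth")
```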