-
F16 was the focus for ML for a long time; both it and BF16 got hardware support in GPUs for exactly this reason. It cuts memory use in half, and computation can also be faster. It still isn't well supported by CPUs; for decades the assumption was that we'd move from F32 to F64 for ever more precision. Even the old Intel 8087 coprocessor from 1980 used 80-bit floating point for its internal registers. The new trend in ML, and LLMs especially, is going even lower than 16 bits: 8-bit quantization from the bitsandbytes library is what first allowed LoRA fine-tuning of LLaMA on consumer GPUs. Then came 4-bit with llama.cpp, GPTQ, and others. Now we're going below 4 bits as well.
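Just to put rough numbers on the memory savings, here's a back-of-the-envelope sketch for a 7B-parameter model (weights only; it ignores activations, the KV cache, and the scale/zero-point metadata that real quantization formats add):

```python
# Rough weights-only memory estimate for a 7B-parameter model at
# different precisions. Real formats carry extra per-block metadata,
# so treat these as lower bounds.
PARAMS = 7_000_000_000

for name, bits in [("F32", 32), ("F16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:>9}: {gib:6.1f} GiB")

# Prints roughly: F32 ~26.1 GiB, F16/BF16 ~13.0 GiB,
# INT8 ~6.5 GiB, 4-bit ~3.3 GiB for the weights alone.
```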
-
@soleblaze - very interesting question! There's a conversation in this repo about benchmarking llama.cpp releases to monitor overall performance of the codebase. But what I haven't yet seen is discussion of how different hardware, and different aspects of that hardware (e.g. memory bandwidth, as you mentioned), affect overall LLM engine inference performance. (It sure would be fun to have a couple of million dollars to set up a lab to test that.) The idea of a "Passmark / Geekbench for LLMs" has crossed my mind... and was quickly filed somewhere between "too hard" and "way outta my league". Someone needs to scratch that itch! Good luck?! For benchmarking LZ77/LZSS/LZMA compressors, [lzbench](https://github.com/inikep/lzbench) compiles ALL the compression algorithms into a single binary and runs/tests each in succession on the same hardware, thus eliminating ALL variables other than the test subjects - the compression algos. Not the question you are asking... but perhaps a way to compare LLM engines/runners.
-
What I was thinking of is a set of standard benchmarks for hardware performance, plus benchmarks run against arbitrary models so performance can be compared between them. Output that as JSON and allow it to be added to a website. I’m not quite sure how to deal with people uploading false results, and I don’t want to maintain a web server, so I’m currently leaning towards benchmark submissions via PR. I currently have a basic PoC in Python using llama-cpp-python. While it looked like I could clear data between model loads, llama kept thinking the old model was still in memory and eventually crashed. There’s also some concern that I’m benchmarking the Python library more than llama.cpp itself. I hate doing popens, but I might switch to that.
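For the popen route, a minimal sketch of what that could look like. The binary path, model filename, and flags below are placeholders for whatever your llama.cpp build actually exposes, and the timing is just wall-clock around the whole run:

```python
import json
import subprocess
import time
from pathlib import Path

def run_benchmark(binary: str, model: str, prompt: str, n_predict: int = 128) -> dict:
    """Run one generation in a fresh process and record wall-clock time.

    Spawning a new process per model sidesteps the "model never really
    unloads" problem and measures llama.cpp itself rather than the
    Python bindings. The flags here are placeholders - adjust to your build.
    """
    cmd = [binary, "-m", model, "-p", prompt, "-n", str(n_predict)]
    start = time.perf_counter()
    subprocess.run(cmd, capture_output=True, text=True, check=True)
    elapsed = time.perf_counter() - start
    return {
        "model": Path(model).name,
        "n_predict": n_predict,
        "wall_seconds": round(elapsed, 3),
        "approx_tokens_per_second": round(n_predict / elapsed, 2),
    }

if __name__ == "__main__":
    results = [run_benchmark("./main", m, "Hello") for m in ["models/7b-q4_0.gguf"]]
    # A JSON file on disk is easy to commit in a PR-based submission workflow.
    Path("results.json").write_text(json.dumps(results, indent=2))
```

Parsing the timing lines llama.cpp prints at the end of a run would be more accurate than wall-clock, but this is enough to produce a JSON artifact people can submit via PR.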
-
I’m currently working on a program to benchmark hardware that’s geared towards local LLMs. Right now I’m working on measuring token throughput for different models. I’d like to add more generic benchmarks as well.
What aspects of hardware performance do we care about when it comes to LLMs? So far I have:
What else do y’all think I should be focusing on?
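Memory bandwidth seems like an obvious candidate for the generic side, since token generation is largely bandwidth-bound. A rough sketch of a STREAM-style copy test with numpy (this measures numpy's copy path, not a tuned STREAM kernel, so treat the result as a lower bound):

```python
import time
import numpy as np

def copy_bandwidth_gbs(n_bytes: int = 1 << 30, repeats: int = 5) -> float:
    """Crude STREAM-style copy test: bytes read + bytes written per second."""
    src = np.ones(n_bytes, dtype=np.uint8)
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.copyto(dst, src)
        best = min(best, time.perf_counter() - start)
    return (2 * n_bytes) / best / 1e9  # count read + write traffic

if __name__ == "__main__":
    print(f"~{copy_bandwidth_gbs():.1f} GB/s sustained copy bandwidth")
```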