in situ auto-Frankenmerges #4718
Comments
Also, if you're interested in implementing other ideas for "maximizing compute" on a single model: I'm interested in seeing what happens when you iteratively compute the same layer multiple times, but weight each pass's change to the hidden state proportionally. For example, doing 4 passes of each hidden layer where each pass contributes only 0.25x of its change to the hidden state, and so on.
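Purely as an illustration of the idea above, here is a minimal sketch (not llama.cpp code; `repeated_pass` and `layer_fn` are hypothetical names, and the "layer" is a toy stand-in for a real decoder block) of applying the same block several times while scaling each pass's contribution to the residual stream by 1/n_pass:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Apply the same "layer" n_pass times, scaling each pass's contribution to the
// hidden state by 1/n_pass (e.g. 0.25 for 4 passes). layer_fn is assumed to
// return the *change* it wants to make to the hidden state.
std::vector<float> repeated_pass(
        const std::vector<float> & hidden,
        const std::function<std::vector<float>(const std::vector<float> &)> & layer_fn,
        int n_pass) {
    std::vector<float> h = hidden;
    const float scale = 1.0f / static_cast<float>(n_pass);
    for (int p = 0; p < n_pass; ++p) {
        const std::vector<float> delta = layer_fn(h);
        for (size_t i = 0; i < h.size(); ++i) {
            h[i] += scale * delta[i];   // dampened update, as proposed above
        }
    }
    return h;
}

int main() {
    std::vector<float> h = {1.0f, 2.0f, 3.0f};
    // Toy "layer": the real thing would be a full decoder block.
    auto layer = [](const std::vector<float> & x) {
        std::vector<float> d(x.size());
        for (size_t i = 0; i < x.size(); ++i) d[i] = 0.5f * x[i];
        return d;
    };
    h = repeated_pass(h, layer, /*n_pass =*/ 4);
    return 0;
}
```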
We can implement a tool similar to … Regarding the evaluation of a single layer multiple times, I think we can add a general-purpose solution via an optional integer array in the GGUF metadata that specifies the indices of the layers to be evaluated. This way, the layer loop:

    for (int il = 0; il < n_layer; ++il) {

would become:

    for (int iil = 0; iil < n_layer; ++iil) {
        const int il = model.layer_order ? model.layer_order[iil] : iil;

This would be general enough to implement any kind of layer repetition and would be flexible to re-configure via the KV overrides.
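To make the suggested mechanism concrete, here is a small self-contained sketch (the names `layer_order`, `n_layer`, and the loop body are placeholders, not actual llama.cpp symbols) of how such an array could drive the layer loop, including repetition:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n_layer = 8;

    // Empty vector = default order; otherwise indices into the real layers.
    // Repetition and reordering are both allowed, e.g. evaluate layers 2..5 twice:
    const std::vector<int> layer_order = {0, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7};

    const int n_eval = layer_order.empty() ? n_layer : (int) layer_order.size();
    for (int iil = 0; iil < n_eval; ++iil) {
        const int il = layer_order.empty() ? iil : layer_order[iil];
        // Real code would build the graph for decoder block `il` here.
        printf("slot %2d -> layer %d\n", iil, il);
    }
    return 0;
}
```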
@semiring Do you still have interest in pursuing this concept? It would be interesting to get a smaller LoRA adapter for each fine-tuned model and apply it at inference time to save VRAM instead of loading redundant layers into memory.
@kalomaze I don't have the cycles to work on a properly-engineered solution for this right now; if you're interested, please go ahead!
I created a branch with it at https://github.com/xhedit/llama.cpp/tree/xhedit-layer-order. I added a std::string for the param to llama.h and as a result, test-c fails to build. I deleted the test from Makefile in my branch, so it's not suitable for merging until I come up with a way to keep llama.h compatible with C.
Gives me a decent reply.
Gives the same result as
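On the llama.h C-compatibility point above: one possible direction (a sketch under my own assumptions, not the actual llama.cpp API; all field names are hypothetical) would be to expose the layer order through plain C types instead of a std::string:

```cpp
#include <stdint.h>

// Hypothetical, C-compatible parameter block: llama.h must stay valid C,
// so the order is passed as a pointer + count (or a plain C string) rather
// than a std::string.
struct hypothetical_model_params {
    // ... existing fields ...
    const int32_t * layer_order;    // NULL = keep the default layer order
    uint32_t        n_layer_order;  // number of entries in layer_order
    // alternative: const char * layer_order_str;  // e.g. "0,1,2,1,2,3"
};
```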
That's super interesting; merge isn't super easy to use.
Simply changing the order in the layer loop …
@ggerganov Is this being pursued? I started to try to do a GGUF merge with gguf.py, but I immediately hit:

Working directly on quantised models seems to make the most sense, as probably no one will be running large merged models at F16.
I think there is some work started in #5741.

Regarding the error, I think you are using …

Implementing this in C using …
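For context on what a C-level implementation could look like, here is a rough, untested sketch using the gguf C API from ggml: a plain 1:1 copy of a quantized GGUF, where a real merge tool would duplicate or reorder the per-block tensors at the indicated point. The header location and exact integer widths of the gguf functions have shifted over time, so treat this as an assumption-laden outline rather than a drop-in tool:

```cpp
#include <stdint.h>
#include <stdio.h>

#include "ggml.h"   // older trees declare the gguf API here; newer ones use "gguf.h"

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s input.gguf output.gguf\n", argv[0]);
        return 1;
    }

    struct ggml_context * ctx_data = NULL;

    struct gguf_init_params params = {
        /*.no_alloc =*/ false,       // load tensor data so it can be written back out
        /*.ctx      =*/ &ctx_data,
    };

    struct gguf_context * ctx_in = gguf_init_from_file(argv[1], params);
    if (!ctx_in) {
        fprintf(stderr, "failed to open %s\n", argv[1]);
        return 1;
    }

    struct gguf_context * ctx_out = gguf_init_empty();
    gguf_set_kv(ctx_out, ctx_in);    // copy all metadata key/value pairs

    const int64_t n_tensors = gguf_get_n_tensors(ctx_in);
    for (int64_t i = 0; i < n_tensors; ++i) {
        const char * name = gguf_get_tensor_name(ctx_in, (int) i);
        struct ggml_tensor * t = ggml_get_tensor(ctx_data, name);
        // A merge tool would decide here whether to copy, skip, or add a renamed
        // duplicate of this tensor (e.g. under a new blk.N prefix); this sketch
        // simply copies everything unchanged.
        gguf_add_tensor(ctx_out, t);
    }

    gguf_write_to_file(ctx_out, argv[2], /*only_meta =*/ false);

    gguf_free(ctx_out);
    gguf_free(ctx_in);
    ggml_free(ctx_data);
    return 0;
}
```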
I used exllamaV2 for layer merging so far. The issue is that when the model shares weights across duplicated layers, there is still a KV cache for every layer, including the duplicates, and the model flow might have to bounce back and forth between the cards. For exllamav2, Python is great, as you can dynamically modify the layers, with just a quick cache rebuild after a modification. I don't think it makes sense to do that in C++ for inference.
Feature Description
Modify llama.cpp to support on-the-fly "Frankenmerging" of the model in memory with itself.
Motivation
Frankenmerges, including auto-Frankenmerges, are becoming increasingly popular and appear to have properties that merit further study; it's Rich Sutton's "bitter lesson" in the small: stacking more decoder blocks means a greater total amount of computation in a single inference pass and, perhaps surprisingly, under the right circumstances, that greater accessible computation outweighs the 'noise' induced by performing fairly brutal surgery on the order of decoder blocks.
Right now, experimentation is taking place at the level of building new models with mergekit, which is slow. The ability to mix and match decoder blocks on the fly in llama.cpp would speed up iteration and experimentation, helping us better understand the tradeoff between the greater net computation made available and the noise induced by decoder surgery.
Possible Implementation
Something like this:
https://github.com/semiring/IRL-llama.cpp/blob/master/llama.cpp#L4346
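Building on the layer_order idea from the discussion above, a typical self-Frankenmerge could then be expressed as data rather than as a rebuilt model. The ranges below are purely illustrative (a hypothetical 32-block model with two overlapping slices), not taken from any particular merge recipe:

```cpp
#include <vector>

// Illustrative only: stack 48 block evaluations out of 32 real sets of weights
// by concatenating two overlapping slices, 0..23 and 8..31.
std::vector<int> make_self_merge_order() {
    std::vector<int> order;
    for (int il = 0; il <= 23; ++il) order.push_back(il);  // first slice
    for (int il = 8; il <= 31; ++il) order.push_back(il);  // overlapping second slice
    return order;  // 48 entries referencing the model's 32 actual layers
}
```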