in situ auto-Frankenmerges #4718
Comments
Also, if you're interested in implementing other ideas for "maximizing compute" on a single model: I'm interested in seeing what happens when you iteratively compute the same layer multiple times, but weight each pass's change to the hidden state proportionally. For example, doing 4 passes of each hidden layer where each pass contributes only 0.25x of its change to the hidden state, and so on.
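Purely as an illustration of the idea above, here is a minimal sketch (not llama.cpp code; `repeated_pass` and `layer_fn` are hypothetical names, and the "layer" is a toy stand-in for a real decoder block) of applying the same block several times while scaling each pass's contribution to the residual stream by 1/n_pass:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Apply the same "layer" n_pass times, scaling each pass's contribution to the
// hidden state by 1/n_pass (e.g. 0.25 for 4 passes). layer_fn is assumed to
// return the *change* it wants to make to the hidden state.
std::vector<float> repeated_pass(
        const std::vector<float> & hidden,
        const std::function<std::vector<float>(const std::vector<float> &)> & layer_fn,
        int n_pass) {
    std::vector<float> h = hidden;
    const float scale = 1.0f / static_cast<float>(n_pass);
    for (int p = 0; p < n_pass; ++p) {
        const std::vector<float> delta = layer_fn(h);
        for (size_t i = 0; i < h.size(); ++i) {
            h[i] += scale * delta[i];   // dampened update, as proposed above
        }
    }
    return h;
}

int main() {
    std::vector<float> h = {1.0f, 2.0f, 3.0f};
    // Toy "layer": the real thing would be a full decoder block.
    auto layer = [](const std::vector<float> & x) {
        std::vector<float> d(x.size());
        for (size_t i = 0; i < x.size(); ++i) d[i] = 0.5f * x[i];
        return d;
    };
    h = repeated_pass(h, layer, /*n_pass =*/ 4);
    return 0;
}
```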
We can implement a tool similar to … Regarding the evaluation of a single layer multiple times, I think we can add a general-purpose solution via an optional integer array in the GGUF metadata that specifies the indices of the layers to be evaluated. This way, the layer loop:

    for (int il = 0; il < n_layer; ++il) {

would become:

    for (int iil = 0; iil < n_layer; ++iil) {
        const int il = model.layer_order ? model.layer_order[iil] : iil;

This would be general enough to implement any kind of layer repetition and would be flexible to re-configure via the KV overrides.
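To make the suggested mechanism concrete, here is a small self-contained sketch (the names `layer_order`, `n_layer`, and the loop body are placeholders, not actual llama.cpp symbols) of how such an array could drive the layer loop, including repetition:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n_layer = 8;

    // Empty vector = default order; otherwise indices into the real layers.
    // Repetition and reordering are both allowed, e.g. evaluate layers 2..5 twice:
    const std::vector<int> layer_order = {0, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7};

    const int n_eval = layer_order.empty() ? n_layer : (int) layer_order.size();
    for (int iil = 0; iil < n_eval; ++iil) {
        const int il = layer_order.empty() ? iil : layer_order[iil];
        // Real code would build the graph for decoder block `il` here.
        printf("slot %2d -> layer %d\n", iil, il);
    }
    return 0;
}
```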
@semiring Do you still have interest in pursuing this concept? It would be interesting to get a smaller LoRA adapter for each fine-tuned model and apply it at inference time to save VRAM instead of loading redundant layers into memory.
@kalomaze I don't have the cycles to work on a properly-engineered solution for this right now; if you're interested, please go ahead!
I created a branch with it at https://github.com/xhedit/llama.cpp/tree/xhedit-layer-order. I added a std::string for the param to llama.h and as a result, test-c fails to build. I deleted the test from Makefile in my branch, so it's not suitable for merging until I come up with a way to keep llama.h compatible with C.
Gives me a decent reply.
Gives the same result as
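On the llama.h C-compatibility point above: one possible direction (a sketch under my own assumptions, not the actual llama.cpp API; all field names are hypothetical) would be to expose the layer order through plain C types instead of a std::string:

```cpp
#include <stdint.h>

// Hypothetical, C-compatible parameter block: llama.h must stay valid C,
// so the order is passed as a pointer + count (or a plain C string) rather
// than a std::string.
struct hypothetical_model_params {
    // ... existing fields ...
    const int32_t * layer_order;    // NULL = keep the default layer order
    uint32_t        n_layer_order;  // number of entries in layer_order
    // alternative: const char * layer_order_str;  // e.g. "0,1,2,1,2,3"
};
```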
That's super interesting; merge isn't super easy to use.
Simply changing the order in the layer loop …
@ggerganov Is this being pursued? I started to try to do a GGUF merge with gguf.py, but I immediately hit:

Working directly on quantised models seems to make the most sense, as probably no one will be running large merged models at F16.
I think there is some work started in #5741.

Regarding the error, I think you are using …

Implementing this in C using …
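For context on what a C-level implementation could look like, here is a rough, untested sketch using the gguf C API from ggml: a plain 1:1 copy of a quantized GGUF, where a real merge tool would duplicate or reorder the per-block tensors at the indicated point. The header location and exact integer widths of the gguf functions have shifted over time, so treat this as an assumption-laden outline rather than a drop-in tool:

```cpp
#include <stdint.h>
#include <stdio.h>

#include "ggml.h"   // older trees declare the gguf API here; newer ones use "gguf.h"

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s input.gguf output.gguf\n", argv[0]);
        return 1;
    }

    struct ggml_context * ctx_data = NULL;

    struct gguf_init_params params = {
        /*.no_alloc =*/ false,       // load tensor data so it can be written back out
        /*.ctx      =*/ &ctx_data,
    };

    struct gguf_context * ctx_in = gguf_init_from_file(argv[1], params);
    if (!ctx_in) {
        fprintf(stderr, "failed to open %s\n", argv[1]);
        return 1;
    }

    struct gguf_context * ctx_out = gguf_init_empty();
    gguf_set_kv(ctx_out, ctx_in);    // copy all metadata key/value pairs

    const int64_t n_tensors = gguf_get_n_tensors(ctx_in);
    for (int64_t i = 0; i < n_tensors; ++i) {
        const char * name = gguf_get_tensor_name(ctx_in, (int) i);
        struct ggml_tensor * t = ggml_get_tensor(ctx_data, name);
        // A merge tool would decide here whether to copy, skip, or add a renamed
        // duplicate of this tensor (e.g. under a new blk.N prefix); this sketch
        // simply copies everything unchanged.
        gguf_add_tensor(ctx_out, t);
    }

    gguf_write_to_file(ctx_out, argv[2], /*only_meta =*/ false);

    gguf_free(ctx_out);
    gguf_free(ctx_in);
    ggml_free(ctx_data);
    return 0;
}
```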
I used exllamaV2 for layer merging so far. The issue is that when the model shares weights across duplicated layers, there is still a KV cache for every layer, including the duplicates, and the model flow might have to bounce back and forth between the cards. For exllamav2, Python is great, as you can dynamically modify the layers, with just a quick cache rebuild after a modification. I don't think it makes sense to do that in C++ for inference.
Feature Description
Modify llama.cpp to support on-the-fly "Frankenmerging" of the model in memory with itself.
Motivation
Frankenmerges, including auto-Frankenmerges, are becoming increasingly popular and appear to have properties that merit further study; it's Rich Sutton's "bitter lesson" in the small: stacking more decoder blocks means a greater total amount of computation in a single inference pass and, perhaps surprisingly, under the right circumstances, that greater accessible computation outweighs the 'noise' induced by performing fairly brutal surgery on the order of decoder blocks.
Right now, experimentation is taking place at the level of building new models with mergekit, which is slow. The ability to mix and match decoder blocks on the fly in llama.cpp would speed up iteration and experimentation, helping us better understand the tradeoff between the greater net computation made available and the noise induced by decoder surgery.
Possible Implementation
Something like this:
https://github.com/semiring/IRL-llama.cpp/blob/master/llama.cpp#L4346
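Building on the layer_order idea from the discussion above, a typical self-Frankenmerge could then be expressed as data rather than as a rebuilt model. The ranges below are purely illustrative (a hypothetical 32-block model with two overlapping slices), not taken from any particular merge recipe:

```cpp
#include <vector>

// Illustrative only: stack 48 block evaluations out of 32 real sets of weights
// by concatenating two overlapping slices, 0..23 and 8..31.
std::vector<int> make_self_merge_order() {
    std::vector<int> order;
    for (int il = 0; il <= 23; ++il) order.push_back(il);  // first slice
    for (int il = 8; il <= 31; ++il) order.push_back(il);  // overlapping second slice
    return order;  // 48 entries referencing the model's 32 actual layers
}
```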