Basically, a suggestion for a tool to automate finding the ideal --rope-freq-base for a given configuration (e.g. context size): effectively an automated llama-perplexity that does a binary search comparing values.
Essentially, my idea is a tool somewhat like llama-perplexity, but meant to automate an otherwise rather painful manual process (though it would still be somewhat manual for simplicity's sake). It would repeatedly try different values (a simple binary search seems to do) and check one chunk of perplexity for each, in order to find the RoPE base that produces the lowest PPL. (As far as I can tell from my tests, testing only the first chunk is sufficient to dial in a value; the user can always run a full perplexity test afterwards.) The reasons llama-perplexity itself isn't ideal for this are, first, that it must load and unload the model over and over for each test (even when only testing one chunk!), which can take quite some time on its own depending on the model, and second, that testing values by hand is a lot of work and it's easy to lose track along the way (I have to keep a notepad handy as I go, writing down numbers, because my terminal is positively filled with text of which I need only one single value). Even fast setups, like smaller models or small contexts running 100% on GPU, can still take quite some time, and it gets very easy to lose track of which RoPE base yielded which value unless one keeps good notes.
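To make the idea concrete, here is a rough sketch (in Python, just to illustrate the logic, not a working tool) of the loop I have in mind. `first_chunk_ppl()` is a hypothetical placeholder for "apply this RoPE base to the already-loaded model and return the perplexity of the first chunk of the test data"; today the manual equivalent is rerunning something like `llama-perplexity -m model.gguf -f wiki.test.raw -c 16384 --chunks 1 --rope-freq-base <value>` for every candidate. I'm reading "binary search" loosely here as narrowing the interval toward the minimum:

```python
def first_chunk_ppl(rope_base: float) -> float:
    # Hypothetical: re-apply rope_base to the already-loaded model and
    # return the PPL of the first chunk of the test data.
    raise NotImplementedError


def find_best_rope_base(lo: float, hi: float, ppl_tolerance: float = 0.1,
                        max_tries=None) -> float:
    """Narrow [lo, hi] toward the RoPE base with the lowest first-chunk PPL,
    printing every value tried so the user can spot anything it misses."""
    tries = 0
    while max_tries is None or tries < max_tries:
        # Probe two interior points and keep the part of the interval that
        # contains the lower PPL (assumes PPL has a single minimum in the
        # range, which matches what I see when testing by hand).
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        p1, p2 = first_chunk_ppl(m1), first_chunk_ppl(m2)
        print(f"rope base {m1:10.1f} -> PPL {p1:.4f}   "
              f"rope base {m2:10.1f} -> PPL {p2:.4f}")
        if p1 < p2:
            hi = m2
        else:
            lo = m1
        tries += 1
        if abs(p1 - p2) < ppl_tolerance:  # the "within 0.1" stopping rule (my reading of it)
            break
    return (lo + hi) / 2.0
```

The important part is that the model would stay loaded between probes, so each step only costs one chunk of evaluation rather than a full load plus evaluation.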
I mean for it to be a relatively simple implementation. Much like with llama-perplexity, you provide a model, test data, a context size (defaulting to the model's own value, though some models are misconfigured so you may still need to override the internal value), a RoPE scale (which should default to 1, since 1 is apparently normally best), and two values giving the range of RoPE bases to test. Finally, perhaps there should also be an optional maximum number of tries (defaulting to continuing until the result converges to within 0.1, though one may not want to wait that long). The key thing is that it would keep everything loaded and keep testing on its own, doing a binary search between the two values specified until it finds the best first-chunk PPL, rather than manually loading over and over and manually putting in different values until finding it oneself. This would save a searcher an enormous amount of time, and since it's basically just a slightly automated version of llama-perplexity, I think it should be really easy to implement (but sadly I am not a programmer). I think it could make finding the ideal value easy enough to also be viable for less tech-savvy users. Ideally it would print the value it finds at each RoPE base on screen so the user can see if it possibly misses something. (It's weird, but the closer you get to the ideal value, the more even a small change away from it makes things worse!)
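For the interface, something along these lines is what I'm picturing. The tool name and flag names below are made up, chosen only to mirror existing llama.cpp options; this is a sketch of the parameters listed above, not a proposal for exact spellings:

```python
# Hypothetical invocation, e.g.:
#   llama-rope-search -m model.gguf -f wiki.test.raw -c 16384 \
#       --rope-base-min 40000 --rope-base-max 80000 --max-tries 12
import argparse

parser = argparse.ArgumentParser(
    description="search for the --rope-freq-base with the lowest first-chunk PPL")
parser.add_argument("-m", "--model", required=True, help="GGUF model to test")
parser.add_argument("-f", "--file", required=True, help="test data (e.g. a wikitext file)")
parser.add_argument("-c", "--ctx-size", type=int, default=None,
                    help="context size; defaults to the model's own value, but misconfigured "
                         "models may need an explicit override")
parser.add_argument("--rope-scale", type=float, default=1.0,
                    help="RoPE scale factor (1 is normally best, so it is the default)")
parser.add_argument("--rope-base-min", type=float, required=True,
                    help="lower end of the rope-freq-base range to search")
parser.add_argument("--rope-base-max", type=float, required=True,
                    help="upper end of the rope-freq-base range to search")
parser.add_argument("--max-tries", type=int, default=None,
                    help="optional cap on iterations; by default keep narrowing until "
                         "the search converges (e.g. to within 0.1)")
args = parser.parse_args()
```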
As a little background, GradientAI recently released a paper suggesting a formula that produces different RoPE values than what most people had been using, with apparently much better results. Several people ran tests on various models and generally got better perplexity, with a few exceptions (most especially Mistral/Solar-style models). While running tests myself, I found that sometimes the ideal value wasn't that either, but often something a little in between or even just plain different; sometimes it even seems to vary by model. Apparently hardware differences may also come into play (though I'm not sure exactly how, since I get the same numbers all-CPU and all-GPU -- both are AMD, however, so it might be more a matter of the libraries than the hardware). Just some examples I've found (these are full PPL tests; first-chunk tests are only good for homing in on a value):
Fimbulvetr (11B) at 16K would use 65536.0 under the old system, which gives me 5.3148 +/- 0.03034. GradientAI's calculation (once we adjusted for the type of model) turned out to be probably 44448.0, which yields 5.1547 +/- 0.02914, indeed a decent improvement, but then we pulled 47661.5 out of the L2 table (L2 @ 12K), which produced 5.1389 +/- 0.02908 -- a noticeably better number still.
But then I get odder ones. I tried Unholy V2 (a LLaMA2 13B): at the old value of 65536.0 it yields 4.8980 +/- 0.02657, but the GradientAI formula seems to calculate 71738.4, which gives me 4.9153 +/- 0.02656 (actually worse this time! Most are better, but this one decided to be finicky). After a lot of testing I found that, on my hardware, 68000 (I typed it in as a joke -- now I can say Unholy runs best on a Motorola) gave me 4.8783 +/- 0.02638, which isn't a huge difference, but when you're extending a model to 16K those smaller differences start to add up (at least IMO). But another 13B I tested...
I don't have full PPL tests for them at the moment, but a couple of LLaMA3 models I've experimented with have yielded their lowest PPLs at RoPE base values of 180500 and 2000000 (GradientAI's formula puts them at 1776948.1, which is at least closer than the old value of 1638400, but again it may also come down to hardware differences). These are 8B or 8B-based models, so I can't speak for big models, but they seem to be very bad at stretching to 16K even though that is only 2x for them (vs 4x for LLaMA2 models), with some yielding PPL values of almost 7 for a Q6_K at the ideal RoPE, and even those that yield good PPL values seem to lose coherence faster (OK, that bit is less objective, but I swear they do), so finding the ideal value is even more necessary with the current generation of models. What's more, MoE models often tend to require different values still, so it's even harder to find the ideal one.