model test request #20
Very cool! I'll try to put it through its paces today. Can I ask what the merge params were? |
Btw are you able to upload the full weights? Lm-eval (what I use for the MAGI metric) is super slow through llama.cpp |
No full weights. I figured out how to merge models directly from EXL2 to avoid requantization; this can also be done dynamically (this is my reddit post on it), but it needs a modified version of ooba. I am also working together with a llama.cpp developer on a PR to do the same for GGUF. I don't have the full weights because I tried over 6000 permutations, and without dynamic merges that would have been months of compute 😅 This particular merge gives me an average score of almost 84 using dynamic merging of EXL2. The static merges of EXL2 and GGUF give ~83.4. Not sure why it's slightly lower, I need to look into it. But when I started building the GGUF merging pipeline, I discovered the backend issue #16. If full weights are needed, I can try to make some, but I think it will take a few hours to download, merge and upload to Huggingface. Before I start that, would using the EXL2 version be faster? https://huggingface.co/Infinimol/miiqu-exl2 |
Fair enough. Oh, that's cool that you can do dynamic merges! I don't think the eleuther eval harness supports EXL2. With gguf it said it was going to take 73 hours (on an A100), so that's not gonna work :( If you are able to make the full weights, I'm sure others would make use of them as well if your model gets popular. |
OK, if it takes 73 hours, then I'll make them for you :) Did you get the eq-eval test done? |
Yeah it scored 83.17 with chatml via ooba. Do you think it would score higher if I used EXL2? |
I'm still trying to figure out why I get differences. I built my own evaluation system, and maybe it's something useful for eq-bench v3. It works like this: I use a descriptive rating, i.e. for your system: Then, I run the LLM and only select the logits for the symbols 0-9. These 10 logit values can then be converted to probabilities that sum to 1 over just these values. Lastly, you can use these probabilities as weights to get a continuous score. |
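A minimal sketch of the scoring step described above, assuming the ten raw logits for the tokens "0" to "9" have already been pulled out of the model's output. The function name `expected_score` is made up for illustration and is not taken from the actual evaluation code:

```python
import math


def expected_score(digit_logits):
    """Turn the logits for the tokens '0'..'9' into one continuous score.

    digit_logits: sequence of 10 raw logits, index i belonging to digit i.
    (Hypothetical helper; how the logits are extracted depends on the backend.)
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly 10 logits, one per digit 0-9")

    # Softmax restricted to the ten digit tokens, so the probabilities
    # sum to 1 over just these values. Subtracting the max avoids overflow.
    peak = max(digit_logits)
    exps = [math.exp(x - peak) for x in digit_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Probability-weighted average of the digit values.
    return sum(digit * p for digit, p in enumerate(probs))
```

With equal logits for every digit this gives the midpoint 4.5, and a model that is nearly certain of one digit gives a score close to that digit.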
It might do, but it's still lower than my dynamic merges. I will put my fork of exllamaV2 up and add in the integration for EQ-Bench. Then you can test on that; it's all done, but I figured it's too custom a case to make a PR. The huge benefit is that it reuses GPU weights! i.e. if you can run a 70B model, you can also run the 120B self-merge, as the weights are reused :) |
Oh I like that approach. Is that something you added to the eq-bench questions, or another test? I did try converting the test to work with eleuther harness using logprobs evaluation, which is more or less what you are describing except the targets were just the raw numbers like [emotion]: 0, etc. It didn't work very well. Maybe the text explanation plus your score aggregation method would produce better results.
That's pretty amazing. How far do you think this could be pushed, in terms of scaling up the layer reuse? If it only increases computation time of inference I would be curious to just crank up the layer count and see what happens. |
This was for my own test, but it would be straightforward to add to eq-bench on backends that support logits.
Yes, this only works if you add a description of the values. Otherwise, the model doesn't know the ranges, and the score ranges get compressed.
I've been trying this a lot, you generally can change creativity, but only a few merge variants (<0.1%) increased eq-bench scores so far. |
@sqrkl OK, f16 is up now, at: https://huggingface.co/Infinimol/miiqu-f16 Very curious to see how it does at MAGI... Can we really make models smarter without fine tuning? |
It seems at the very least you didn't make it less smart, which seems to be a difficult thing with these frankenmerges. It scored pretty much exactly the same as Miqu. I ran it through the creative writing benchmark as well; it scored 65.5 compared to 69.5 for mistral-medium. It was a bit prone to hallucinations with some of the prompts, which is probably why it scored lower there. |
I noticed you need to lower the temperature a lot to chat with it, but then it seems much better than miqu. My 'hands on' testing is starting with a character card in ooba, and getting the model to write both sides of the conversation. With the model merges, I found it's necessary to raise minP and lower the temp to about 0.3. Most model merges degenerate into giggling fits. One character makes a weird mistake (repeats itself or leaves out spaces between words), and the other character finds that so hilarious that they both end up laughing hysterically. In some other merges, the characters start alliterating or using huge streams of adjectives in every sentence (e.g. the gorgeous, greenish, graceful goose leapt loosely, lively, lovingly like the splendid, soulful shining sun...) But with lower temperature, this model is much more creative than base Miqu, while staying in character. I will give you the best settings later today. |
That's hilarious. Sounds like merging makes for an intoxicated miqu. People seem to like these big merges for writing more than anything else (after all they tend to do worse on benchmarks). I'm hoping the creative writing benchmark can capture this thing that other benchmarks seem to miss. |
Yes, it has a very 'stoned' vibe sometimes! Where the repeats come from in the model changes how the model behaves. The infinimol/miiqu model has the least 'personality' change but improved the EQ-Bench score. Other merges increased writing creativity. I plan on mixing the merges once I get some more compute.
Yep, it needs a lower temperature to stay on track. Will you do a write-up for MAGI? I don't plan on fine-tuning this model but only work with self-merges for the time being. So I would like to evaluate using MAGI before I submit another model. |
Anyway, I'm still pretty happy, highest opensource model without finetuning! I might write a short paper up on the topic, referring to this result. The EQ-Bench score was lower than in my tests. Using dynamic merges in ooba with 4 repeats, I scored: [83.36, 84.42, 83.92, 84.3]. I'll investigate why the scores are lower. Maybe my merging code messed something up for the static merge. |
That's definitely a success! It's evidently not easy to create merges without being destructive to reasoning (and benchmarks). I'm happy to redo the eq-bench score; I just want to make sure it's reproducible to people who are downloading weights, i.e. not having to use non-standardised inferencing methods (like dynamic merges). I do actually have a write-up of the MAGI subset in the works. If you are inclined I would value any feedback on it (or otherwise just read at your leisure): https://docs.google.com/document/d/1A2KTDHXX7Qyuwd5HBKiZl0Pg7QpIOMRqKAZg94l3FbI/edit |
That is super cool. Is this fork public? I know some people who do a lot of merges who would be interested in playing with this. |
Not yet, I'll tidy up and push it this week, if you want to play with it.
Had a look, seems very relevant! Any chance you could go further, and generate an absolute minimal number of tests for a first pass? i.e. 10 tests where, if you pass more than five, you are not totally useless? That would be very useful for sorting through thousands of models that get 'dumber' with merging. |
Mm maybe. Tinybenchmarks is doing this: https://arxiv.org/abs/2402.14992 It may be possible to push it even further towards an ultra small test set which is maximally discriminative. But there are big tradeoffs in reducing the test set down to this level. But yeah, if you just want a quick indicator, tinybenchmarks is a good place to start. |
@sqrkl Just wrote you by email about this thread! For dynamic merges, pull my repo: https://github.com/dnhkng/text-generation-webui Load the model with Exllamav2, not Exllamav2_HF, and once it is loaded, you can set the section of layers you want repeated. You set the 'start' and 'stop' positions and the number of 'repeats', and the model will generate a merged model of the form:
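A rough illustration of what such a repeat could look like, assuming the block of layers from `start` up to (but not including) `stop` is simply played `repeats` times in place. The helper `repeated_layer_order` and the half-open indexing are assumptions for this sketch; the actual ordering used by the fork may differ:

```python
def repeated_layer_order(num_layers, start, stop, repeats):
    """Return the order in which layer indices would be executed.

    Hypothetical helper: layers [start, stop) are run `repeats` times,
    everything before and after runs once. Only indices are produced,
    so repeated entries point at the same weights rather than copies.
    """
    if not (0 <= start < stop <= num_layers):
        raise ValueError("need 0 <= start < stop <= num_layers")
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    head = list(range(0, start))              # layers before the block
    block = list(range(start, stop))          # the repeated section
    tail = list(range(stop, num_layers))      # layers after the block
    return head + block * repeats + tail


# Example: a 6-layer toy model with layers 2 and 3 run twice.
order = repeated_layer_order(num_layers=6, start=2, stop=4, repeats=2)
```

Because the result is only a list of indices into the existing layers, a longer self-merge built this way needs no extra weight memory, only more compute per token.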
|
So cool! I'm going to try this out. |
Oh and do you mind if I share this? |
Not sure if you are familiar with this: https://github.com/huggingface/hf_transfer Very handy for 10x download speeds. Can you give some tips on using LM_eval_harness? I'm not sure how you used it for MAGI, e.g. did you use bitsandbytes? Could you share the parameters to start the tests? |
Yep love using hf_transfer to max out the bandwidth of those runpods. :) lm-eval can be a bit fiddly to get working. Here's the list of stuff I paste into a runpod to get it working from scratch:
And the lm-eval command:
You can load in 8-bit or full weights if you have the vram for it. The --log_samples saves the full output of the test including all the model's answers. Unnecessary for most uses but I find it's helpful for debugging sometimes. The sqlite cache allows the test to retry if it fails. Batch size: I usually start by setting it to auto:9, which means it will recalculate the optimal batch size 9 times as it goes along. Since lm-eval orders the test set by size (largest first), the max batch size will start out small and get bigger, so if it's going to be a long eval time it pays to have the batch size automatically resize. The downside is that it often gets it wrong and you end up with out of memory errors. In that case you have to set it manually and use trial & error. I've read that batch sizes > 1 can affect score negatively, but at least in my limited comparisons it's been negligible. Running lm-eval with batch size 1 is suuuuper slow. You can run it with llama.cpp server, but it's slow because it forces batch size 1. But that might be fine if it's running locally and you can just leave it overnight. The command looks something like this:
|
Now that the llama.cpp server is running correctly, would it be possible to have this model tested?
https://huggingface.co/Infinimol/miiqu-gguf
using ChatML format and context length >= 1024, please :)
It is a model I have been working on for some time, and I think it's interesting. It is not a fine-tune but a merge, and I find it consistently scores higher than the base model (miqu), which I think is a first for a pure merge model. Eq-bench runs in about 15 mins on an A100.
The model is GGUF, split to fit under the 50 GB limit on Huggingface; the model card gives the one-liner to reassemble the file.
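The one-liner on the model card is the authoritative way to rebuild the file. As a generic sketch only, assuming the parts are plain byte-wise splits that just need to be concatenated in order, something like the following would do the same job. The file names in the usage comment are placeholders, not the real names from the repository:

```python
import shutil
from pathlib import Path


def join_parts(part_paths, out_path):
    """Concatenate split files, in the order given, into one output file.

    Assumes a plain byte-wise split; returns the size of the result in bytes.
    """
    with open(out_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Streams in chunks, so large parts never sit fully in RAM.
                shutil.copyfileobj(src, out)
    return Path(out_path).stat().st_size


# Usage with placeholder names (check the model card for the real ones):
# parts = sorted(Path(".").glob("model.gguf-split-*"))
# join_parts(parts, "model.gguf")
```

Sorting the part names before joining matters, since concatenating them out of order produces a file of the right size with the wrong contents.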