WIP: Add model `merge` example #5741

ngxson · 2024-02-26T22:11:32Z

I don't know if it's a good idea or not.

Still WIP and untested; it would be nice if someone could test it out.

usage: ./merge ./path/model_1 CONFIG1 ./path/model_2 CONFIG2 ./path/output

  CONFIG must be in format: p0-p1,p2-p3,p4,... Example: 0-5,7,8-12
  Optionally, you can specify the scaling for a range of layers, for example: 0-5*0.5,6-7*1. By default, the scale is 0.5. Layer numbering starts from 0.
  The embedding layer of the first model will be used
  NOTE: currently, only F16 model type is supported
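As an illustration of the CONFIG notation above, here is a minimal Python sketch that expands a string such as 0-5*0.5,6-7*1 into a per-layer scale. It is not the PR's C++ parser: the function name `parse_config` is hypothetical, and treating ranges as inclusive is an assumption based on the examples in this thread.

```python
def parse_config(config: str, default_scale: float = 0.5) -> dict[int, float]:
    """Map each layer index to its scale, e.g. "0-5*0.5,6-7*1"."""
    scales: dict[int, float] = {}
    for part in config.split(","):
        part = part.strip()
        if not part:
            continue
        # "0-5*0.5" -> layers "0-5", scale "0.5"; the scale is optional.
        layers, _, scale_text = part.partition("*")
        scale = float(scale_text) if scale_text else default_scale
        first, _, last = layers.partition("-")
        start = int(first)
        end = int(last) if last else start
        for layer in range(start, end + 1):  # assumption: ranges are inclusive
            scales[layer] = scale
    return scales


print(parse_config("0-5*0.5,6-7*1"))
```

A range without an explicit scale, such as 8-12, falls back to the default of 0.5 stated in the usage text.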

sorasoras · 2024-02-27T11:24:51Z

#4718 (comment)

For this PR, I think that in addition to merging two models, it should also add a feature to evaluate a single layer multiple times.
Just reconfigure the same gguf.

ngxson · 2024-02-27T20:41:00Z

@sorasoras Yeah I think I'll try that next. For the moment, I haven't been able to test this PR yet. Also, I plan to start by simply processing layer by layer; that way I don't modify any offsets (and thus make no changes to metadata).

The function that you mentioned requires changing metadata which I haven't yet got time to look into. But definitely something I'll try in the future.

sorasoras · 2024-02-29T08:57:02Z

> @sorasoras Yeah I think I'll try that next. For the moment, I haven't been able to test this PR yet. Also, I plan to start by simply processing layer by layer; that way I don't modify any offsets (and thus make no changes to metadata).
>
> The function that you mentioned requires changing metadata which I haven't yet got time to look into. But definitely something I'll try in the future.

That's fair, but I was thinking that changing metadata is easier to implement and test on existing models.
It's harder to know what works and what doesn't when frankenmerging different models.
Anyway, thanks for the hard work.

dnhkng · 2024-02-29T09:55:59Z

I would be interested in layer interleaving. Is this only for merging layers' weights linearly? Or can it do passthrough?

Also this line is not entirely clear:
CONFIG must be in format: p0-p1,p2-p3,p4,... Example: 0-5,7,8-12
It looks sequential, and only one config is given, so it's not clear what the second model's config should look like.
If one model has: 0-5,7,8-12, what should the config of the other model be? The gaps?

Most frankenmerges for passthrough are done like this:

dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 20]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [10, 30]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [20, 40]
    model: 152334H/miqu-1-70b-sf
...

Can this kind of repeat of blocks be done with this code?

ngxson · 2024-02-29T10:18:18Z

@dnhkng Yeah in fact I have a typo error in 0-5,7,8-12, it should be 0-6,7,8-12

This PR only aims to merge the weight linearly, meaning it does not add or remove any layers to the merged model.

One thing I don't understand in the lazy merge kit format though, can you please clarify it? Does the interleaving mean that some layers are repeated (for example, [0-20] + [10-30] results in [0-10] + [10-20] + [10-20] + [20-30])?

Thank you in advance.

ngxson · 2024-02-29T10:21:42Z

> Yeah in fact I have a typo error in 0-5,7,8-12, it should be 0-6,7,8-12

It's true that the logic for my CONFIG argument is not correct. In fact, it should always be used with the "scale". For example, if I want to take 0-7 from model A and 8-12 from model B:

CONFIG1 = 0-7*1,8-12*0
CONFIG2 = 0-7*0,8-12*1

But I'm planning to re-design the whole thing though, to prepare support for the "repeated layers" option

dnhkng · 2024-02-29T10:45:28Z

dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 10]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [5, 15]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [10, 20]
    model: 152334H/miqu-1-70b-sf
...

This would result in:
0,1,2,3,4,5,6,7,8,9,5,6,7,8,9,10,11,12,13,14,10,11,12,13,14,15,16,17,18,19...
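The expansion above can be reproduced with a few lines of Python. This is only a sketch of how the passthrough slices unfold: `expand_slices` is a hypothetical helper, and the end of each `layer_range` is treated as exclusive because that is what the listed result implies.

```python
def expand_slices(layer_ranges):
    """Flatten [start, end) ranges into the ordered list of source layer indices."""
    order = []
    for start, end in layer_ranges:
        order.extend(range(start, end))  # end is exclusive, matching the list above
    return order


order = expand_slices([(0, 10), (5, 15), (10, 20)])
print(",".join(map(str, order)))
print(len(order), "output layers from", len(set(order)), "distinct source layers")
```

The output model has more layers than the number of distinct source layers it draws from, since every overlap is stored twice.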

This is why Frankenmerge models are larger than base models.

Personally, I would be interested in a hybrid approach, with the ability to merge and layer!
i.e. we want this particular output from 2 models (for one model, we could just use it again as the second model), which we'll call 'a' and 'b' for brevity. We want to use a mixture of interleaving and layer merging to get this final output. In this case, the first 3 layers are from model a, the fourth is a mix of model a+b, and the next few layers repeat layers from model b:
[a0, a1, a2, a3*0.5+b3*0.5, b4, b5, b6, b5, b6, b7]

Trying to stay with your parameter notation, the closest I could get for the 2 configs would be:
model_a 0-2*1,3*0.5,0-5*0 model_b 0-2*0,3*0.5,4-6*1,5-7*1

As both configs must be the same length, for model_a we used 0-5*0 as filler at the end.
Does that make sense?

ngxson · 2024-02-29T11:41:28Z

Thanks for the explanation.

> This is why Frankenmerge models are larger than base models.

According to discussion #4718, the gguf format may benefit from pointing 2 weights in the metadata to the same tensor; this way we can have 2 or more layers using the same weights. I haven't tried this though, but surely it's essential if we want to have repeated layers.

> Personally, I would be interested in a hybrid approach, with the ability to merge and layer!
>
> Trying to stay with your parameter notation, the closest I could get for the 2 configs would be: model_a 0-2*1,3*0.5,0-5*0 model_b 0-2*0,3*0.5,4-6*1,5-7*1

Having both merge + repeated layers is great. But for that, I think the whole notation that I invented 0-2*1,3*0.5,0-5*0 is just far too limited. I propose more readable syntax (written to a file) like:

a0*1 + b0*0
a0*1 + b0*0
a1*0 + b1*1

The file above results in output model having:

Layer 0: Model A layer 0
Layer 1: Model A layer 0
Layer 2: Model B layer 1
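To make the proposed line-per-layer syntax concrete, here is a small Python sketch that parses it. The syntax was only a proposal at this point in the thread, so the grammar below (a single letter for the model, a layer number, `*`, a scale, terms joined by `+`) is an assumption, and `parse_line` is a hypothetical name.

```python
import re

# One term looks like "a0*1": model letter, layer index, scale.
TERM = re.compile(r"^([a-z])(\d+)\*([0-9.]+)$")


def parse_line(line):
    """Parse 'a0*1 + b0*0' into [(model, layer, scale), ...]."""
    terms = []
    for chunk in line.split("+"):
        match = TERM.match(chunk.strip())
        if match is None:
            raise ValueError(f"bad term: {chunk!r}")
        model, layer, scale = match.groups()
        terms.append((model, int(layer), float(scale)))
    return terms


config = """a0*1 + b0*0
a0*1 + b0*0
a1*0 + b1*1"""

# Each line of the file describes one output layer, in order.
plan = [parse_line(line) for line in config.splitlines()]
for out_layer, terms in enumerate(plan):
    print(out_layer, terms)
```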

It's not as robust as lazy merge kit syntax (yml), but give us more space to improve in the future.

Additionally, someone can easily write a Python script to convert lazy merge kit yml to my syntax.

What do you think about this approach?

dnhkng · 2024-02-29T12:19:09Z

Sure, I think we should do it. I was about to start testing Mergekit now, but I can quickly switch gears and write a Python converter script.

> According to discussion #4718, the gguf format may benefit from pointing 2 weights in the metadata to the same tensor; this way we can have 2 or more layers using the same weights. I haven't tried this though, but surely it's essential if we want to have repeated layers.

Yes, that would be a better method. I have a large model I know quite well that I've merged manually in ExllamaV2. It took a bit to sort out KV caching though, and there are issues when the model spans multiple GPUs. At first, I would just duplicate.

If you can generate the merging code, I can compare the results of your method to the measured result of my merge.

Update: I could write the Python converter, but now that I look in more detail, I think the layer-by-layer method here is much more powerful. Mergekit only allows either slice interleaving OR linear/spherical interpolation of all layers. The config model you describe is more verbose, but much more powerful. I would prefer that TBH.

TBH, there are two options, 1) easy parsing with just 3 values:

model-a layer, model-b layer, weight of model-a
0,0,1
0,0,1
1,1,0
2,2,0.5

Or YAML, and give all the details:

sources:
  - model-a: 152334H/miqu-1-70b-sf
  - model-b: 152334H/other-model-b-70b-sf
  - model-c: 152334H/other-model-c-70b-sf      # we can then add as many models as we want
layers:
  - 1:
    model-a:
      layer_source: 1
      weight: 0.5
    model-b:
      layer_source: 1
      weight: 0.5
    method: linear               # and offer various interpolation methods
  - 2:
    model-a:
      layer_source: 2
      weight: 0.0
    model-b:
      layer_source: 2
      weight: 1.0
    method: linear
  - 3:
    model-a:
      layer_source: 3
      weight: 0.3
    model-b:
      layer_source: 5
      weight: 0.3
    model-c:
      layer_source: 5
      weight: 0.4
    method: slerp
  - 4:
    model-a:
      layer_source: 4
      weight: 1.0
    method: none               # and do straight passthrough of a single layer if needed

ngxson · 2024-02-29T12:53:05Z

Thanks for the input, I'll need to rework this PR in the next days.

Regarding the format, I still think having the ability to specify the weights of a and b separately can be interesting. I don't know what will happen if we take weightA*0.5 + weightB*0.6 for example (so the total weight becomes 1.1). It's also useful when you merge 3 models: the first pass can have weightA*0.33 + weightB*0.33, then the second pass adds weightC*0.33.

The csv format should simplify the cpp parser code though, I'll consider that.

YML format is readable, but unfortunately we can never include a yml parser in llama.cpp.

However, having it as the input of your python script (and the python convert that yml into csv or something llama.cpp can understand) will be very useful.

dnhkng · 2024-02-29T13:44:15Z

Yes, the YAML could be converted to CSV easily, if we leave out various interpolation types.

For completeness, I would explicitly put in all weights, and normalise to reach a sum of 1.0
i.e. for two models:

model-a layer, model-b layer, weight of model-a, weight of model-b
0,0,1.0,0.0
0,0,1.0,0.0
1,1,0.0,1.0
2,2,0.5,0.5

and for three models:

model-a layer, model-b layer, model-c layer, weight of model-a, weight of model-b, weight of model-c
0,0,0, 1.0,0.0, 0.0
0,0,0, 1.0,0.0, 0.0
1,1,1, 0.0,1.0, 0.0
2,2,2,0.5,0.5,0
3,3,3,0.3,0.3,0.3

The last layer here gets normalised to 1/3, 1/3, 1/3.
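The normalisation step mentioned above is just a division by the sum of the weights. A minimal Python sketch, with `normalise` as a hypothetical helper name:

```python
def normalise(weights):
    """Scale the weights so that they sum to 1.0."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return [w / total for w in weights]


print(normalise([0.3, 0.3, 0.3]))  # each entry ends up close to 1/3
print(normalise([0.5, 0.5, 0.0]))  # already sums to 1.0, so it is unchanged
```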

ngxson · 2024-03-01T13:57:21Z

@dnhkng I updated my PR to have the ability to:

Merge multiple models at once (not just 2 models)
Use the CSV format that we discussed

To simplify my CSV parsing code, I chose the column order "model - scale - model - scale" (instead of "model - model - scale - scale").

0,1.0,0,0.0    meaning: output layer 0 = A[0]*1.0 + B[0] * 0.0
0,1.0,0,0.0    meaning: output layer 1 = A[0]*1.0 + B[0] * 0.0
1,0.0,2,0.0    meaning: output layer 2 = A[1]*0.0 + B[2] * 0.0
2,0.5,1,0.5    meaning: output layer 3 = A[2]*0.5 + B[1] * 0.5

If you add the third model, the columns become "model - scale - model - scale - model - scale"
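The row layout above can be read as a list of (layer, scale) pairs, one pair per model. The following Python sketch parses a row and applies the weighted sum to toy data. It is not the PR's C++ code: `parse_row` and `merge_layer` are hypothetical names, and each layer is represented as a flat list of floats purely for illustration.

```python
def parse_row(row):
    """Split 'layer,scale,layer,scale,...' into [(layer, scale), ...]."""
    fields = [f.strip() for f in row.split(",")]
    if len(fields) % 2 != 0:
        raise ValueError("expected layer,scale pairs")
    return [(int(fields[i]), float(fields[i + 1])) for i in range(0, len(fields), 2)]


def merge_layer(row, models):
    """Weighted sum of one layer per model; models[i][layer] is a flat list of floats."""
    pairs = parse_row(row)
    tensors = [models[i][layer] for i, (layer, _) in enumerate(pairs)]
    return [
        sum(scale * t[k] for (_, scale), t in zip(pairs, tensors))
        for k in range(len(tensors[0]))
    ]


# Toy "models": three layers each, two values per layer.
model_a = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
model_b = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]

# Output layer = A[2]*0.5 + B[1]*0.5
print(merge_layer("2,0.5,1,0.5", [model_a, model_b]))
```

Adding a third model only means appending another layer,scale pair to the row and another entry to `models`.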

I tried it myself and confirmed that the output model can be loaded and can run inference without any problem. What I could not verify is whether the merging result (semantic result) is good or not (in other words, whether it did A*scale + B*scale correctly). Can you verify this? Thank you!

ngxson · 2024-03-01T14:01:07Z

FYI, I was also thinking adding ability to merge quantized model, but at this stage it's quite tricky: I must dequantize it, do calculations with float then re-quantize it again. Currently I'm staying with single-thread model for simplification, but the whole "dequant-requant" thing should be done with multi-threading, too tricky for now.

dnhkng · 2024-03-01T14:24:51Z

Could you add a branch for pass-through (no linear interpolation) of quantized models?

I have a use case for that right now!

i.e. a single quantized model, with repeating layers.

The issue is that, from my tests, model self-merging only starts to help from 34B models and up. At FP16, that's a huge amount of RAM required!

I have a model that is a positive outlier on a difficult LLM benchmark, so it should be relatively clear whether the merge worked. It's a 70B model, so I'll need to run the tests on an 80Gb GPU. Interpolating layers would be an added benefit in the future though!

I will pull your code and try on FP16 Llama7B now, but I know all outputs will be worse than the base model. However, I know regions of "really bad", and "slightly bad", so I can see if it is at least making sense.

ngxson · 2024-03-01T14:32:07Z

I'll try quantized model later. At least, loading a q4_K model then output it as f16 is not too complicated. Only requant part is too tricky for me.

Also, just for my curiosity: if you merge the model then use ./quantize to re-quant it again, does that work for you? This way it takes a lot of disk space, but you'll eventually get a model small enough to fit into RAM.

One thing I'll try to work on is ability to re-use same tensor for repeated layer. For now, if the output model has duplicated layer, the associated tensor data will be duplicated (not ideal)

dnhkng · 2024-03-01T14:34:52Z

Reusing layers makes sense, but the caching is tricky.

There's a discussion on my pull request for ExllamaV2 here: turboderp/exllamav2#275

dnhkng · 2024-03-01T14:37:01Z

> I'll try quantized model later. At least, loading a q4_K model then output it as f16 is not too complicated. Only requant part is too tricky for me.
>
> Also, just for my curiosity: if you merge the model then use ./quantize to re-quant it again, does that work for you? This way it takes a lot of disk space, but you'll eventually get a model small enough to fit into RAM.
>
> One thing I'll try to work on is the ability to re-use same tensor for repeated layers. For now, if the output model has duplicated layer, the associated tensor data will be duplicated (not ideal)

I can try Q4 -> FP16 and re-quantization. I'll keep watching this pull request, and test it when it's ready. Intermediate disk space is fine, I have a few SSD Tb free ;)

ngxson · 2024-03-01T14:40:13Z

> Reusing layers makes sense, but the caching is tricky.

Personally thinking, shared cache among layers is not something technically possible though. While the weight is the same, KV is calculated by embedding from the layers before it (correct me if I'm wrong).

For example, when you have 2 consecutive layers having same weight W[0] == W[1], then KV[1] = W[1]*(W[0]*KV[0])

P/s: I'm actually bad at math when I was in high school / university. Nowadays with all these machine learning stuff, I still imagine "tensor" to be "rubik cube" in my head

dnhkng · 2024-03-01T14:43:50Z

> > Reusing layers makes sense, but the caching is tricky.
>
> Personally thinking, shared cache among layers is not something technically possible though. While the weight is the same, KV is calculated by embedding from the layers before it (correct me if I'm wrong).
>
> For example, when you have 2 consecutive layers having same weight W[0] == W[1], then KV[1] = W[1]*(W[0]*KV[0])

Yes, you can't share cache, it would get overwritten on the higher layer processing... But it still works! The results are worse though, but that's not unexpected. The fact that it even slightly works is crazy though.

I have done quite a lot of testing on various permutations of layers, and most are worse. but there are a few interesting combinations. GGUF would be the best way to share them, as going via FP16 torch tensors, then merging, then converting to GGUF and finally quantization seems like a lot of wasted effort! Better to experiment in ExllamaV2 dynamically and build and distribute in GGUF.

dnhkng · 2024-03-02T07:28:52Z

Tested it with a self-merge today on F16, and it looks good!
Models self-merge repeats I know are bad are also bad with your code, and good models also look good. Passes the first subjective tests :)

I will fire up an evaluation pipeline over the weekend, and do more extensive testing.

Just to clarify:
Does it do interpolation too, with quants? That would be amazing!

Also, Mergekit offers Spherical linear interpolation (SLERP). This seems to offer better merges. (brief description here).

ngxson · 2024-03-02T09:46:05Z

Thanks! Glad to know that it works in your test.

Only linear merging is supported for now. SLERP is interesting too and technically possible (because internally we dequantize all matrix to float). However I think I'll do that in later stage (or in another branch).

What's not clear to me though: SLERP works with vectors, but we have matrices as model weights. How can SLERP apply to a matrix? For example, will a 4x4 matrix be considered as a vector of 16 dimensions, or as 4 vectors of 4 dimensions each?

dnhkng · 2024-03-02T10:01:08Z

> What's not clear to me though: SLERP works with vectors, but we have matrices as model weights. How can SLERP apply to a matrix? For example, will a 4x4 matrix be considered as a vector of 16 dimensions, or as 4 vectors of 4 dimensions each?

In PyTorch it seems straightforward. The implementation is here, from line 94:
https://github.com/arcee-ai/mergekit/blob/main/mergekit/merge_methods/slerp.py

I have just bought some cloud compute to test the merged model; I need 80Gb VRAM for it to run at a useful speed. It will take a few hours at least.

ngxson · 2024-03-02T10:09:36Z

Oh ok thanks for the info. It seems like in the Python code, there is no place where the tensor view is changed to 1d. That means it keeps one row of the matrix == one vector.
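For reference, SLERP on a single pair of vectors can be sketched in plain Python as below. This is a generic textbook formulation, not a copy of mergekit's or this PR's implementation; the function name, the `eps` threshold, and the fallback to linear interpolation for degenerate inputs are assumptions. How a weight matrix is cut into vectors (per row, or flattened into one long vector) is left to the caller and is not settled by this sketch.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 < eps or norm1 < eps:
        # A zero vector has no direction: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    # Cosine of the angle between the two vectors, clamped for acos.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # (Anti)parallel vectors: the formula is ill-conditioned, use linear.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

Applying it row by row would look like `[slerp(t, r0, r1) for r0, r1 in zip(m0, m1)]`, while the flattened variant passes all elements of each matrix as one vector.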

I can wait, don't worry. I'm trying to refactor the re-quantization part in another PR, so we should get some more performance when having quantized model as output.

dnhkng · 2024-03-02T11:12:23Z

> I can wait, don't worry. I'm trying to refactor the re-quantization part in another PR, so we should get some more performance when having quantized model as output.

Great! I'm merging a 70B model, and it's not super fast. Many layers have a 1.0/0.0 weight ratio. Maybe as a backlog item: if a new layer has 100% weight from one model, skip dequantization, merging and re-quantization, and just pass through the layer with 100% weight. Not urgent though. It looks like the merge will take about 30 minutes.

ngxson · 2024-03-02T11:15:24Z

FYI, I've just pushed a refactor commit that has better multi-thread usage for re-quant operation (using same code as ./quantize tool). You'll now be able to utilize almost 100% CPU for doing re-quant.

ngxson · 2024-03-02T20:20:28Z

I had a look at mergekit + slerp today. I think I can add slerp in this PR, as it makes more sense than the linear method. However, I will need to re-invent my input format.

On the blog article, they target specifically some tensors, for example self_attn or mlp

parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5

The current CSV format does not allow specifying scaling at tensor level. Therefore, I propose a new format which is inspired by assembly language:

---
all slerp 0,0,0.9
attn_output slerp 0,0,0.9
---
all linear 1,1,0.6,0.4
attn_output slerp 1,1,0.9
---
all linear 2,2,1.0,0.0
---
# repeat the first layer defined earlier in this file
repeat 0
---
repeat 1
---
...

Each --- means a new output layer.
Then, each instruction is in the format tensor (space) verb (space) arguments. Verbs that we can have now:
- linear with arguments in order of source_layer,source_layer,scale,scale
- slerp with arguments in order of source_layer,source_layer,t
- repeat to repeat a layer in the same output model
- Other methods like ties or dare can be added in the future. I also thought about copy, which simply copies the layer from one of the source models to the output model

For simplicity, we will only allow merging 2 models for now (no more than 2).

I don't know if it's too complicated for your converter script @dnhkng ?

dnhkng · 2024-03-02T21:44:19Z

OK, the 70B Model merge looks interesting.

The merges go in the same direction I see with ExllamaV2, so I think everything is working OK!

I have one small issue, that I'm trying to figure out still though. I use EQ-Bench to test the models, and weirdly, using llama.cpp server I get significantly worse results than using exllama via oobabooga. The relative changes are all correct, but the absolute scores for the llama.cpp backend are about 75.5, using the original leaked Miqu Q4/5 weights. However, I get a score of 82.7 for the Q4 weights with exllamaV2! A 7 points difference here is massive.

This is extra weird, as the exllama weights are just the Miqu weights that have been de-quantized, converted and re-quantized, so you would not expect them to be so much better (I would expect them to be slightly worse). I've made an issue at the benchmark repo, but maybe someone here might know why this is the case.

@ngxson

> I don't know if it's too complicated for your converter script @dnhkng ?

All good, write a config style you like, and I'll write up a python converter :)

So long as the format is sensible, it should be easy to generate a High-Level abstraction. The fallback is to write low-level by hand, for unusual cases. The combination is powerful.

ngxson · 2024-03-03T10:19:58Z

For the benchmark difference between llama.cpp server and exllama, apart from the chat template that I discuss in the other issue, maybe it's also because the KV cache of llama.cpp is f16 by default. (Idk if exllama uses f16 or bf16 or f32 for KV; note that even when the model is quantized, the KV cache may not be.)

I'll start working on the slerp and the new input format today, as the current implementation already outputs a usable result.

ngxson · 2024-03-03T18:03:15Z

@dnhkng I added the new format and SLERP. It's slightly different from what I proposed above, but will be easier to understand:

output layer 0
all slerp 0,0,0.1
attn_output slerp 0,0,0.5

output layer 1
all linear 1,1,0.6,0.4
attn_output slerp 1,1,0.9

output layer 2
all copy 0,2

...

You can have a look at config.example.txt for a complete example.
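As a rough illustration of how a converter or checker could read this format, here is a Python sketch of a parser. It is based only on the snippet shown above, not on config.example.txt or the PR's C++ parser: the function name is hypothetical, treating `#` as a comment marker is an assumption, and the arguments are kept as plain numbers without interpreting what each position means.

```python
def parse_merge_config(text):
    """Return {output_layer: [(tensor, verb, [args...]), ...]}."""
    layers = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        if line.startswith("output layer"):
            current = int(line.split()[2])
            layers[current] = []
            continue
        if current is None:
            raise ValueError(f"instruction before any 'output layer': {line!r}")
        # Each instruction is "<tensor> <verb> <comma-separated args>".
        tensor, verb, arg_text = line.split(None, 2)
        args = [float(a) for a in arg_text.split(",")]
        layers[current].append((tensor, verb, args))
    return layers


example = """
output layer 0
all slerp 0,0,0.1
attn_output slerp 0,0,0.5

output layer 1
all linear 1,1,0.6,0.4
attn_output slerp 1,1,0.9

output layer 2
all copy 0,2
"""

for layer, instructions in parse_merge_config(example).items():
    print(layer, instructions)
```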

I've tried merging a dolphin-mistral with vistral (mistral but finetuned to understand Vietnamese). The output model does speak mixed eng-viet, which indicates that my code kinda works. The merge config used is config.example.txt

Feel free to ask if something is not clear for you. Thank you!

dnhkng · 2024-03-03T18:07:15Z

I'll try and write a high level python configuration generator for the new format.

jukofyork · 2024-03-05T01:27:28Z

If anyone is interested, then I think we should in theory be able to get a better estimate of the original fp16 values for the Miqu model by combining the q_5, q_4 and q_2 quantized values.

I don't really know what criteria llama.cpp is using to quantize the values, but I assume it's to minimise the least squares error? If so then I think we can assume the values come from a normal distribution and then work out the correct weighting factor for the 3 different bin centres we have for every original fp16 value that was quantized.

This obviously won't work for some distributional assumptions, eg: if the original fp16 values came from a uniform distribution, then knowing which of the 4 bins the q_2 came from and which of the 16 bins the q_4 came from gives us no extra information over knowing which of the 32 bins the q_5 came from and the maximum likelihood estimate is still just the centre of the q_5 bin (assuming the bin boundaries all align anyway).
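The aligned-bin case in the parenthesis above can be checked with a toy simulation. This Python sketch uses plain uniform bins over a fixed range, which is a deliberate simplification and not how llama.cpp's quantization formats actually work; `bin_index` and the range of -4 to 4 are assumptions chosen for illustration.

```python
import random


def bin_index(x, bits, lo=-4.0, hi=4.0):
    """Index of the uniform bin containing x when [lo, hi) is split into 2**bits bins."""
    n = 2 ** bits
    i = int((x - lo) / (hi - lo) * n)
    return max(0, min(n - 1, i))  # clamp values that fall outside the range


random.seed(0)
samples = [random.uniform(-4.0, 4.0) for _ in range(10_000)]

# With aligned, nested bins the 4-bit and 2-bit indices are a pure function
# of the 5-bit index, so they carry no additional information.
redundant = all(
    bin_index(x, 4) == bin_index(x, 5) // 2 and bin_index(x, 2) == bin_index(x, 5) // 8
    for x in samples
)
print(redundant)
```

Any extra information from the coarser quantizations would therefore have to come from bin boundaries that do not line up with the finer ones.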

But I think the values are pretty likely to have come from an approximately normal distribution (especially due to all the layer norms in the model, etc) and the correct weighting factors should be findable either analytically or empirically.

Without explicitly working it out, I think the weights will likely be something like the #bins ratio squared (ie: using the conjugate prior formula), but I'm pretty sure it could be worked out empirically quite easily if we know the exact criteria the quantization is using.

It probably won't be a huge increase and at best be around the level of q_6, but it would likely be useful for those remerging the de-quantized fp16 model off huggingface.

jukofyork · 2024-03-05T02:05:41Z

Yeah, to work it out analytically looks quite hard:

https://openaccess.thecvf.com/content_CVPRW_2020/papers/w40/Pouransari_Least_Squares_Binary_Quantization_of_Neural_Networks_CVPRW_2020_paper.pdf

but it wouldn't be hard to estimate the weights empirically as we could just simulate the forward quantization process used to create a q_5, q_4, and q_2 of a standard normal (using least squares criteria or whatever llama.cpp is using) and then find the optimal weighting factors to get the maximum likelihood estimate of the original fp16 value (or something similar anyway).

It may turn out to be a different weighting factor for each of the 32×16×4 combinations, but even this wouldn't be hard to find empirically via simulation.

ngxson · 2024-03-05T09:38:28Z

> This obviously won't work for some distributional assumptions, eg: if the original fp16 values came from a uniform distribution, then knowing which of the 4 bins the q_2 came from and which of the 16 bins the q_4 came from gives us no extra information over knowing which of the 32 bins the q_5 came from and the maximum likelihood estimate is still just the centre of the q_5 bin (assuming the bin boundaries all align anyway).

I agree with that: since we're using qX_K and not qX_0 or qX_1, the difference between 16 bins of q4 and 32 bins of q5 is not that much. Throwing q2 into the equation may make it worse. I assume that dequantizing q5 is already the best result we can get.

ngxson · 2024-03-05T09:41:12Z

Btw @dnhkng I came across mergekit's code for merging the embedding & output layers; it seems like it's also an important part of improving the quality of the output model. I'll try to implement that this week, but it's quite tricky because sometimes we have models with different vocab sizes (i.e. added special tokens).

dnhkng · 2024-03-05T17:07:04Z

> Btw @dnhkng I came across mergekit's code for merging the embedding & output layers; it seems like it's also an important part of improving the quality of the output model. I'll try to implement that this week, but it's quite tricky because sometimes we have models with different vocab sizes (i.e. added special tokens).

Will that mean a new format for the configuration?

ngxson · 2024-03-05T21:52:12Z

> Will that mean a new format for the configuration?

No, don't worry, it will be just an additional (optional) command to add to the current format

dnhkng · 2024-03-09T21:35:37Z

OK, I've written a YAML parser that converts high-level config files to your format, including some quite complex merges.

ngxson#3

ngxson · 2024-03-09T21:55:30Z

@dnhkng Thank you! Seems good, I'll try it tomorrow

sorasoras · 2024-04-17T16:58:37Z

> @dnhkng Thank you! Seems good, I'll try it tomorrow

Haven't seen any progress. Any update?

ngxson · 2024-04-18T02:36:34Z

Yeah sorry, I've been quite busy since then. The Python converter script looks good, but merging this PR (the part that I made) into master is quite risky, since it's quite huge and I doubt anyone will find it helpful in the future.

For now, I think we can consider this PR as a demo. But you can feel free to let me know if you want to change something else.

ngxson added 3 commits February 23, 2024 21:54

wip: model merge

c86d5f2

sync

4858257

first working version

df9fb7e

ngxson added the help wanted Extra attention is needed label Feb 26, 2024

ngxson changed the title ~~Add model merge example~~ WIP: Add model merge example Feb 26, 2024

merge: missing output in help

b4a70fc

fix type

f50bf00

ggerganov mentioned this pull request Feb 29, 2024

in situ auto-Frankenmerges #4718

Open

ngxson added 3 commits February 29, 2024 14:48

Merge branch 'master' into xsn/model_merge

b6da762

merge: new input format

3e6e366

merge: try..catch

2cfae6d

ngxson mentioned this pull request Mar 2, 2024

Refactor multi-thread quantize #5830

Merged

refactor

52186ad

ngxson added 3 commits March 3, 2024 16:06

wip: new format

6573043

self merge ok

a032bb6

implement slerp

10c477b

dnhkng mentioned this pull request Mar 7, 2024

Merging quantized model with pass through arcee-ai/mergekit#184

Open

dnhkng mentioned this pull request Mar 16, 2024

model test request EQ-bench/EQ-Bench#20

Closed

ngxson added the demo Demonstrate some concept or idea, not intended to be merged label Apr 18, 2024

ngxson mentioned this pull request Jun 11, 2024

Add support for control vectors #5970

Merged
