Replit + MPT #145
Conversation
I can also merge both models into one example if that's preferred.
Does this support the MPT-7B-StoryWriter model with 65k context length?
I didn't try it yet, but to my knowledge there is no architectural difference between the base model and the StoryWriter fine-tuned model. I can potentially try it tomorrow.
Thank you so much @lukasmoellerch for building support for MPT models! All of us @ MosaicML are very excited :)
One thing I want to point out: the StoryWriter model does have two arch changes vs. the other models. It uses a different alibi_bias_max and clamps the QKV values.
Adding support for max_alibi_bias seems quite doable, @lukasmoellerch. I'm trying out a qkv_clamp on my branch, let me see if I can get that done by tomorrow. Really looking forward to trying out StoryWriter on CPU, although I wonder how much RAM is required to support 65k context length.
I found an alternative implementation by the folks from nomic-ai. They didn't implement qkv clamping, didn't fix the alibi bug, and didn't implement alibi_max_bias either. But it's a cool reference, I think.
Sounds good. It seems like a lot of people are excited about the StoryWriter model, so let's get it integrated as well; both modifications sound rather straightforward. @Leoputera2407 can you share what you've done regarding qkv clipping so far?
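For context, the qkv clamping amounts to clipping the fused QKV projection output before it is split into Q, K and V. A rough sketch in ggml terms, using the ggml_clamp operator discussed later in this review and hypothetical tensor/field names (clamp_qkv follows the HF MPT config), not necessarily the exact code in this PR:

```cpp
// illustrative: clamp the fused QKV projection to [-clamp_qkv, clamp_qkv]
// before it is split into Q, K and V
cur = ggml_mul_mat(ctx0, model.layers[il].attn_wqkv_weight, cur);

if (hparams.clamp_qkv > 0.0f) {
    cur = ggml_clamp(ctx0, cur, -hparams.clamp_qkv, hparams.clamp_qkv);
}
```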
Inference of the StoryWriter model fails:
Quantized model:
The base model works.
29 gigs, anyone? Are you guys using F16 for the KV?
It seems like the context size calculation is actually wrong, but in the wrong direction, i.e. it calculates with F32 but later allocates an F16... I'll investigate later.
I found a solution: I changed the types in the ctx_size calculation to uint64_t and changed memory_k and memory_v to type F16.
Working output:
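A minimal sketch of that fix, assuming the variable names used in the other ggml examples (the exact code in the PR may differ):

```cpp
// do the size arithmetic in 64-bit to avoid overflow at 65k context,
// and account for the KV cache as F16 rather than F32
const uint64_t n_embd  = hparams.n_embd;
const uint64_t n_layer = hparams.n_layer;
const uint64_t n_ctx   = hparams.n_ctx;

uint64_t ctx_size = 0;
// ... weight tensors ...
ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_k
ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_v

// allocate the KV cache with the same type used in the size estimate
model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_embd*n_layer*n_ctx);
model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_embd*n_layer*n_ctx);
```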
The generation is not very good; it seems to keep repeating itself after a while. Base model:
StoryWriter model:
Yeah, I can see the same.
There seems to be no repeat penalty and other stuff; it needs some updates from the llama.cpp codebase. Edit: I am using q4_0, and it seems to die faster. We also need an option to set the ctx size - since we preallocate, I can't run the F16.
StoryWriter does love to repeat itself, but I've found some settings that you can use with HF generate that tend to work pretty well:
temperature: 0.8
top_p: 1.0
top_k: 0
repetition_penalty: 1.02
no_repeat_ngram_size: 6
These are the same settings you get by default in our demo space https://huggingface.co/spaces/mosaicml/mpt-7b-storywriter
I would love to see the common infrastructure of llama.cpp become something like "ggml-llm", and the code for the specific LLM architectures (llama, gpt-2, gpt-j, mpt and others) become like add-ons at compile time.
I uploaded some ggml files so we can test this more easily: https://huggingface.co/Green-Sky/ggml-mpt-7b-storywriter
Looks like great progress - will be taking a more detailed look soon
Yes, this would be great. Now that we have various examples of LLM inference and I have a better understanding of the general API structure that is necessary, it will be easier to come up with a way to unify all these into a single interface.
@ggerganov I think we might want to separate the model max_seq_length (which is used e.g. in the alibi bias offset) from the amount of k/v memory slots we allocate for inference. I temporarily hardcoded n_ctx in mpt to 4096 because otherwise my MacBook Air wasn't too happy, but this should probably be an inference parameter - should we at least just set it to
@lukasmoellerch can't run the StoryWriter model anymore 😆, with or without quantization.
After much trial and error I found a formula for setting the memory buffer, as it needs more space with each evaluated token. The formula only works if n_batch is equal to 1.
Or, even better, set it in main to:
with buf declared globally. n_predict should be forced to always be equal to or lower than n_ctx; both can be controlled from the command line.
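The exact formula isn't reproduced above. Purely as an illustration, the pattern used in the other ggml examples grows a static eval buffer from a measured mem_per_token, roughly like this (mem_per_token, n_past and N are the usual names for the per-token cost and the evaluated/incoming token counts in those examples):

```cpp
// illustrative only: grow the eval buffer as more tokens are processed;
// mem_per_token is measured on a warm-up eval, as in the gpt-2/gpt-j examples
static size_t buf_size = 256u*1024*1024;
static void * buf = malloc(buf_size);

if (mem_per_token > 0 && mem_per_token*(n_past + N) > buf_size) {
    // add some headroom for the ggml object overhead
    const size_t buf_size_new = 1.2*mem_per_token*(n_past + N);

    buf_size = buf_size_new;
    buf = realloc(buf, buf_size);
    if (buf == nullptr) {
        fprintf(stderr, "%s: failed to allocate %zu bytes\n", __func__, buf_size);
        return false;
    }
}
```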
Yes, I just merged master with the quantisation changes, which means that the object overhead adjustment wasn't correct. Still a bit broken, though.
nvm, just had to re-quantize the model
Yes, I think this is a good workaround for now. Regarding the memory usage during inference, there are two things that can help to reduce it:
The second one should significantly reduce the memory usage, but it is quite tricky to get right since the process is manual and very easy to mess up. We can do both of these optimizations in a later PR, if you prefer to get this merged soon.
I looked into this a bit, but can't really get it to be clean. n_predict is the number of additional tokens, thus the total number of tokens depends on the tokenizer being loaded. On the other hand, we need the prompt to know the number of tokens, which we don't have in load (and also shouldn't have). I think we either want to separate k/v tensor creation from model loading or have it as a separate parameter. But I can also do that in a follow-up PR. Let me know what I can still do in this PR.
examples/replit/main.cpp (outdated)

    // a = self.ln_1(x)
    {
        cur = ggml_norm(ctx0, inpL);
The Python implementation uses LayerNorm - double check whether this corresponds to ggml_norm or ggml_rms_norm. I'm not sure where to look for the source code of LayerNorm.
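For reference (an aside, not part of the original comment): torch.nn.LayerNorm subtracts the mean, which is what ggml_norm computes (the learned scale and shift are applied separately in the ggml examples), while ggml_rms_norm corresponds to RMSNorm, which skips the mean subtraction:

$$
\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta,
\qquad
\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}} \odot \gamma
$$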
@@ -6219,6 +6226,36 @@ struct ggml_tensor * ggml_alibi(
    return result;
}

// ggml_alibi
Suggested change: // ggml_alibi → // ggml_clamp
@@ -10831,6 +10871,79 @@ static void ggml_compute_forward_alibi(
    }
}

// ggml_compute_forward_alibi
Suggested change: // ggml_compute_forward_alibi → // ggml_compute_forward_clamp
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 3);
    ((float *) b->data)[0] = min;
    ((float *) b->data)[1] = max;
This has to be surrounded with ggml_scratch_save() and ggml_scratch_load():

Lines 3925 to 3939 in 010203f:

    // IMPORTANT:
    // when creating "opt" tensors, always save and load the scratch buffer
    // this is an error prone process, but it is necessary to support inplace
    // operators when using scratch buffers
    // TODO: implement a better way
    void ggml_scratch_save(struct ggml_context * ctx) {
        ctx->scratch_save = ctx->scratch;
        ctx->scratch.data = NULL;
    }

    void ggml_scratch_load(struct ggml_context * ctx) {
        ctx->scratch = ctx->scratch_save;
    }

See how they are used in other operators that pass parameters like this. This is needed to support scratch buffers later.
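Applied to the snippet above, the suggested change would look roughly like this (a sketch, not necessarily the exact code that ended up in the PR):

```cpp
// save/restore the scratch buffer around the creation of the parameter tensor,
// so that ggml_clamp keeps working once scratch buffers are enabled
ggml_scratch_save(ctx);

struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 3);

((float *) b->data)[0] = min;
((float *) b->data)[1] = max;

ggml_scratch_load(ctx);
```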
@lukasmoellerch and everyone else - thanks for this contribution. I'll probably play with this in the next few days and will try to improve the memory allocation logic.
Thanks for your patience, I really like the project - let me know if any follow-up PRs are required; I'd be willing to work on them, I was just a bit busy with other stuff last week.
I just tried it today and the StoryWriter model is still repeating the story over and over again. Is there any trick to avoid it?
A repeat penalty is being implemented in PR #184.
Note that repetition_penalty from PR #184 (and also as implemented in llama.cpp) is not the same as no_repeat_ngram_size, which is what the MPT-7B HuggingFace space uses (https://huggingface.co/spaces/mosaicml/mpt-7b-storywriter).
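Roughly, the llama.cpp-style penalty rescales the logits of recently generated tokens, whereas no_repeat_ngram_size hard-bans any n-gram that has already appeared. A sketch of the former (variable names are illustrative, not taken from the PR):

```cpp
#include <vector>

// illustrative sketch of a llama.cpp-style repeat penalty: before sampling,
// push down the logits of every token seen in the recent window
void apply_repeat_penalty(std::vector<float> & logits,
                          const std::vector<int> & last_n_tokens,
                          float repeat_penalty) {
    for (const int token_id : last_n_tokens) {
        if (logits[token_id] > 0.0f) {
            logits[token_id] /= repeat_penalty; // shrink positive logits
        } else {
            logits[token_id] *= repeat_penalty; // push negative logits further down
        }
    }
}
```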
Implements #131 #136
Adds example code for mpt (https://huggingface.co/mosaicml/mpt-7b) and replit (https://huggingface.co/replit/replit-code-v1-3b). The code isn't too clean at the moment and I'll happily clean things up and implement suggestions, but I might only be able to spend more time on this over the weekend.
Some hyperparameters are hardcoded, such as the ffn/mlp ratio and the alibi max bias. Also, not all MPT-style models are supported: qkv clamping isn't implemented, and a couple of other options aren't considered either.
The unigram tokenizer is comparably slow, but implementing a good one would add considerably more code to the example.
Also: thank you @Leoputera2407 for helping with debugging the alibi problem.