llama : add support for llama2.c models #2379
Comments
I can take a stab at it. Been meaning to dive deeper into the GGML format. Would an existing model (HF/PyTorch) serve as a good starting point?
Trying to add this. The easiest-to-understand description of the file format is probably in the training example here: https://github.com/ggerganov/llama.cpp/blob/41c674161fb2459bdf7806d1eebead15bc5d046e/examples/train-text-from-scratch/train-text-from-scratch.cpp#L2609
@ggerganov why not use the safetensors format? It seems way more practical than custom binary ggml formats.
@Mistobaan See this note in the spec of the upcoming GGUF file format, gguf.md#why-not-other-formats, and PR ggml-org/ggml#302.
Began a super-WIP (not completely functional) attempt at this here.
@byte-6174 nice - I took a similar approach. Currently finding myself going deep in the rabbit hole of converting llama2.c tensors, which use calloc-based memory assignment (with no metadata AFAIK), to ggml tensors. Also, I don't quite understand yet why the vocab needs to be saved in the model when it is also available in an external file? In any case, I believe llama2.c is using the exact same format for vocab. I'll end this update with a few words in the voice of Yoda: "Long journey, it is. Learn, I must."
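For readers following along, a minimal sketch of the llama2.c checkpoint header, mirroring the Config struct from llama2.c's run.c; everything after this header is raw float32 weight data with no further metadata (the reader function name is illustrative):

```cpp
#include <cstdio>

// Mirrors llama2.c's Config struct: seven int32 fields at the start of
// model.bin, followed directly by the raw float32 weight arrays.
struct Llama2cConfig {
    int dim;        // transformer embedding dimension
    int hidden_dim; // FFN hidden dimension
    int n_layers;   // number of transformer layers
    int n_heads;    // number of attention heads
    int n_kv_heads; // number of key/value heads
    int vocab_size; // vocabulary size
    int seq_len;    // maximum sequence length
};

static bool read_llama2c_header(const char * path, Llama2cConfig & cfg) {
    FILE * f = std::fopen(path, "rb");
    if (!f) {
        return false;
    }
    const bool ok = std::fread(&cfg, sizeof(cfg), 1, f) == 1;
    std::fclose(f);
    return ok;
}
```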
Re: mapping - can someone with more experience with llama.cpp tensors point out how these RoPE tensors should be mapped?
Not 100% sure, but I believe these are lookup tables for the RoPE and are not necessary for llama.cpp.
Right, I looked at the llama2.c code and it's surely for RoPE; good to know it's not needed for llama.cpp. I can remove it.
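For context, a rough sketch of what those RoPE lookup tables contain under the standard RoPE formulation; llama.cpp computes the same cos/sin values on the fly inside its RoPE operator, which is why the tensors can be dropped during conversion (names here are illustrative):

```cpp
#include <cmath>
#include <vector>

// Precompute the per-position RoPE cos/sin tables that llama2.c stores in
// the checkpoint; llama.cpp derives these values at inference time, so the
// corresponding tensors do not need to be converted.
static void build_rope_tables(int seq_len, int head_size,
                              std::vector<float> & cos_tab,
                              std::vector<float> & sin_tab) {
    const int half = head_size / 2;
    cos_tab.assign((size_t) seq_len * half, 0.0f);
    sin_tab.assign((size_t) seq_len * half, 0.0f);
    for (int pos = 0; pos < seq_len; ++pos) {
        for (int i = 0; i < half; ++i) {
            const float freq  = 1.0f / std::pow(10000.0f, (2.0f * i) / head_size);
            const float angle = pos * freq;
            cos_tab[(size_t) pos * half + i] = std::cos(angle);
            sin_tab[(size_t) pos * half + i] = std::sin(angle);
        }
    }
}
```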
You can run the most basic inference using:
Got it, it runs pretty well! Gives 359 tok/sec, vs.
Hmm, one other difference for "non-English" words could be that the vocabs are not matching.
Nah, something else is wrong. First try changing the RMS norm eps to 1e-5.
The default run above has eps = 1e-6; this 👇 is with 1e-5 as you suggest:
I'm printing the first 5 elements of llama2.c's w->rms_final_weight >> vs. ggml's model->norm >>
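A hedged sketch of that kind of spot check, assuming F32 tensors whose data can be read directly (the helper name is made up):

```cpp
#include <cstdio>

#include "ggml.h"

// Print the first n values of an F32 ggml tensor next to the corresponding
// raw llama2.c float array, to verify the conversion copied the right data.
static void compare_head(const char * name, const struct ggml_tensor * t,
                         const float * reference, int n) {
    const float * converted = (const float *) t->data;
    for (int i = 0; i < n; ++i) {
        std::printf("%s[%d]: ggml = %f  llama2.c = %f\n",
                    name, i, converted[i], reference[i]);
    }
}
```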
We haven't run any F32 models with
And see if the new
Yes, still nonsense. I'm currently investigating how the FF weights look, to see if I'm making a mistake in putting them into the tensors in the right order...
Aah! I found a bug. I was not using the right multiplier to convert the 1D arrays in llama2.c.
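For anyone hitting the same thing, a minimal sketch of the offset arithmetic involved: llama2.c keeps each set of per-layer matrices back to back in one flat array, so the per-layer offset needs the full matrix size as its multiplier (the function name is illustrative):

```cpp
#include <cstring>

// Copy layer `layer` of a weight that llama2.c lays out as one flat
// (n_layers, rows, cols) float array. The bug class discussed above is
// using `rows` or `cols` alone as the multiplier instead of `rows * cols`.
static void copy_layer_matrix(float * dst, const float * flat,
                              int layer, int rows, int cols) {
    const size_t elems_per_layer = (size_t) rows * cols;
    std::memcpy(dst, flat + (size_t) layer * elems_per_layer,
                elems_per_layer * sizeof(float));
}
```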
And here with quantization:
Great! You should use a context size of 256:
Sure -
Just focusing on timing a bit; seems with
The Q8_0 generation looks broken. Either it does not have enough precision somehow, or there is still some lingering issue.
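For reference, a simplified sketch of how ggml's Q8_0 format quantizes weights (blocks of 32 values sharing one scale), which is where the precision concern comes from; the real implementation stores the scale as fp16 and lives in ggml's quantization code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Simplified Q8_0 block: 32 weights share a single scale. Rounding each
// value to an int8 multiple of that scale is where precision is lost
// relative to F16/F32 weights.
struct block_q8_0_sketch {
    float  d;       // per-block scale (fp16 in the real format)
    int8_t qs[32];  // quantized values
};

static block_q8_0_sketch quantize_q8_0_block(const float * x) {
    block_q8_0_sketch out{};
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    out.d = amax / 127.0f;
    const float id = out.d != 0.0f ? 1.0f / out.d : 0.0f;
    for (int i = 0; i < 32; ++i) {
        out.qs[i] = (int8_t) std::lround(x[i] * id);
    }
    return out;
}
```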
Hmm, you mean as far as we can judge from the output, some words look random... yes?
Also - perhaps not relevant, perhaps it is - but
The F16 output seems OK up to 256 tokens, which means it's probably not related to RoPE.
I believe it is this:
I can add a README with some instructions and a summary of our findings and send a PR.
That would be great!
The tokenizer.bin was trained by myself.
Try setting
Just sent another update that should fix some of the issues. In this conversion, we are using the vocabulary file available in the llama.cpp repository.
@saltyduckegg we are using the vocab model available in the llama.cpp repository. Please use that instead and let me know if it works for you.
Thank you for your help,
It can run indeed, but this is not what I want. It seems to have messed up the character encoding; of course, that's because this is not the encoding table I used for training.
I created a PR against #2559 to support loading the llama2.c vocabulary, which might help you, @saltyduckegg, if you created your own vocabulary: https://github.com/byte-6174/llama.cpp/pull/1/files
Cool!
Yep, something seems broken :) Can you put your model somewhere? Or run with gdb to get a stack trace or put the core somewhere to investigate?
I'm very sorry for taking so long: I found that my model file is larger than 25 MB and cannot be attached directly on GitHub, so I uploaded it to Hugging Face. This is a mini model trained on Chinese dialogue data.
I just tested it with @jrudolph's update, and for all 3 models we can optionally use the llama2.c vocab binary. I will send another update to the PR with this.
@saltyduckegg I tried running your model from Hugging Face with the llama2.c repo and it gives me the following. Are you able to get good output when you use the llama2.c repo with this model?
My result is as expected; at least some of it consists of sentences I can understand. Your result looks like output produced with an incorrect tokenizer.bin.
No, I cannot reproduce your output with your model and a stock llama2.c run. Your model seems to have some corruption.
No, it's a problem with main.cpp because it expects that it can tokenize the instruction prefix/suffix and newline, but the vocabulary does not include them (and they're also not needed for non-instruction mode). Backtrace
It works when applying this diff:

```diff
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -256,11 +256,13 @@ int main(int argc, char ** argv) {
     }
     // prefix & suffix for instruct mode
-    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
-    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+    std::vector<llama_token> inp_pfx;
+    std::vector<llama_token> inp_sfx;
     // in instruct mode, we inject a prefix and a suffix to each input by the user
     if (params.instruct) {
+        inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
+        inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
         params.interactive_first = true;
         params.antiprompt.push_back("### Instruction:\n\n");
     }
@@ -270,9 +272,6 @@
         params.interactive = true;
     }
-    // determine newline token
-    auto llama_token_newline = ::llama_tokenize(ctx, "\n", false);
-
     if (params.verbose_prompt) {
         fprintf(stderr, "\n");
         fprintf(stderr, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
```

@saltyduckegg it might still make sense to include enough tokens to represent these strings as well.
Okay, but I am running from llama2.c and I get the output above?! How do we explain that?!
Probably llama2.c is not picking up the custom tokenizer.
Right, it depends on how @saltyduckegg saved the custom tokenizer.
It seems to work for me with llama2.c if the custom tokenizer.bin is placed at the hardcoded path.
Yes! I forgot that it is hardcoded.
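For context, an illustrative paraphrase of the hardcoding being discussed: at the time, llama2.c's run.c opened a fixed "tokenizer.bin" path rather than taking it as an argument, so a custom vocabulary is only picked up if it is saved over that file in the working directory (sketch, not the exact source):

```cpp
#include <cstdio>

// Paraphrase of the relevant llama2.c behaviour: the tokenizer path is a
// fixed string, so a custom vocabulary must overwrite ./tokenizer.bin in
// the directory you run from.
static FILE * open_tokenizer() {
    return std::fopen("tokenizer.bin", "rb");
}
```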
The new llama2.c project provides means for training "baby" llama models stored in a custom binary format, with 15M and 44M models already available and more potentially coming out soon.
We should provide a simple conversion tool from the llama2.c bin format to the ggml format, so we can run inference of the models in llama.cpp.
Great task for people looking to get involved in the project