Try whether OpenLLaMa works #1291
Sounds good. I see inference tests in the CI coming.

(for 200bt)
The fact that this new model works here is great as it means that we can move beyond the leaked LLaMA into a truly open model while keeping the existing GGML code and applications. @Green-Sky were any changes to llama.cpp/convert.py required to get this model to load?
Ok, so perplexity seems to spiral out of control. Something must be wrong. @eiery no, no changes were needed, but I made sure to call convert.py from the model subdirectory so that it picks up the correct tokenizer.
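For reference, here is a rough sketch of that workflow. The paths, file names, and the quantize step are illustrative, not the exact commands used:

```bash
# Run the conversion from inside the model directory so convert.py
# picks up the tokenizer.model sitting next to the weights.
cd models/open_llama_7b_preview_300bt
python3 ../../convert.py . --outfile ../open_llama_7b-f16.bin

# Optionally quantize the f16 output (run from the llama.cpp root;
# exact quantize arguments may differ between versions).
cd ../..
./quantize models/open_llama_7b-f16.bin models/open_llama_7b-q5_1.bin q5_1
```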
@Green-Sky Should be fixed by bf4b22f
@ggerganov I was on bf4b22f
it was segfaulting before 😄
It seems to break down after resetting the context.
The OpenLLaMA generation fails when the prompt does not start with the BOS token. Does anyone know if OpenLLaMA's behavior is correct? LLaMA (vanilla) does not seem to have this problem.
But why waste an extra slot in context?
I guess it makes the generation more accurate. I think it depends on how the training was performed. The good news is that the fix is trivial.
FWIW, this set of instructions worked for me on a Windows 11 machine; they did not work on an Intel Mac:
And these worked for me to convert/quantize the newly released 300bt model:
The results are slightly better but still pretty bad. We need RLHF / Alpaca-style fine-tuning on top of OpenLLaMA.
EDIT: never mind, I didn't see that the perplexity is lower with this.
I'd appreciate it if you could share the converted GGML model and show how it works in a Colab notebook. Thank you.
edit: solved this. git-lfs wasn't installed; this is a new WSL distro I set up and I forgot about that. It seems like a good idea to check the file size or contents and throw a more verbose warning in that case.
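A minimal sketch of such a check, assuming the usual git-lfs pointer format (hypothetical, not something convert.py does today):

```bash
# An un-pulled git-lfs pointer is ~130 bytes of text naming the real object,
# while real weight files are gigabytes. Flag any suspicious files.
for f in *.bin *.pth *.safetensors; do
  [ -e "$f" ] || continue
  if head -c 60 "$f" | grep -q "git-lfs.github.com/spec"; then
    echo "warning: $f looks like a git-lfs pointer, run 'git lfs pull'" >&2
  fi
done

# And/or verify a published checksum after downloading.
sha256sum open_llama_7b-q5_1.bin
```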
Uploaded the ggml weights to huggingface https://huggingface.co/vihangd/open_llama_7b_300bt_ggml
Confirmed the sha256 for the q5_1, so it's probably good.
https://github.com/openlm-research/open_llama#update-05152023

Update 05/15/2023: After receiving feedback from the community, we discovered that the tokenizer of our previous checkpoint release was configured incorrectly so that new lines are not preserved. To fix this problem, we have retrained our tokenizer and restarted the model training. We've also observed lower training loss with this new tokenizer.

They have also released new previews for the 3B and 7B variants.
Uploaded the quantized weights for the latest 7B 400bt variant at https://huggingface.co/vihangd/open_llama_7b_400bt_ggml
If we get good results here, please consider using OpenLLaMA as an official recommended model for llama.cpp that can be publicly shared. For development we don't need the latest and greatest model; we need something which is compatible with LLaMA and can be used for running regressions and the like. I'm not sure how consistent the GitHub CI system is in terms of performance, but having new PRs actually run a real-world test on this model might be useful in the long run.
The 7B is still too large for CI, but the 3B... maybe. With an Actions cache it looks kinda attractive...
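To illustrate, a CI step could look roughly like this; the download URL is a placeholder and the caching details are assumptions:

```bash
# Fetch a small quantized 3B model once and keep it in the Actions cache.
MODEL=open_llama_3b-q5_1.bin
if [ ! -f "$MODEL" ]; then
  curl -L -o "$MODEL" "https://huggingface.co/<org>/<repo>/resolve/main/$MODEL"
fi

# Smoke test: generate a few tokens and fail the job on a non-zero exit code.
./main -m "$MODEL" -p "Building a website can be done in 10 simple steps:" -n 32
```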
Thanks. It works!
The 3B is a new variant of the LLaMA architecture, so you need to modify the code to make it work.
@ggerganov the n_ff calculated here (Line 900 in 2b26469) doesn't match: the model file uses 8640, while that formula calculates 8704. Also, running perplexity on the 3B is very bad, so there might be more changes needed, or @young-geng and team did an oopsy again.
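To make the mismatch concrete, here is that formula evaluated by hand, assuming the 3B hyperparameters n_embd = 3200 and n_mult = 256 (my reading of the converted model, so treat them as assumptions):

```bash
# n_ff as llama.cpp computes it:
#   n_ff = ((2*(4*n_embd)/3 + n_mult - 1) / n_mult) * n_mult
n_embd=3200; n_mult=256
echo $(( ( (2*(4*n_embd)/3 + n_mult - 1) / n_mult ) * n_mult ))  # prints 8704
# ...while the OpenLLaMA 3B model file stores n_ff = 8640.
```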
Should
For your information, you may refer to the following similar works:

By the way, I'd appreciate it if you could share the link of the Hugging Face repo here. Thanks.
According to the LLaMA paper, downstream task performance keeps improving even at 1T tokens (figure above), so I think the gap should keep closing as training continues.
Yes indeed. Simply fitting the data from figure 2 to curves gives a clearer picture. But at some point the accuracy will stop increasing, depending on the model size. Maybe OpenLLaMA will continue training the models until that point is found.
Now that support for 3B models is available, I tested it using the latest release files and the 3B weights. Runs:

1. Pure CPU
2. 2 layers to IGP
3. 12 layers to IGP
4. All layers in IGP (llama_print_timings: load time = 8076.56 ms)
5. All layers in IGP with a prompt requiring a longer response

The first 3 runs had the answer, followed by three different random texts with plenty of hallucinations. In test 4 I was surprised by its on-point answer and tried a different prompt. The prompts are probably not in the correct format, but the relative performance degradation can be seen. Based on these rather unscientific tests, I am going to use CPU-only inference. Though the IGP can access up to 2 GB of RAM, it makes no difference; going through the IGP slows things down on my laptop.
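For anyone wanting to repeat this kind of comparison, the number of offloaded layers is controlled with `-ngl`; the model path and prompt below are placeholders, and a GPU-enabled build (cuBLAS/CLBlast) is assumed:

```bash
# Pure CPU run.
./main -m open_llama_3b-q5_1.bin -p "Q: What is the capital of France? A:" -n 64

# Offload 12 layers to the GPU/IGP.
./main -m open_llama_3b-q5_1.bin -ngl 12 -p "Q: What is the capital of France? A:" -n 64
```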
I have ggml versions:
All uploaded now.
With 3B and 7B released, it would be nice for someone with a beefy machine to get perplexity results for the most popular quants.
Perplexity on wiki.test.raw:

openllama-3b-q5_1: 7.84273862

Remember, perplexity is a measure of how "unsure" the model is at predicting the text in the specified file. This is fine when comparing a model with itself, like different quantization formats. A better measure if you want to compare one model with another would be the Language Model Evaluation Harness. See the Open LLM Leaderboard.
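For reference, numbers like these come from the perplexity tool; a typical invocation looks something like the following, with the model path as a placeholder (the exact context size used above is unknown):

```bash
# wiki.test.raw is the raw WikiText-2 test split.
./perplexity -m open_llama_3b-q5_1.bin -f wiki.test.raw
```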
That was my intention all along, as I wanted to see how well the model quantizes against the f16 baseline. Ideally results should be similar to the original LLaMA, but you don't know until you try...
OK, I did a perplexity run of the new 3B; you can see how it compares to the last one.
Some more comparative perplexity analysis done by @gjmulder: https://github.com/openlm-research/open_llama/discussions/41
7B run done:
Hi there, wonderful work! Has this:

landed to master? I am still having the issue #1291 (comment) when doing the conversion/quantization myself from master (ae9663f) and from HF/openlm-research 3B (q8_0), while the model you posted on HF works for me. A missing backport?
@raffienficiaud we merged the 3B changes without the Python conversion script changes. @SlyEcho is there an open PR with the hacky Python changes?
@SlyEcho I'm trying out the F16 7B model but I'm getting not-so-good output. I'm using ctransformers. May I know what values you used for the config?
There is a diff file there where the hacks are used. The whole conversion workflow should be possible to do without many dependencies, just using the Makefile there.
No, I don't think so. I see a couple of options to fix it:
I didn't create the model, so I don't really know, but it may have something to do with the tokenizer:
Re: weird outputs, OpenLLaMA seems to have extra dropout layers in attention and feed-forward (here's something I hacked on tinygrad to make it work). And potentially some versions have an extra layernorm after the embedding layer (see HF's OpenLlamaModel and how it differs from their LlamaModel).
@ochafik Those dropouts are never used during the pre-training of the model, so I believe that they can be safely ignored. The corresponding model in transformers should be the standard LLaMA instead of Open-Llama.
@young-geng ahhh, now it makes sense, thank you!
New OpenLLaMA just dropped: https://huggingface.co/openlm-research/open_llama_7b_v2
I have ggml files for v2: https://huggingface.co/SlyEcho/open_llama_7b_v2_ggml/tree/main
Version 2 of the 3B OpenLLaMA model: https://huggingface.co/openlm-research/open_llama_3b_v2
Uploading 3Bv2: https://huggingface.co/SlyEcho/open_llama_3b_v2_ggml
This issue was closed because it has been inactive for 14 days since being marked as stale.
... or whether we need to tweak some settings
GitHub: https://github.com/openlm-research/open_llama
HuggingFace: https://huggingface.co/openlm-research/open_llama_7b_preview_300bt
edit: GGML models uploaded to HF by @vihangd => https://huggingface.co/vihangd/open_llama_7b_300bt_ggml