Add support for Gemma2ForCausalLM #8156
Conversation
we need a merge in here! |
Indeed. Gemma 2 is awesome. |
In case anyone else comes in here ready to merge it: there still needs to be some kind of fix for the tokenizer. Hopefully the smart people are working on it! |
Just for information: I went ahead and quantized the official GGUF that Google provided, which ended up being a success, though something in the GGUF metadata seems off. The Hugging Face implementation is broken for some reason; the model in Google AI Studio gives better generations than HF Chat, for example. Anyway, thank you for your hard work! |
@qnixsynapse i used the official google GGUFs as well and they still have the tokenization issue |
Yup, the tokenizer is broken in the official GGUF as well. :( Also, please note: the HF implementation seems broken too. The model doesn't stop generating, possibly because it doesn't stop at the expected stop token.
Update: the llama.cpp tokenizer issue has been fixed and the 9B model is working as intended. The only issue is that it is very large for my GPU. |
I have tried converting the 9B base and it models from the HF safetensors files. The it model seems to be working as expected: the tokenization looks good and the chat template seems to work correctly. However, the base model has very high perplexity and the generation doesn't look very good. Since the it model is working, I am not sure if this is really a problem with this PR or with the model itself.
gemma-2-9b:
gemma-2-9b-it:
|
Since the it model seems to be working, it may be ok to merge this now.
Perplexity of 18? That doesn't seem normal for a Q4 9B-parameter model; Llama 3 8B has ~6.7. I think we should hold on a bit. |
It's normal for an instruction tuned model. |
llama-3-8B instruction tuned is around 6.8-7.1, which I tested a while ago, at the same quant. |
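For readers comparing these numbers, perplexity is just the exponential of the average negative log-likelihood per token, so a value of ~18 roughly means the model is as uncertain as a uniform choice over 18 tokens at each step. A minimal sketch (toy log-probabilities, not real measurements):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # Perplexity is exp of the average negative log-likelihood per token.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example with made-up per-token log-probabilities:
print(perplexity([-2.9, -2.9, -2.9]))  # ~18.2
```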
Are you sure you're using the right prompt format in that interactive session? It looks like there are increasing newlines after each of the model's responses. (2, then 3, then looks like 5) |
@ddh0 Those newlines are outputted by the model and yes I am using the correct prompt format. |
* Inference support for Gemma 2 model family
* Update convert-hf-to-gguf.py, constants, and tensor mappings
* cleanup
* format fix
* Fix special token vocab bug
* Don't add space prefix
* fix deleted lines
* Update src/llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Add model type names
* Add control vector
* Fix model type identification

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Gemma 2's logit soft-capping should also be added for 27B inference; otherwise the output will be useless. |
Not useless, but definitely not max quality. They say the difference is small in their report, but there could be some downstream tasks that are more affected than they expect. |
@koth That is only for training. It is a type of regularization, so that the logits do not cross a certain value. It has been removed because it is incompatible with the current implementation of flash attention.
Update: WOW, looks like it is really needed in 27B (huggingface/transformers#31698) |
Yeah! VB from HF here. Without soft-capping, we found that the 27B would overgenerate and mostly produce incoherent text. |
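For readers unfamiliar with the mechanism, soft-capping just squashes logits through a scaled tanh so they cannot grow without bound. A minimal NumPy sketch; the cap values (50.0 for attention logits, 30.0 for final logits) follow the published Gemma 2 configuration and should be treated as assumptions for any particular checkpoint:

```python
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    # Squash logits smoothly into (-cap, cap) with a scaled tanh.
    return cap * np.tanh(logits / cap)

# Dummy tensors standing in for real activations:
attn_scores = np.random.randn(8, 8) * 100.0   # attention logits before softmax
final_logits = np.random.randn(1, 256000)     # lm_head output before sampling

attn_scores = soft_cap(attn_scores, cap=50.0)    # attn_logit_softcapping (assumed value)
final_logits = soft_cap(final_logits, cap=30.0)  # final_logit_softcapping (assumed value)
```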
I can confirm that the 27B is very drunk right now. |
Example interaction:
User:
Assistant: Apple Banana Orange Strawberry Blueberry Grapefruit Strawberry Grapefruit Strawberry Grapefruit Let me know if you'd like a list of ten different types! I can give you a list of unique fruits, or maybe you have a specific type in mind? |
It seems that interleaved SWA and full attention have not been implemented. Right? |
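For context, Gemma 2 alternates sliding-window attention and full (global) attention across layers. A rough NumPy sketch of the masking this implies; the 4096-token window and the even/odd layer assignment are assumptions based on the model's published description, not this PR's code:

```python
import numpy as np

def gemma2_mask(n_tokens: int, layer_idx: int, window: int = 4096) -> np.ndarray:
    # Start from a causal mask; on "local" layers, each query may only attend
    # to the last `window` positions. Which layers are local vs. global is
    # assumed here to alternate by parity.
    mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
    if layer_idx % 2 == 0:  # assumed: even layers use the sliding window
        for q in range(n_tokens):
            mask[q, : max(0, q - window + 1)] = False
    return mask

# With a toy window of 4, the last query row differs between layer types:
print(gemma2_mask(8, layer_idx=0, window=4)[-1])  # only the last 4 positions visible
print(gemma2_mask(8, layer_idx=1, window=4)[-1])  # full causal prefix visible
```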
$ ./llama-perplexity -f wikitext-2-raw/wiki.test.raw -m models/gemma-2-9b-it/ggml-model-Q4_K.gguf -ngl 99 --chunks 100
How do I run this? |
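Roughly speaking: llama-perplexity is one of the binaries built from the llama.cpp repository. In the command above, -f points at the raw evaluation text (here the WikiText-2 test split), -m selects the GGUF model file, -ngl 99 offloads up to 99 layers to the GPU, and --chunks 100 limits the evaluation to the first 100 chunks. You would first convert or download the model as a GGUF file and then run the command from the directory where the binaries were built.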
Adds inference support for the Gemma 2 family of models. Includes support for:
Updates the Gemma architecture to include post-norm, among other features (a rough sketch of the block structure follows below).
Created in collaboration with @abetlen and @zichuan-wei.
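As an illustration of the post-norm change mentioned above: in Gemma 2, each sub-layer output passes through an additional RMSNorm before being added back to the residual stream, on top of the usual pre-norm. A minimal NumPy sketch under that assumption; the learned norm weights are omitted and attn/mlp are hypothetical stand-ins for the real sub-layers:

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Simplified RMSNorm without the learned scale.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def gemma2_block(x: np.ndarray, attn, mlp) -> np.ndarray:
    # Each sub-layer is normalized on the way in (pre-norm) and again on the
    # way out (post-norm) before the residual addition.
    x = x + rms_norm(attn(rms_norm(x)))
    x = x + rms_norm(mlp(rms_norm(x)))
    return x

# Identity stand-ins for the real sub-layers, just to show the data flow:
h = gemma2_block(np.random.randn(4, 16), attn=lambda t: t, mlp=lambda t: t)
print(h.shape)  # (4, 16)
```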