
Add support for Gemma2ForCausalLM #8156

Merged · 12 commits · Jun 28, 2024

Conversation

@pculliton (Contributor) commented Jun 27, 2024

Adds inference support for the Gemma 2 family of models. Includes support for:

  • Gemma 2 27B
  • Gemma 2 9B

Updates the Gemma architecture to include post-norm, among other features.

Created in collaboration with @abetlen and @zichuan-wei.
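For readers unfamiliar with the change: Gemma 2 applies an RMSNorm both before and after each attention and feed-forward sub-block (the "post-norm" mentioned above). Below is a minimal numpy sketch of that layer ordering; the helper and weight names are made up for illustration and this is not the llama.cpp code itself.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale activations by their reciprocal root-mean-square, then by a learned weight
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def gemma2_block(x, attn, mlp, w_pre_attn, w_post_attn, w_pre_ffn, w_post_ffn):
    # Attention sub-block: pre-norm -> attention -> post-norm -> residual add
    h = x + rms_norm(attn(rms_norm(x, w_pre_attn)), w_post_attn)
    # Feed-forward sub-block: pre-norm -> MLP -> post-norm -> residual add
    return h + rms_norm(mlp(rms_norm(h, w_pre_ffn)), w_post_ffn)

# Tiny smoke test with identity sub-layers and unit norm weights
d = 4
x = np.random.randn(2, d)
ones = np.ones(d)
print(gemma2_block(x, lambda t: t, lambda t: t, ones, ones, ones, ones).shape)  # (2, 4)
```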

github-actions bot added the python (python script changes) label Jun 27, 2024
@mofosyne added the Review Complexity : Medium label Jun 27, 2024
@abetlen changed the title from "Inference support for Gemma 2 model family" to "Add support for Gemma2ForCausalLM" Jun 27, 2024
@fizzAI commented Jun 27, 2024

It seems like tokenizing special tokens is broken, at least with the currently existing quants (the screenshot was from Gemma 2 9B; I haven't tried the 27B, but I assume it has similar problems).

@pabl-o-ce

we need a merge in here!

@N8python

Indeed. Gemma 2 is awesome.

@bartowski1182 (Contributor)

In case anyone else comes in here ready to merge: there still needs to be some kind of fix for the tokenizer first. Hopefully the smart people are working on it!

@qnixsynapse (Contributor) commented Jun 28, 2024

Just for information, I went ahead and quantized the official GGUF that Google provided, which succeeded. However, in the GGUF metadata I am not seeing any mention of eot_token_ids. This might cause problems (I'm currently downloading it for testing).
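In case it helps others check the same thing, here is a rough sketch of how the token-id metadata in a converted file can be inspected with the gguf Python package (gguf-py, shipped in the llama.cpp repo); the model path is just a placeholder and the exact field names depend on the converter used.

```python
from gguf import GGUFReader  # gguf-py package from the llama.cpp repo

reader = GGUFReader("models/gemma-2-9b-it/ggml-model-Q4_K.gguf")  # placeholder path

# Print every metadata key that mentions a token id (bos/eos/eot and friends)
for name in reader.fields:
    if "token_id" in name:
        print(name)
```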


The Hugging Face implementation is broken for some reason. The model in Google AI Studio gives better generations than HF Chat, for example.

Anyways, thank you for your hard work!

@bartowski1182 (Contributor)

@qnixsynapse I used the official Google GGUFs as well, and they still have the tokenization issue.

@abetlen (Collaborator) commented Jun 28, 2024


The tokenizer should now match the HF implementation.
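One quick way to sanity-check this, sketched with the llama-cpp-python bindings and Hugging Face transformers; the model path and exact keyword arguments are assumptions, so adjust them to your setup.

```python
from llama_cpp import Llama          # llama-cpp-python bindings
from transformers import AutoTokenizer

text = "<start_of_turn>user\nHello<end_of_turn>\n"

# llama.cpp tokenization (vocab_only skips loading the weights)
llm = Llama(model_path="models/gemma-2-9b-it/ggml-model-Q4_K.gguf", vocab_only=True)
cpp_ids = llm.tokenize(text.encode("utf-8"), add_bos=False, special=True)

# Hugging Face reference tokenization
hf_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
hf_ids = hf_tok.encode(text, add_special_tokens=False)

print(cpp_ids)
print(hf_ids)  # should match cpp_ids if special tokens are parsed as single tokens
```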

@qnixsynapse (Contributor) commented Jun 28, 2024

Yup... the tokenizer is broken in the official GGUF as well. :(

Also, please note: the HF implementation seems broken as well. The model doesn't stop generating, possibly because it doesn't stop at the <eot> token (which is different from <end_of_turn>), and it often repeats sentences.

Update: the llama.cpp tokenizer issue has been fixed and the 9B model is working as intended. The only issue is that it is very large for my GPU.

abetlen and others added 3 commits June 27, 2024 17:37
@slaren (Collaborator) commented Jun 28, 2024

I have tried converting the 9B base and it (instruction-tuned) models from the HF safetensors files. The it model seems to be working as expected: the tokenization looks good and the chat template seems to work correctly. However, the base model has very high perplexity and the generation doesn't look very good. Since the it model is working, I am not sure whether this is really a problem with this PR or with the model itself.

gemma-2-9b:

$ ./llama-perplexity -f wikitext-2-raw/wiki.test.raw -m models/gemma-2-9b/ggml-model-Q4_K.gguf -ngl 99 --chunks 100
[1]1861.6620,[2]2170.0432,[3]1159.5665,[4]1692.6996,[5]2890.0847,[6]4647.2499,[7]6954.2845,[8]9174.6434,[9]13352.0411,[10]18267.1633,[11]16231.0406,[12]16017.4615,[13]19076.9448,[14]14166.7818,[15]13624.4592,[16]13272.0699,[17]9863.4886,[18]11315.7678,[19]9966.8929,[20]9129.9468,[21]8404.1450,[22]8411.3216,[23]6861.7979,[24]6156.1019,[25]5841.8660,[26]4969.6577,[27]4820.8850,[28]4757.2740,[29]4507.6371,[30]4687.7843,[31]4700.1261,[32]4698.0268,[33]4692.1118,[34]5039.9443,[35]4871.2378,[36]5441.1583,[37]5923.1857,[38]5923.1173,[39]6167.4402,[40]6322.1162,[41]6102.4051,[42]6160.2985,[43]6244.5577,[44]5904.3536,[45]5730.4355,[46]5699.8088,[47]6038.5946,[48]6111.5388,[49]6128.9299,[50]6493.5552,[51]6568.1416,[52]6929.8189,[53]7311.9868,[54]7174.5570,[55]7633.7870,[56]7526.0812,[57]7696.2383,[58]7995.5154,[59]8333.5241,[60]8449.5779,[61]8552.8836,[62]9314.7543,[63]9903.9159,[64]10409.7262,[65]10709.2410,[66]11416.1324,[67]11587.1394,[68]11796.1724,[69]11732.5142,[70]12063.6012,[71]12337.9958,[72]13077.1109,[73]12830.0949,[74]12399.4043,[75]12377.9508,[76]12336.7472,[77]11905.8733,[78]11030.7014,[79]11018.3322,[80]10680.8284,[81]10765.6267,[82]10579.9094,[83]10406.6184,[84]10669.7679,[85]10968.2154,[86]11200.0454,[87]11140.6993,[88]11130.7660,[89]10841.0465,[90]10768.5915,[91]10664.8655,[92]10821.2156,[93]10716.0496,[94]10709.4149,[95]10813.4519,[96]10757.6222,[97]10734.2429,[98]10936.2391,[99]11298.2407,[100]11540.1889,
Final estimate: PPL = 11540.1889 +/- 1174.62311

gemma-2-9b-it:

$ ./llama-perplexity -f wikitext-2-raw/wiki.test.raw -m models/gemma-2-9b-it/ggml-model-Q4_K.gguf -ngl 99 --chunks 100
[1]17.6105,[2]19.1613,[3]16.0821,[4]16.5763,[5]17.1408,[6]17.6601,[7]18.0250,[8]19.1484,[9]21.1379,[10]23.3855,[11]22.7667,[12]23.7827,[13]25.2157,[14]23.1714,[15]21.7614,[16]21.7440,[17]19.9776,[18]20.5136,[19]20.0096,[20]19.8400,[21]19.4341,[22]19.1846,[23]18.0286,[24]17.2184,[25]16.7988,[26]15.9489,[27]15.7145,[28]15.6195,[29]15.3140,[30]15.4763,[31]15.3822,[32]15.4596,[33]15.5532,[34]15.8476,[35]15.8289,[36]16.0870,[37]16.2538,[38]16.0101,[39]16.0826,[40]16.0872,[41]15.7769,[42]15.7727,[43]15.8494,[44]15.6142,[45]15.4457,[46]15.3842,[47]15.5959,[48]15.6520,[49]15.7287,[50]15.8375,[51]15.8280,[52]15.8684,[53]16.0357,[54]15.9402,[55]16.0614,[56]16.0028,[57]15.9716,[58]16.1636,[59]16.2659,[60]16.3426,[61]16.3141,[62]16.4511,[63]16.6063,[64]16.8268,[65]17.0608,[66]17.2731,[67]17.1275,[68]17.1445,[69]17.1195,[70]17.1854,[71]17.3439,[72]17.4241,[73]17.4784,[74]17.3750,[75]17.3672,[76]17.3783,[77]17.4313,[78]17.2628,[79]17.3120,[80]17.1772,[81]17.2784,[82]17.2347,[83]17.2342,[84]17.3882,[85]17.6236,[86]17.7678,[87]17.8289,[88]17.7847,[89]17.6697,[90]17.6861,[91]17.6671,[92]17.8555,[93]17.9145,[94]17.9591,[95]18.0293,[96]18.1043,[97]18.0753,[98]18.0881,[99]18.2801,[100]18.3496,
Final estimate: PPL = 18.3496 +/- 0.46433
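For readers comparing these numbers: the perplexity reported by llama-perplexity is just the exponential of the average per-token negative log-likelihood over the evaluated chunks. A tiny sketch of the relationship (not the actual implementation):

```python
import numpy as np

def perplexity(token_logprobs):
    # Perplexity is exp of the average per-token negative log-likelihood
    return float(np.exp(-np.mean(token_logprobs)))

# An average log-probability of about -2.91 per token corresponds to PPL ~ 18.4
print(perplexity(np.full(1000, -2.91)))
```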

@slaren (Collaborator) left a comment


Since the it model seems to be working, it may be ok to merge this now.

@qnixsynapse (Contributor)

A perplexity of 18? That doesn't seem normal for a Q4 9B-parameter model; Llama 3 8B has ~6.7. I think we should hold off a bit.

@slaren (Collaborator) commented Jun 28, 2024

It's normal for an instruction-tuned model.

@qnixsynapse (Contributor)

Llama 3 8B instruction-tuned has something like 6.8-7.1, which I tested a while ago with the same quant.

@qnixsynapse (Contributor)

This looks good for now, but the perplexity is still high.

@ddh0 (Contributor) commented Jun 28, 2024

> This looks good for now, but the perplexity is still high.

Are you sure you're using the right prompt format in that interactive session? It looks like there are increasing numbers of newlines after each of the model's responses (2, then 3, then what looks like 5).

@qnixsynapse (Contributor)

@ddh0 Those newlines are output by the model, and yes, I am using the correct prompt format.
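For anyone double-checking their setup, the Gemma instruction-tuned prompt format, as I understand it from the model card, looks roughly like the template below; the angle-bracket tokens are literal special tokens and the braces are placeholders.

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
{model response}<end_of_turn>
```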

@abetlen merged commit e57dc62 into ggerganov:master on Jun 28, 2024 (53 of 54 checks passed)
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jun 28, 2024
* Inference support for Gemma 2 model family

* Update convert-hf-to-gguf.py, constants, and tensor mappings

* cleanup

* format fix

* Fix special token vocab bug

* Don't add space prefix

* fix deleted lines

* Update src/llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Add model type names

* Add control vector

* Fix model type identification

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
@koth commented Jun 28, 2024

Gemma 2's logit soft-capping should also be added for 27B inference; otherwise the output will be useless.

@bartowski1182 (Contributor)

Not useless, but definitely not max quality. They say the difference is small in their report, but there could be some downstream tasks that are more affected than they expect.

@qnixsynapse (Contributor) commented Jun 28, 2024

@koth That is only for training. It is a type of regularization, so that the logits do not exceed a certain value. It has been removed because it is incompatible with the current implementation of flash attention.

Update: wow, it looks like it really is needed for the 27B (huggingface/transformers#31698).
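For clarity, logit soft-capping squashes the logits through a scaled tanh so they can never exceed the cap. A minimal sketch follows; the cap values are the ones published in the Gemma 2 config (50.0 on the attention logits, 30.0 on the final output logits) and are shown purely for illustration, not taken from this PR.

```python
import numpy as np

def soft_cap(logits, cap):
    # Smoothly bound logits to (-cap, cap) instead of letting them grow unbounded
    return cap * np.tanh(logits / cap)

logits = np.array([-120.0, -10.0, 0.0, 10.0, 120.0])
print(soft_cap(logits, 50.0))  # attention-logit cap
print(soft_cap(logits, 30.0))  # final-logit cap
```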

@Vaibhavs10 (Collaborator)

Yeah! VB from HF here. Without soft-capping, we found that the 27B would overgenerate and mostly produce incoherent text. This is especially true for the 27B; unfortunately it means that FA2 won't be compatible :/

@N8python

I can confirm that the 27B is very drunk right now.

@N8python

Example Interaction:

User:
list ten types of fruit

Assistant:
Here are ten types of fruit:

Apple

Banana

Orange

Strawberry

Blueberry

Grapefruit

Strawberry

Grapefruit

Strawberry

Grapefruit

Let me know if you'd like a list of ten different types! I can give you a list of unique fruits, or maybe you have a specific type in mind?

@foldl (Contributor) commented Jun 29, 2024

It seems that interleaved SWA and full attention have not been implemented yet. Right?
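For context, "interleaved SWA" means the layers alternate between a local sliding-window attention mask and full (global) causal attention; the published Gemma 2 config uses a 4096-token window. A rough numpy sketch of the two mask types, purely illustrative (the alternation pattern and window size here are assumptions, not the llama.cpp code):

```python
import numpy as np

def causal_mask(n_tokens, window=None):
    # True where query position i may attend to key position j
    q = np.arange(n_tokens)[:, None]
    k = np.arange(n_tokens)[None, :]
    mask = k <= q                   # standard causal mask (full attention)
    if window is not None:
        mask &= (q - k) < window    # additionally restrict to a sliding window
    return mask

n = 8
for layer in range(4):
    # illustrative alternation: even layers local sliding-window, odd layers global
    local = (layer % 2 == 0)
    print(f"layer {layer}:", "sliding window" if local else "global")
    print(causal_mask(n, window=4 if local else None).astype(int))
```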

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
@werruww commented Jun 30, 2024

> $ ./llama-perplexity -f wikitext-2-raw/wiki.test.raw -m models/gemma-2-9b-it/ggml-model-Q4_K.gguf -ngl 99 --chunks 100

How do I run this command? Gemma 2 9B does not run for me with llama.cpp, but it does run in Ollama and LM Studio.

MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024