
Add support for Gemma2ForCausalLM #8156

Merged · 12 commits · Jun 28, 2024

Conversation

@pculliton (Contributor) commented Jun 27, 2024

Adds inference support for the Gemma 2 family of models. Includes support for:

  • Gemma 2 27B
  • Gemma 2 9B

Updates the Gemma architecture to include post-norm, among other features.

Created in collaboration with @abetlen and @zichuan-wei.
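For readers unfamiliar with the change: Gemma 2 applies an RMSNorm both before and after each attention and feed-forward sub-block (the "post-norm" mentioned above). Below is a minimal numpy sketch of that layer ordering; the helper and weight names are made up for illustration and this is not the llama.cpp code itself.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale activations by their reciprocal root-mean-square, then by a learned weight
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def gemma2_block(x, attn, mlp, w_pre_attn, w_post_attn, w_pre_ffn, w_post_ffn):
    # Attention sub-block: pre-norm -> attention -> post-norm -> residual add
    h = x + rms_norm(attn(rms_norm(x, w_pre_attn)), w_post_attn)
    # Feed-forward sub-block: pre-norm -> MLP -> post-norm -> residual add
    return h + rms_norm(mlp(rms_norm(h, w_pre_ffn)), w_post_ffn)

# Tiny smoke test with identity sub-layers and unit norm weights
d = 4
x = np.random.randn(2, d)
ones = np.ones(d)
print(gemma2_block(x, lambda t: t, lambda t: t, ones, ones, ones, ones).shape)  # (2, 4)
```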

github-actions bot added the python (python script changes) label Jun 27, 2024
@mofosyne added the Review Complexity : Medium label Jun 27, 2024
@abetlen changed the title from "Inference support for Gemma 2 model family" to "Add support for Gemma2ForCausalLM" Jun 27, 2024
@fizzAI commented Jun 27, 2024

It seems like tokenizing special tokens is broken, at least with the currently existing quants (the screenshot was from Gemma 2 9B; I haven't tried the 27B, but I assume it has similar problems).

@pabl-o-ce

we need a merge in here!

@N8python

Indeed. Gemma 2 is awesome.

@bartowski1182 (Contributor)

In case anyone else comes in here ready to merge: there still needs to be some kind of fix for the tokenizer first. Hopefully the smart people are working on it!

@qnixsynapse (Contributor) commented Jun 28, 2024

Just for information, I went ahead and quantized the official GGUF that Google provided, which succeeded. However, in the GGUF metadata I am not seeing any mention of eot_token_ids. This might cause problems (I'm currently downloading it for testing).
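In case it helps others check the same thing, here is a rough sketch of how the token-id metadata in a converted file can be inspected with the gguf Python package (gguf-py, shipped in the llama.cpp repo); the model path is just a placeholder and the exact field names depend on the converter used.

```python
from gguf import GGUFReader  # gguf-py package from the llama.cpp repo

reader = GGUFReader("models/gemma-2-9b-it/ggml-model-Q4_K.gguf")  # placeholder path

# Print every metadata key that mentions a token id (bos/eos/eot and friends)
for name in reader.fields:
    if "token_id" in name:
        print(name)
```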


The Hugging Face implementation is broken for some reason. The model in Google AI Studio gives better generations than HF Chat, for example.

Anyways, thank you for your hard work!

@bartowski1182 (Contributor)

@qnixsynapse I used the official Google GGUFs as well, and they still have the tokenization issue.

@abetlen (Collaborator) commented Jun 28, 2024


The tokenizer should now match the HF implementation.
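One quick way to sanity-check this, sketched with the llama-cpp-python bindings and Hugging Face transformers; the model path and exact keyword arguments are assumptions, so adjust them to your setup.

```python
from llama_cpp import Llama          # llama-cpp-python bindings
from transformers import AutoTokenizer

text = "<start_of_turn>user\nHello<end_of_turn>\n"

# llama.cpp tokenization (vocab_only skips loading the weights)
llm = Llama(model_path="models/gemma-2-9b-it/ggml-model-Q4_K.gguf", vocab_only=True)
cpp_ids = llm.tokenize(text.encode("utf-8"), add_bos=False, special=True)

# Hugging Face reference tokenization
hf_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
hf_ids = hf_tok.encode(text, add_special_tokens=False)

print(cpp_ids)
print(hf_ids)  # should match cpp_ids if special tokens are parsed as single tokens
```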

@qnixsynapse (Contributor) commented Jun 28, 2024

Yup... the tokenizer is broken in the official GGUF as well. :(

Also, please note: the HF implementation seems broken as well. The model doesn't stop generating, possibly because it doesn't stop at the <eot> token (which is different from <end_of_turn>), and it often repeats sentences.

Update: the llama.cpp tokenizer issue has been fixed and the 9B model is working as intended. The only issue is that it is very large for my GPU.

abetlen and others added 3 commits June 27, 2024 17:37
@slaren (Collaborator) commented Jun 28, 2024

I have tried converting the 9B base and it (instruction-tuned) models from the HF safetensors files. The it model seems to be working as expected: the tokenization looks good and the chat template seems to work correctly. However, the base model has very high perplexity and the generation doesn't look very good. Since the it model is working, I am not sure whether this is really a problem with this PR or with the model itself.

gemma-2-9b:

$ ./llama-perplexity -f wikitext-2-raw/wiki.test.raw -m models/gemma-2-9b/ggml-model-Q4_K.gguf -ngl 99 --chunks 100
[1]1861.6620,[2]2170.0432,[3]1159.5665,[4]1692.6996,[5]2890.0847,[6]4647.2499,[7]6954.2845,[8]9174.6434,[9]13352.0411,[10]18267.1633,[11]16231.0406,[12]16017.4615,[13]19076.9448,[14]14166.7818,[15]13624.4592,[16]13272.0699,[17]9863.4886,[18]11315.7678,[19]9966.8929,[20]9129.9468,[21]8404.1450,[22]8411.3216,[23]6861.7979,[24]6156.1019,[25]5841.8660,[26]4969.6577,[27]4820.8850,[28]4757.2740,[29]4507.6371,[30]4687.7843,[31]4700.1261,[32]4698.0268,[33]4692.1118,[34]5039.9443,[35]4871.2378,[36]5441.1583,[37]5923.1857,[38]5923.1173,[39]6167.4402,[40]6322.1162,[41]6102.4051,[42]6160.2985,[43]6244.5577,[44]5904.3536,[45]5730.4355,[46]5699.8088,[47]6038.5946,[48]6111.5388,[49]6128.9299,[50]6493.5552,[51]6568.1416,[52]6929.8189,[53]7311.9868,[54]7174.5570,[55]7633.7870,[56]7526.0812,[57]7696.2383,[58]7995.5154,[59]8333.5241,[60]8449.5779,[61]8552.8836,[62]9314.7543,[63]9903.9159,[64]10409.7262,[65]10709.2410,[66]11416.1324,[67]11587.1394,[68]11796.1724,[69]11732.5142,[70]12063.6012,[71]12337.9958,[72]13077.1109,[73]12830.0949,[74]12399.4043,[75]12377.9508,[76]12336.7472,[77]11905.8733,[78]11030.7014,[79]11018.3322,[80]10680.8284,[81]10765.6267,[82]10579.9094,[83]10406.6184,[84]10669.7679,[85]10968.2154,[86]11200.0454,[87]11140.6993,[88]11130.7660,[89]10841.0465,[90]10768.5915,[91]10664.8655,[92]10821.2156,[93]10716.0496,[94]10709.4149,[95]10813.4519,[96]10757.6222,[97]10734.2429,[98]10936.2391,[99]11298.2407,[100]11540.1889,
Final estimate: PPL = 11540.1889 +/- 1174.62311

gemma-2-9b-it:

$ ./llama-perplexity -f wikitext-2-raw/wiki.test.raw -m models/gemma-2-9b-it/ggml-model-Q4_K.gguf -ngl 99 --chunks 100
[1]17.6105,[2]19.1613,[3]16.0821,[4]16.5763,[5]17.1408,[6]17.6601,[7]18.0250,[8]19.1484,[9]21.1379,[10]23.3855,[11]22.7667,[12]23.7827,[13]25.2157,[14]23.1714,[15]21.7614,[16]21.7440,[17]19.9776,[18]20.5136,[19]20.0096,[20]19.8400,[21]19.4341,[22]19.1846,[23]18.0286,[24]17.2184,[25]16.7988,[26]15.9489,[27]15.7145,[28]15.6195,[29]15.3140,[30]15.4763,[31]15.3822,[32]15.4596,[33]15.5532,[34]15.8476,[35]15.8289,[36]16.0870,[37]16.2538,[38]16.0101,[39]16.0826,[40]16.0872,[41]15.7769,[42]15.7727,[43]15.8494,[44]15.6142,[45]15.4457,[46]15.3842,[47]15.5959,[48]15.6520,[49]15.7287,[50]15.8375,[51]15.8280,[52]15.8684,[53]16.0357,[54]15.9402,[55]16.0614,[56]16.0028,[57]15.9716,[58]16.1636,[59]16.2659,[60]16.3426,[61]16.3141,[62]16.4511,[63]16.6063,[64]16.8268,[65]17.0608,[66]17.2731,[67]17.1275,[68]17.1445,[69]17.1195,[70]17.1854,[71]17.3439,[72]17.4241,[73]17.4784,[74]17.3750,[75]17.3672,[76]17.3783,[77]17.4313,[78]17.2628,[79]17.3120,[80]17.1772,[81]17.2784,[82]17.2347,[83]17.2342,[84]17.3882,[85]17.6236,[86]17.7678,[87]17.8289,[88]17.7847,[89]17.6697,[90]17.6861,[91]17.6671,[92]17.8555,[93]17.9145,[94]17.9591,[95]18.0293,[96]18.1043,[97]18.0753,[98]18.0881,[99]18.2801,[100]18.3496,
Final estimate: PPL = 18.3496 +/- 0.46433
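For readers comparing these numbers: the perplexity reported by llama-perplexity is just the exponential of the average per-token negative log-likelihood over the evaluated chunks. A tiny sketch of the relationship (not the actual implementation):

```python
import numpy as np

def perplexity(token_logprobs):
    # Perplexity is exp of the average per-token negative log-likelihood
    return float(np.exp(-np.mean(token_logprobs)))

# An average log-probability of about -2.91 per token corresponds to PPL ~ 18.4
print(perplexity(np.full(1000, -2.91)))
```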

@slaren (Collaborator) left a comment


Since the it model seems to be working, it may be ok to merge this now.

@qnixsynapse (Contributor)

A perplexity of 18? That doesn't seem normal for a Q4 9B-parameter model; Llama 3 8B has ~6.7. I think we should hold off a bit.

@slaren (Collaborator) commented Jun 28, 2024

It's normal for an instruction-tuned model.

@qnixsynapse (Contributor)

Llama 3 8B instruction-tuned has something like 6.8-7.1, which I tested a while ago with the same quant.

@qnixsynapse (Contributor)

This looks good for now, but the perplexity is still high.

@ddh0 (Contributor) commented Jun 28, 2024

> This looks good for now, but the perplexity is still high.

Are you sure you're using the right prompt format in that interactive session? It looks like there are increasing numbers of newlines after each of the model's responses (2, then 3, then what looks like 5).

@qnixsynapse (Contributor)

@ddh0 Those newlines are output by the model, and yes, I am using the correct prompt format.
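For anyone double-checking their setup, the Gemma instruction-tuned prompt format, as I understand it from the model card, looks roughly like the template below; the angle-bracket tokens are literal special tokens and the braces are placeholders.

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
{model response}<end_of_turn>
```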

@abetlen merged commit e57dc62 into ggerganov:master on Jun 28, 2024 (53 of 54 checks passed)
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jun 28, 2024
* Inference support for Gemma 2 model family

* Update convert-hf-to-gguf.py, constants, and tensor mappings

* cleanup

* format fix

* Fix special token vocab bug

* Don't add space prefix

* fix deleted lines

* Update src/llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Add model type names

* Add control vector

* Fix model type identification

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
@koth commented Jun 28, 2024

Gemma 2's logit soft-capping should also be added for 27B inference; otherwise the output will be useless.

@bartowski1182 (Contributor)

Not useless, but definitely not max quality. They say the difference is small in their report, but there could be some downstream tasks that are more affected than they expect.

@qnixsynapse (Contributor) commented Jun 28, 2024

@koth That is only for training. It is a type of regularization, so that the logits do not exceed a certain value. It has been removed because it is incompatible with the current implementation of flash attention.

Update: wow, it looks like it really is needed for the 27B (huggingface/transformers#31698).
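For clarity, logit soft-capping squashes the logits through a scaled tanh so they can never exceed the cap. A minimal sketch follows; the cap values are the ones published in the Gemma 2 config (50.0 on the attention logits, 30.0 on the final output logits) and are shown purely for illustration, not taken from this PR.

```python
import numpy as np

def soft_cap(logits, cap):
    # Smoothly bound logits to (-cap, cap) instead of letting them grow unbounded
    return cap * np.tanh(logits / cap)

logits = np.array([-120.0, -10.0, 0.0, 10.0, 120.0])
print(soft_cap(logits, 50.0))  # attention-logit cap
print(soft_cap(logits, 30.0))  # final-logit cap
```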

@Vaibhavs10 (Collaborator)

Yeah! VB from HF here. Without soft-capping, we found that the 27B would overgenerate and mostly produce incoherent text. This is especially true for the 27B; unfortunately it means that FA2 won't be compatible :/

@N8python

I can confirm that the 27B is very drunk right now.

@N8python

Example Interaction:

User:
list ten types of fruit

Assistant:
Here are ten types of fruit:

Apple

Banana

Orange

Strawberry

Blueberry

Grapefruit

Strawberry

Grapefruit

Strawberry

Grapefruit

Let me know if you'd like a list of ten different types! I can give you a list of unique fruits, or maybe you have a specific type in mind?

@foldl (Contributor) commented Jun 29, 2024

It seems that interleaved SWA and full attention have not been implemented yet. Right?
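For context, "interleaved SWA" means the layers alternate between a local sliding-window attention mask and full (global) causal attention; the published Gemma 2 config uses a 4096-token window. A rough numpy sketch of the two mask types, purely illustrative (the alternation pattern and window size here are assumptions, not the llama.cpp code):

```python
import numpy as np

def causal_mask(n_tokens, window=None):
    # True where query position i may attend to key position j
    q = np.arange(n_tokens)[:, None]
    k = np.arange(n_tokens)[None, :]
    mask = k <= q                   # standard causal mask (full attention)
    if window is not None:
        mask &= (q - k) < window    # additionally restrict to a sliding window
    return mask

n = 8
for layer in range(4):
    # illustrative alternation: even layers local sliding-window, odd layers global
    local = (layer % 2 == 0)
    print(f"layer {layer}:", "sliding window" if local else "global")
    print(causal_mask(n, window=4 if local else None).astype(int))
```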

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
@werruww commented Jun 30, 2024

> $ ./llama-perplexity -f wikitext-2-raw/wiki.test.raw -m models/gemma-2-9b-it/ggml-model-Q4_K.gguf -ngl 99 --chunks 100

How do I run this command? Gemma 2 9B does not run for me with llama.cpp, but it does run in Ollama and LM Studio.

MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024