llama : Added support for SmolLm pre-tokenizer (#8608) #8609

Merged
merged 5 commits into from
Jul 22, 2024
Changes from 4 commits
3 changes: 3 additions & 0 deletions convert_hf_to_gguf.py
@@ -596,6 +596,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
        if chkhsh == "63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e":
            # ref: https://huggingface.co/mistralai/Mistral-Nemo-Base-2407
            res = "tekken"
        if chkhsh == "855059429035d75a914d1eda9f10a876752e281a054a7a3d421ef0533e5b6249":
            # ref: https://huggingface.co/HuggingFaceTB/SmolLM-135M
            res = "smollm"

        if res is None:
            logger.warning("\n")
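The `chkhsh` values compared in this hunk act as fingerprints of a tokenizer's behaviour. The following is a minimal sketch of that idea, assuming the fingerprint is the SHA-256 of the stringified token-ID list produced for a fixed check text. The names `fingerprint`, `lookup`, and `toy_encode`, and the check string, are hypothetical and used only for illustration; this is not the converter's actual code.

```python
from hashlib import sha256


def fingerprint(encode, check_text):
    """Hash the token IDs that `encode` produces for `check_text`."""
    token_ids = encode(check_text)
    return sha256(str(token_ids).encode()).hexdigest()


def lookup(encode, check_text, known):
    """Return the registered pre-tokenizer name, or None if the hash is unknown."""
    return known.get(fingerprint(encode, check_text))


def toy_encode(text):
    # Hypothetical stand-in tokenizer: one ID per character.
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    check_text = "Hello 3"  # assumption: any fixed string works for the sketch
    known = {fingerprint(toy_encode, check_text): "toy"}
    print(lookup(toy_encode, check_text, known))
```

Two tokenizers that split the check text differently yield different ID lists and therefore different hashes, which is why a new pre-tokenizer such as `smollm` needs its own entry.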
1 change: 1 addition & 0 deletions convert_hf_to_gguf_update.py
@@ -92,6 +92,7 @@ class TOKENIZER_TYPE(IntEnum):
    {"name": "jais", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/core42/jais-13b", },
    {"name": "t5", "tokt": TOKENIZER_TYPE.UGM, "repo": "https://huggingface.co/google-t5/t5-small", },
    {"name": "tekken", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistralai/Mistral-Nemo-Base-2407", },
    {"name": "smollm", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/HuggingFaceTB/SmolLM-135M", },
]


1 change: 1 addition & 0 deletions include/llama.h
@@ -93,6 +93,7 @@ extern "C" {
        LLAMA_VOCAB_PRE_TYPE_VIKING = 18,
        LLAMA_VOCAB_PRE_TYPE_JAIS = 19,
        LLAMA_VOCAB_PRE_TYPE_TEKKEN = 20,
        LLAMA_VOCAB_PRE_TYPE_SMOLLM = 21,
    };

    // note: these values should be synchronized with ggml_rope
112 changes: 112 additions & 0 deletions models/ggml-vocab-smollm.gguf.inp
@@ -0,0 +1,112 @@
ied 4 ½ months
__ggml_vocab_test__
Führer
__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__


__ggml_vocab_test__



__ggml_vocab_test__




__ggml_vocab_test__


__ggml_vocab_test__
Hello world
__ggml_vocab_test__
Hello world
__ggml_vocab_test__
Hello World
__ggml_vocab_test__
Hello World
__ggml_vocab_test__
Hello World!
__ggml_vocab_test__
Hello, world!
__ggml_vocab_test__
Hello, world!
__ggml_vocab_test__
this is 🦙.cpp
__ggml_vocab_test__
w048 7tuijk dsdfhu
__ggml_vocab_test__
нещо на Български
__ggml_vocab_test__
កាន់តែពិសេសអាចខលចេញ
__ggml_vocab_test__
🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
Hello
__ggml_vocab_test__
(
__ggml_vocab_test__

=
__ggml_vocab_test__
' era
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天~
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33
__ggml_vocab_test__
333
__ggml_vocab_test__
3333
__ggml_vocab_test__
33333
__ggml_vocab_test__
333333
__ggml_vocab_test__
3333333
__ggml_vocab_test__
33333333
__ggml_vocab_test__
333333333
__ggml_vocab_test__
Cửa Việt
__ggml_vocab_test__
discards
__ggml_vocab_test__











🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL
__ggml_vocab_test__
46 changes: 46 additions & 0 deletions models/ggml-vocab-smollm.gguf.out
@@ -0,0 +1,46 @@
885 216 36 216 16738 2704
54 46991 16863

216
256
333
197
198
1116
16506
197 198
19556 905
38699 905
19556 2260
38699 2260
38699 2260 17
19556 28 905 17
38699 28 905 17
451 314 15107 116 243 30 35392
103 32 36 40 216 39 24961 47112 21554 3492 15995
8831 6643 46438 6485 40610 5470 235 156 228 12681 29441 6511 9175 39511 7872
40478 218 40478 131 40478 237 172 249 229 40478 233 172 249 220 40478 240 40478 132 40478 249 172 249 219 40478 249 40478 112 40478 131 40478 223 40478 219 40478 245 40478 223 172 249 219 40478 227
10813 244 218 365 5472 25 40303 131 321 231 10813 230 121 31752 365 30404 649 21658 271 46336 483 25 4636 246 223 365 8979 649 33777 338 553 624 1038 9624 25
19556
38699
216 38699
256 38699
333 38699
333 38699 472 38699
365
198 446
23 5741
19556 28 329 23 449 17 1073 359 346 40303 219 9148 19805 235 177 221 128 32632 21949 36149 115 40994 33 35 33 36 33 37 33 18614 119 186 138 248
36689 10095
35
35 35
35 35 35
35 35 35 35
35 35 35 35 35
35 35 35 35 35 35
35 35 35 35 35 35 35
35 35 35 35 35 35 35 35
35 35 35 35 35 35 35 35 35
51 25275 251 81 10506 25275 225 100
937 1563
3805 8866 1116 3805 197 216 1656 216 197 11181 472 2367 3914 198 10813 244 218 365 5472 25 40303 131 321 231 10813 230 121 31752 365 30404 649 21658 271 46336 483 25 4636 246 223 15107 116 243 10813 116 243 216 35 216 35 35 216 35 35 35 216 35 35 35 35 216 35 35 35 35 35 216 35 35 35 35 35 35 216 35 35 35 35 35 35 35 216 35 35 35 35 35 35 35 35 216 35 30 35 216 35 950 35 216 35 2026 35 15822 248 218 40478 131 40478 237 172 249 229 40478 233 172 249 220 40478 240 40478 132 40478 249 172 249 219 40478 249 40478 112 40478 131 40478 223 10813 242 219 9148 19805 235 177 221 128 32632 21949 36149 115 40994 33 35 33 36 33 37 33 18614 119 186 138 248 216 21771 2031 28733 28050 6643 46438 6485 40610 5470 235 156 228 12681 29441 6511 9175 39511 7872 7855 11193 1969 1969 3725 1093 1093 5592 950 36689 10095 16693 16693 16693 339 3543 719 637 100 793 384 506 665 28 637 3256 346 2090 47 637 61 441 2090 339 3060 919 357 28 637 52 346 702 634 7188 47 1046 23 25917 253 23 92 60
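The `.inp` and `.out` files above pair up: each test text in the `.inp` file is followed by a `__ggml_vocab_test__` line, and the `.out` file holds one line of token IDs per test. The sketch below shows how such a pair could be read and checked. It assumes the separator is written as a newline, the marker, and another newline after every case; `parse_inputs`, `parse_outputs`, `failing_cases`, and the toy encoder are hypothetical names, not part of the repository's test harness.

```python
SEPARATOR = "\n__ggml_vocab_test__\n"


def parse_inputs(inp_text):
    """Split the .inp contents into test strings, keeping whitespace exactly."""
    cases = inp_text.split(SEPARATOR)
    if cases and cases[-1] == "":
        cases.pop()  # drop the empty tail left after the final separator
    return cases


def parse_outputs(out_text):
    """Turn each .out line into a list of token IDs (an empty line gives [])."""
    return [[int(tok) for tok in line.split()] for line in out_text.splitlines()]


def failing_cases(encode, inp_text, out_text):
    """Return the test strings whose encoding differs from the expected IDs."""
    cases = parse_inputs(inp_text)
    expected = parse_outputs(out_text)
    if len(cases) != len(expected):
        raise ValueError("number of inputs and outputs differ")
    return [text for text, ids in zip(cases, expected) if encode(text) != ids]


if __name__ == "__main__":
    inp = "Hello world" + SEPARATOR + "" + SEPARATOR + "\n" + SEPARATOR
    out = "19556 905\n\n198\n"
    # Hypothetical lookup-table encoder used only to exercise the parser.
    table = {"Hello world": [19556, 905], "": [], "\n": [198]}
    print(failing_cases(table.get, inp, out))
```

Splitting on the full separator rather than on individual lines matters because several test cases consist only of spaces, tabs, or newlines.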
5 changes: 5 additions & 0 deletions src/llama.cpp
@@ -5530,6 +5530,10 @@ static void llm_load_vocab(
                vocab.tokenizer_clean_spaces = false;
                vocab.tokenizer_ignore_merges = true;
                vocab.tokenizer_add_bos = true;
            } else if (
                tokenizer_pre == "smollm") {
                vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_SMOLLM;
                vocab.tokenizer_clean_spaces = false;
            } else {
                throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
            }
@@ -15554,6 +15558,7 @@ struct llm_tokenizer_bpe {
            case LLAMA_VOCAB_PRE_TYPE_STARCODER:
            case LLAMA_VOCAB_PRE_TYPE_REFACT:
            case LLAMA_VOCAB_PRE_TYPE_COMMAND_R:
            case LLAMA_VOCAB_PRE_TYPE_SMOLLM:
                regex_exprs = {
                    "\\p{N}",
                    "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
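The two expressions shared by this case group can be read as successive splitting passes: the first isolates every digit, the second applies a GPT-2 style pattern to what remains. Below is a rough sketch of that behaviour, assuming each expression is applied in order to the pieces left by the previous one. It uses only the standard `re` module, so `\p{N}` is approximated by `\d` and `\p{L}` by `[^\W\d_]`; these approximations differ from the real Unicode classes for some characters (for example `½`). `split_by` and `pre_tokenize` are hypothetical names, not llama.cpp functions.

```python
import re

# Stdlib approximations of the Unicode classes used in the diff (assumption).
DIGIT = re.compile(r"\d")
WORDS = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)"
)


def split_by(pattern, pieces):
    """Split each piece at the pattern's matches, keeping unmatched gaps."""
    out = []
    for piece in pieces:
        pos = 0
        for match in pattern.finditer(piece):
            if match.start() > pos:
                out.append(piece[pos:match.start()])  # text between matches
            out.append(match.group())
            pos = match.end()
        if pos < len(piece):
            out.append(piece[pos:])  # trailing unmatched text
    return out


def pre_tokenize(text):
    """Apply the digit pass, then the word pass, to the whole text."""
    pieces = [text]
    for pattern in (DIGIT, WORDS):
        pieces = split_by(pattern, pieces)
    return pieces


if __name__ == "__main__":
    print(pre_tokenize("3333"))
    print(pre_tokenize("Hello, world!"))
```

Under this sketch `3333` becomes four single-digit pieces, which is consistent with the expected output `35 35 35 35` listed for that input in `ggml-vocab-smollm.gguf.out`.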