
Supports SmolLM #495

Merged
jart merged 1 commit into Mozilla-Ocho:main on Jul 22, 2024
Conversation

@Stillerman (Contributor) commented Jul 21, 2024

These changes are needed to make a GGUF for SmolLM work with llamafile. The GGUF was generated with this PR for llama.cpp.

Tested with

llamafile-convert smol-135M.gguf
./smol-135M.llamafile  

@jart self-requested a review on July 22, 2024 06:24
@jart (Collaborator) commented Jul 22, 2024

Wow thanks for sending this. I checked out your llama.cpp PR and I'm having trouble creating a GGUF file. Any ideas?

Traceback (most recent call last):
  File "/home/jart/llama.cpp/convert_hf_to_gguf.py", line 3673, in <module>
    main()
  File "/home/jart/llama.cpp/convert_hf_to_gguf.py", line 3666, in main
    model_instance.write()
  File "/home/jart/llama.cpp/convert_hf_to_gguf.py", line 400, in write
    self.prepare_metadata(vocab_only=False)
  File "/home/jart/llama.cpp/convert_hf_to_gguf.py", line 348, in prepare_metadata
    self.metadata = gguf.Metadata.load(self.metadata_override, self.dir_model, self.model_name, total_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jart/llama.cpp/gguf-py/gguf/metadata.py", line 59, in load
    metadata = Metadata.apply_metadata_heuristic(metadata, model_card, hf_params, model_path, total_params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jart/llama.cpp/gguf-py/gguf/metadata.py", line 396, in apply_metadata_heuristic
    model_full_name_component, org_component, basename, finetune, version, size_label = Metadata.get_model_id_components(model_id, total_params)
                                                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jart/llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
                                      ~~~~^^^
IndexError: string index out of range

I ran this command:

$?=1 main jart@luna:/fast/hf/SmolLM-135M$ ~/llama.cpp/convert_hf_to_gguf.py --outtype bf16 .
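
For reference, the IndexError appears to come from indexing the first character of an empty model-name component in the metadata heuristic. A minimal sketch of that failure mode only (not the actual llama.cpp code, and the empty component is an assumption about the cause):

# sketch.py (hypothetical): part[0] on an empty string raises the same
# "string index out of range" error shown in the traceback above.
def first_char_is_alpha(part: str) -> bool:
    return part[0].isalpha()  # IndexError when part == ""

try:
    first_char_is_alpha("")  # simulates an empty model-name component
except IndexError as err:
    print(err)  # string index out of range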

@Stillerman (Contributor, Author)

Hmmm, looking into this now.

@Stillerman (Contributor, Author) commented Jul 22, 2024

Could you try a fresh copy of llama.cpp at d94c6e0 with the following environment?

(justine) jts@Jasons-MacBook-Air justine % uv pip freeze
certifi==2024.7.4
charset-normalizer==3.3.2
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.24.0
idna==3.7
jinja2==3.1.4
markupsafe==2.1.5
mpmath==1.3.0
networkx==3.3
numpy==1.26.4
packaging==24.1
pyyaml==6.0.1
regex==2024.5.15
requests==2.32.3
safetensors==0.4.3
sentencepiece==0.2.0
sympy==1.13.1
tokenizers==0.19.1
torch==2.3.1
tqdm==4.66.4
transformers==4.42.4
typing-extensions==4.12.2
urllib3==2.2.2

download.py

from huggingface_hub import snapshot_download

# https://huggingface.co/HuggingFaceTB/SmolLM-135M
model_id="HuggingFaceTB/SmolLM-135M"
snapshot_download(repo_id=model_id, local_dir="smol-135",
                    local_dir_use_symlinks=False, revision="main")

and then run

python download.py
python convert_hf_to_gguf.py smol-135 --outtype bf16

I was then able to run

make -j8
./llama-cli -m "smol-135/smol-135M-135-BF16.gguf" -p "hi there llama\!"

and inference seems to work.
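
For reference, the same steps can be scripted end to end. A rough sketch using the paths and filenames from the commands above (run from the llama.cpp checkout; the subprocess calls just wrap the shell commands shown):

# reproduce.py (sketch): download SmolLM-135M, convert it to a BF16 GGUF,
# build llama.cpp, and run a quick prompt.
import subprocess
from huggingface_hub import snapshot_download

snapshot_download(repo_id="HuggingFaceTB/SmolLM-135M",
                  local_dir="smol-135", revision="main")
subprocess.run(["python", "convert_hf_to_gguf.py", "smol-135",
                "--outtype", "bf16"], check=True)
subprocess.run(["make", "-j8"], check=True)
subprocess.run(["./llama-cli", "-m", "smol-135/smol-135M-135-BF16.gguf",
                "-p", "hi there llama!"], check=True)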

@jart (Collaborator) commented Jul 22, 2024

Conversion works now, although the filename it chooses is a little weird.

[screenshot of the generated filename]

@jart (Collaborator) left a comment

Fantastic. This model goes wicked fast on CPU.

llama_print_timings:        load time =      59.98 ms
llama_print_timings:      sample time =       1.16 ms /    30 runs   (    0.04 ms per token, 25862.07 tokens per second)
llama_print_timings: prompt eval time =      45.54 ms /   203 tokens (    0.22 ms per token,  4457.82 tokens per second)
llama_print_timings:        eval time =     185.74 ms /    29 runs   (    6.40 ms per token,   156.14 tokens per second)
llama_print_timings:       total time =     237.18 ms /   232 tokens
Log end
smol jart@luna:~/llamafile$ ls -hal /weights/SmolLM-135M.BF16.gguf
-rw-rw-r-- 1 jart jart 259M Jul 22 10:29 /weights/SmolLM-135M.BF16.gguf
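
As a quick sanity check, the eval rate in the log follows directly from the reported numbers (29 tokens generated in 185.74 ms):

# eval-rate check using the figures from the log above
eval_tokens = 29
eval_time_ms = 185.74
print(eval_tokens / (eval_time_ms / 1000.0))  # ~156.1 tokens per second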

Thank you! Approved!

@jart merged commit cc30400 into Mozilla-Ocho:main on Jul 22, 2024
2 checks passed
@Stillerman (Contributor, Author)

Llamafiles for all SmolLM models can be found here.

@jart (Collaborator) commented Jul 23, 2024

Nice!
