[Feature] GGUF support #1616

Closed · 2 tasks done
remixer-dec opened this issue Oct 9, 2024 · 16 comments
Labels: good first issue (Good for newcomers)

@remixer-dec

Checklist

Motivation

Hi! Since the .gguf format is already supported by vLLM, would it be possible to add support for it in the SGLang server?

Related resources

No response

@merrymercy
Contributor

merrymercy commented Oct 11, 2024

It should be easy. Could you give us an example of the command you would like us to support?

@remixer-dec
Author

python -m sglang.launch_server --model-path /path/to/model.gguf

@hahmad2008

@remixer-dec @merrymercy
How do you serve a GGUF model? vLLM provides a tutorial for this: https://docs.vllm.ai/en/latest/getting_started/examples/gguf_inference.html

But how do you run inference with SGLang? @merrymercy could you please provide a command to do that?
An example for this model: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
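(For reference, the vLLM tutorial linked above runs a local GGUF file with the offline LLM API, roughly as in the sketch below; the local filename and prompt are illustrative, not taken from the tutorial.)

```python
from vllm import LLM, SamplingParams

# Rough sketch of offline GGUF inference with vLLM (per the tutorial linked above).
# GGUF repos usually do not ship HF tokenizer files, so the tokenizer is pointed
# at the original unquantized repo; the local filename is illustrative.
llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)

params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(["Tell me a short joke."], params)
print(outputs[0].outputs[0].text)
```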

@XYZliang

XYZliang commented Nov 8, 2024

python -m sglang.launch_server --model-path /path/to/model.gguf

If you haven't tried it, please don't reply. This doesn't work at all.

python -m sglang.launch_server --model-path /home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf --port 30000 --mem-fraction-static 0.8
WARNING 11-08 15:09:00 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
[2024-11-08 15:09:08] server_args=ServerArgs(model_path='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', tokenizer_path='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=879353602, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
WARNING 11-08 15:09:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 11-08 15:09:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
[2024-11-08 15:09:22] Traceback (most recent call last):
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/utils/hub.py", line 403, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/managers/detokenizer_manager.py", line 216, in run_detokenizer_process
    manager = DetokenizerManager(server_args, port_args)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/managers/detokenizer_manager.py", line 72, in __init__
    self.tokenizer = get_tokenizer(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 129, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 844, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 676, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/utils/hub.py", line 469, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

@remixer-dec
Author

@XYZliang Maybe read the previous comment? It's not supposed to be working yet; I replied with the command format I was asked for.

@XYZliang

XYZliang commented Nov 9, 2024

@XYZliang Maybe read the previous comment? It's not supposed to be working yet; I replied with the command format I was asked for.

Sorry, I didn't pay attention to who had commented...

@whk6688

whk6688 commented Nov 14, 2024

It does not work. The error is:

(python311) whk@VM-2-13-ubuntu:~/code/qwen25-3b$ python -m sglang.launch_server --model-path Qwen2.5-3B-Instruct-q5_k_m.gguf --port 8075 --host 0.0.0.0 --mem-fraction-static 0.2 --chat-template template.json
[2024-11-14 11:42:24] server_args=ServerArgs(model_path='Qwen2.5-3B-Instruct-q5_k_m.gguf', tokenizer_path='Qwen2.5-3B-Instruct-q5_k_m.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='Qwen2.5-3B-Instruct-q5_k_m.gguf', chat_template='template.json', is_embedding=False, host='0.0.0.0', port=8075, mem_fraction_static=0.2, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=995653076, constrained_json_whitespace_pattern=None, watchdog_timeout=300, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Traceback (most recent call last):
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 668, in _get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 771, in _dict_from_json_file
    text = reader.read()
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 8: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/launch_server.py", line 16, in <module>
    raise e
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/server.py", line 457, in launch_server
    launch_engine(server_args=server_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/server.py", line 429, in launch_engine
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 103, in __init__
    self.model_config = ModelConfig(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/configs/model_config.py", line 46, in __init__
    self.hf_config = get_config(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 66, in get_config
    config = AutoConfig.from_pretrained(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1017, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
    raise EnvironmentError(
OSError: It looks like the config file at 'Qwen2.5-3B-Instruct-q5_k_m.gguf' is not a valid JSON file.

@merrymercy
Contributor

It should be easy to support. Contributions are welcome! Or you can convert that to HF format.
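(As a workaround while native GGUF support is pending, recent transformers versions can de-quantize a GGUF file into a regular HF checkpoint via the gguf_file argument; a minimal sketch with illustrative paths is below. Note that architecture coverage of the GGUF loader depends on the transformers version.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative paths: a local directory containing the .gguf file.
gguf_dir = "./gguf"
gguf_file = "Qwen2.5-3B-Instruct-q5_k_m.gguf"

# De-quantize the GGUF weights into a standard HF model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained(gguf_dir, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(gguf_dir, gguf_file=gguf_file)

# Save an HF-format folder that sglang.launch_server can load with --model-path.
model.save_pretrained("./Qwen2.5-3B-Instruct-hf")
tokenizer.save_pretrained("./Qwen2.5-3B-Instruct-hf")
```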

@zhengy001
Contributor

Please take a look at this PR: #2215

@merrymercy
Contributor

Supported by #2215.

@remixer-dec
Author

Let's go! Thank you!

@remixer-dec
Author

remixer-dec commented Dec 6, 2024

Just updated SGLang and tried to load GGUF models:
The output quality differs a lot from llama.cpp with the same model. It just keeps outputting nonsense in SGLang.
[image]

@remixer-dec
Author

The same model loaded in vLLM works totally fine as well.
[image]

@zhengy001
Contributor

Just updated SGLang and tried to load GGUF models: The output quality differs a lot from llama.cpp with the same model. It just keeps outputting nonsense in SGLang. [image]

@remixer-dec
Applying this patch should fix it:
[patch link]

@remixer-dec
Author

remixer-dec commented Dec 9, 2024

@zhengy001 It is better (at least no more collapse), but it keeps generating text without ever stopping (by default).
[image: after-patch]
P.S. If you manually pass the stop sequence </s> in each request, it does stop correctly, but such information should be loaded from the model metadata and, if specified, from the --chat-template template.
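(Until the metadata is picked up automatically, the stop string can be passed per request; a minimal sketch against SGLang's native /generate endpoint, assuming the server is running locally on port 30000 and the prompt is illustrative:)

```python
import requests

# Assumes sglang.launch_server is running locally on port 30000.
response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "Hello, who are you?",
        "sampling_params": {
            # Work around the missing EOS by passing the stop string explicitly.
            "stop": ["</s>"],
            "max_new_tokens": 128,
        },
    },
)
print(response.json()["text"])
```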

Currently, when a custom chat template is specified:
For the GPTQ model:
[image: GPTQ]
vs GGUF:
[image: GGUF]

@zhengy001
Contributor

zhengy001 commented Dec 13, 2024

@zhengy001 It is better (at least no more collapse), but it keeps generating text without ever stopping (by default). [image: after-patch] P.S. If you manually pass the stop sequence </s> in each request, it does stop correctly, but such information should be loaded from the model metadata and, if specified, from the --chat-template template. Currently, when a custom chat template is specified: for the GPTQ model: [image: GPTQ] vs GGUF: [image: GGUF]

Model EOS is not loaded correctly.

Please check this PR: #2475
