[Feature] GGUF support #1616

Closed · 2 tasks done
remixer-dec opened this issue Oct 9, 2024 · 16 comments
Labels: good first issue (Good for newcomers)

@remixer-dec

Checklist

Motivation

Hi! Since the .gguf format is already supported by vLLM, would it be possible to add support for it in the SGLang server?

Related resources

No response

@merrymercy
Contributor

merrymercy commented Oct 11, 2024

It should be easy. Could you give us an example of the command you would like us to support?

@remixer-dec
Author

python -m sglang.launch_server --model-path /path/to/model.gguf

@hahmad2008

@remixer-dec @merrymercy
How do you serve a GGUF model? vLLM provides a tutorial for this: https://docs.vllm.ai/en/latest/getting_started/examples/gguf_inference.html

But how do you run inference with SGLang? @merrymercy could you please provide a command to do that?
An example for this model: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
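(For reference, the vLLM tutorial linked above runs a local GGUF file with the offline LLM API, roughly as in the sketch below; the local filename and prompt are illustrative, not taken from the tutorial.)

```python
from vllm import LLM, SamplingParams

# Rough sketch of offline GGUF inference with vLLM (per the tutorial linked above).
# GGUF repos usually do not ship HF tokenizer files, so the tokenizer is pointed
# at the original unquantized repo; the local filename is illustrative.
llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)

params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(["Tell me a short joke."], params)
print(outputs[0].outputs[0].text)
```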

@XYZliang

XYZliang commented Nov 8, 2024

python -m sglang.launch_server --model-path /path/to/model.gguf

If you haven't tried it, please don't reply. This doesn't work at all.

python -m sglang.launch_server --model-path /home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf --port 30000 --mem-fraction-static 0.8
WARNING 11-08 15:09:00 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
[2024-11-08 15:09:08] server_args=ServerArgs(model_path='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', tokenizer_path='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=879353602, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
WARNING 11-08 15:09:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 11-08 15:09:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
[2024-11-08 15:09:22] Traceback (most recent call last):
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/utils/hub.py", line 403, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/managers/detokenizer_manager.py", line 216, in run_detokenizer_process
    manager = DetokenizerManager(server_args, port_args)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/managers/detokenizer_manager.py", line 72, in __init__
    self.tokenizer = get_tokenizer(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 129, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 844, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 676, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/opt/anaconda3/envs/BDCI2024/lib/python3.10/site-packages/transformers/utils/hub.py", line 469, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/home/xyzliang/data/project/oil/gguf/v3_merge_Q4_K_M.gguf'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

@remixer-dec
Author

@XYZliang Maybe read the previous comment? It's not supposed to be working yet; I replied with the command format I was asked for.

@XYZliang

XYZliang commented Nov 9, 2024

@XYZliang Maybe read the previous comment? It's not supposed to be working yet; I replied with the command format I was asked for.

Sorry, I didn't pay attention to who had commented...

@whk6688

whk6688 commented Nov 14, 2024

It does not work. The error is:

(python311) whk@VM-2-13-ubuntu:~/code/qwen25-3b$ python -m sglang.launch_server --model-path Qwen2.5-3B-Instruct-q5_k_m.gguf --port 8075 --host 0.0.0.0 --mem-fraction-static 0.2 --chat-template template.json
[2024-11-14 11:42:24] server_args=ServerArgs(model_path='Qwen2.5-3B-Instruct-q5_k_m.gguf', tokenizer_path='Qwen2.5-3B-Instruct-q5_k_m.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='Qwen2.5-3B-Instruct-q5_k_m.gguf', chat_template='template.json', is_embedding=False, host='0.0.0.0', port=8075, mem_fraction_static=0.2, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=995653076, constrained_json_whitespace_pattern=None, watchdog_timeout=300, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Traceback (most recent call last):
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 668, in _get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 771, in _dict_from_json_file
    text = reader.read()
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 8: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/launch_server.py", line 16, in <module>
    raise e
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/server.py", line 457, in launch_server
    launch_engine(server_args=server_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/server.py", line 429, in launch_engine
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/managers/tokenizer_manager.py", line 103, in __init__
    self.model_config = ModelConfig(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/configs/model_config.py", line 46, in __init__
    self.hf_config = get_config(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/sglang/srt/hf_transformers_utils.py", line 66, in get_config
    config = AutoConfig.from_pretrained(
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1017, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/wanghaikuan/anaconda3/envs/python311/lib/python3.10/site-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
    raise EnvironmentError(
OSError: It looks like the config file at 'Qwen2.5-3B-Instruct-q5_k_m.gguf' is not a valid JSON file.

@merrymercy
Contributor

It should be easy to support. Contributions are welcome! Or you can convert that to HF format.
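(As a workaround while native GGUF support is pending, recent transformers versions can de-quantize a GGUF file into a regular HF checkpoint via the gguf_file argument; a minimal sketch with illustrative paths is below. Note that architecture coverage of the GGUF loader depends on the transformers version.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative paths: a local directory containing the .gguf file.
gguf_dir = "./gguf"
gguf_file = "Qwen2.5-3B-Instruct-q5_k_m.gguf"

# De-quantize the GGUF weights into a standard HF model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained(gguf_dir, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(gguf_dir, gguf_file=gguf_file)

# Save an HF-format folder that sglang.launch_server can load with --model-path.
model.save_pretrained("./Qwen2.5-3B-Instruct-hf")
tokenizer.save_pretrained("./Qwen2.5-3B-Instruct-hf")
```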

@zhengy001
Contributor

Please take a look at this PR: #2215

@merrymercy
Contributor

Supported by #2215.

@remixer-dec
Author

Let's go! Thank you!

@remixer-dec
Author

remixer-dec commented Dec 6, 2024

Just updated SGLang and tried to load GGUF models:
The output quality differs a lot from llama.cpp with the same model. It just keeps outputting nonsense in SGLang.
[image]

@remixer-dec
Author

The same model loaded in vLLM works totally fine as well.
[image]

@zhengy001
Contributor

Just updated SGLang and tried to load GGUF models: The output quality differs a lot from llama.cpp with the same model. It just keeps outputting nonsense in SGLang. [image]

@remixer-dec
Applying this patch should fix it:
[patch link]

@remixer-dec
Author

remixer-dec commented Dec 9, 2024

@zhengy001 It is better (at least no more collapse), but it keeps generating text without ever stopping (by default).
[image: after-patch]
P.S. If you manually pass the stop sequence </s> in each request, it does stop correctly, but such information should be loaded from the model metadata and, if specified, from the --chat-template template.
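(Until the metadata is picked up automatically, the stop string can be passed per request; a minimal sketch against SGLang's native /generate endpoint, assuming the server is running locally on port 30000 and the prompt is illustrative:)

```python
import requests

# Assumes sglang.launch_server is running locally on port 30000.
response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "Hello, who are you?",
        "sampling_params": {
            # Work around the missing EOS by passing the stop string explicitly.
            "stop": ["</s>"],
            "max_new_tokens": 128,
        },
    },
)
print(response.json()["text"])
```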

Currently, when a custom chat template is specified:
For the GPTQ model:
[image: GPTQ]
vs GGUF:
[image: GGUF]

@zhengy001
Contributor

zhengy001 commented Dec 13, 2024

@zhengy001 It is better (at least no more collapse), but it keeps generating text without ever stopping (by default). [image: after-patch] P.S. If you manually pass the stop sequence </s> in each request, it does stop correctly, but such information should be loaded from the model metadata and, if specified, from the --chat-template template. Currently, when a custom chat template is specified: for the GPTQ model: [image: GPTQ] vs GGUF: [image: GGUF]

Model EOS is not loaded correctly.

Please check this PR: #2475
