
LLaMa.cpp support #49

Open
Alex20129 opened this issue Dec 11, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments

@Alex20129

I'm currently playing with LLaMa.cpp (a qwen-instruct GGUF model) in console chat mode, and I wondered if it would be possible to seamlessly integrate LLaMa.cpp with Qt Creator. That's how I got here.

I would like to try QodeAssist with LLaMa.cpp as a backend.
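To use it as a backend I assume the llama.cpp HTTP server would need to be running rather than the console chat; roughly something like this (binary name and flags as in recent llama.cpp builds, the model path is just a placeholder from my setup):

# start the llama.cpp server with a GGUF model on port 8080, 8192-token context
./llama-server -m ./qwen2.5-coder-model.gguf --port 8080 -c 8192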

@Alex20129
Author

According to the README, the LLaMa.cpp server has a special /infill endpoint for code infilling.

@Palm1r
Owner

Palm1r commented Dec 11, 2024

It is a good idea, and thank you for the information. Yes, it is possible to use llama.cpp as a backend, but there is one thing: for now QodeAssist only works with FIM models for code completion. I am working on extending that to instruct models; maybe today or tomorrow I will finish. I am also waiting for Qt Creator 15.0.1, because Qt has only shared 15.0.1 and I can't build 15.0.0 via GitHub Actions for all platforms. If you have time and patience, wait a bit and I will add everything and release it in the next version.
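Just for reference, a minimal sketch of what a raw FIM request can look like against the llama.cpp server's /completion endpoint, assuming a model that uses the Qwen2.5-Coder FIM special tokens (the port, token names, and parameter values here are only illustrative, not necessarily what QodeAssist sends):

# ask the model to fill in the middle between the prefix and the suffix
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{
    "prompt": "<|fim_prefix|>def hello_<|fim_suffix|>():\n    return \"Hello World\"<|fim_middle|>",
    "n_predict": 32,
    "temperature": 0.2
  }'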

@Palm1r Palm1r added the enhancement New feature or request label Dec 11, 2024
@Palm1r
Owner

Palm1r commented Dec 19, 2024

I am back on this. @Alex20129, do you have a model (or a link to one) that supports FIM? I need it for testing.
For now, when I test from the terminal, I get this:
curl --request POST \
  --url http://localhost:8080/infill \
  --header "Content-Type: application/json" \
  --data '{
    "prompt": "bla",
    "input_prefix": "def hello_",
    "input_suffix": "():\n return \"Hello World\"",
    "temperature": 0.8
  }'
{"error":{"code":501,"message":"Infill is not supported by this model: prefix token is missing. suffix token is missing. middle token is missing. ","type":"not_supported_error"}}

@Alex20129
Author

I don't know how to test the FIM function properly, because I've only used it in chat mode. Anyway, here are the models:
https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-GGUF/tree/main
I'm using the qwen-2.5-coder-32b-instruct-q8_0.gguf model.
And when I run LLaMa.cpp I can see this:

...
llm_load_print_meta: model type       = 32B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 32,76 B
llm_load_print_meta: model size       = 32,42 GiB (8,50 BPW) 
llm_load_print_meta: general.name     = Qwen2.5 Coder 32B Instruct AWQ
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
...

which makes me think that qwen-coder itself and all its derivatives were specifically trained with the FIM objective in mind.

@Alex20129
Copy link
Author

I also tried running rombos-coder-2.5-qwen-32b-q5_k_s.gguf, which is compact and can be fully loaded onto the GPU, but I'm not sure about its quality since it's a derivative model mixed by an enthusiast.
