Llama cpp #18
Conversation
Thank you, this is so awesome! I have been meaning to look into llama support myself, but many other things are important right now, too. I will look into this more closely tomorrow, hopefully. I think we can find a good fix for the tokenizer, but the HF workaround is also good to get things running for now. Thank you so much.
Just a heads up, I just bumped main to
@lbeurerkellner I did bump it to that. For this PR though, does this matter? It looked like with 3736fcb you were moving towards a different design where this "proxy" tokenizer would not be needed anymore.
@lbeurerkellner Since ggml is making great progress every week and is already providing stellar performance on Apple Silicon and GPUs: is there any update on finalizing and merging this PR?
Yes. We have recently simplified the backend infrastructure greatly, making an integration of llama.cpp much easier. Before that, however, we first need to fix the integration of a Llama tokenizer. The HF one seems to behave in a non-standard way, which needs special handling, e.g. see #95. In general, though, I am actually quite excited about adding support and I will definitely try to push this more.
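The kind of non-standard behavior referenced here can be illustrated with a quick round-trip check on a HF Llama tokenizer. The checkpoint name below is only an example, and the exact output depends on the tokenizer version:

```python
from transformers import AutoTokenizer

# Example Llama-family checkpoint; any SentencePiece-based Llama tokenizer
# should show similar behavior.
tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

ids = tok("Hello world", add_special_tokens=False)["input_ids"]

# Decoding the whole sequence vs. decoding token-by-token and concatenating:
# the leading-space marker handling differs, which is the sort of special
# handling a Llama tokenizer integration has to account for.
print(repr(tok.decode(ids)))                        # e.g. 'Hello world'
print(repr("".join(tok.decode([i]) for i in ids)))  # e.g. 'Helloworld'
```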
I just merged llama.cpp support in #111. It leverages the new backend infrastructure, which makes it rather easy to support now. This gives us an initial implementation; however, I think work here could definitely continue, to improve support w.r.t. caching and to enable batch_size > 1 on the backend. Closing this now. Nevertheless, thanks so much for the proposal and initial implementation. It was of great help in doing this port to the updated infrastructure.
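(For readers arriving later: with the merged backend, serving a llama.cpp model looks roughly like the command below. The llama.cpp: model prefix and the --n_ctx flag are taken from later LMQL documentation and may not match the exact syntax at the time of this merge.)

lmql serve-model llama.cpp:<path-to-weights> --n_ctx 2048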
Description:
This is a rough attempt at adding support for llama.cpp. It adds a separate model serve and breaks out a couple of shared interfaces between the llama.cpp serve and the HF serve. It is far from perfect, but with some suggestions I can hopefully get it into a proper state if desired.
What the changes amount to:
An example command to run with llama.cpp looks like the following:
lmql serve-model AlekseyKorshuk/vicuna-7b --llama.cpp --model_path=../ggml-vicuna-7b-1.0-uncensored-q4_0.bin --n_threads=8
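To make the serve split described above concrete, here is a rough sketch of what a shared interface between the HF serve and the llama.cpp serve could look like. All names are illustrative rather than the actual classes in this PR; the llama.cpp side assumes the llama-cpp-python bindings.

```python
from abc import ABC, abstractmethod

from llama_cpp import Llama  # llama-cpp-python bindings


class ModelServe(ABC):
    """Minimal interface shared by the HF serve and the llama.cpp serve."""

    @abstractmethod
    def generate(self, input_ids: list[int], max_tokens: int) -> list[int]:
        """Continue the given token sequence and return generated token ids."""


class LlamaCppServe(ModelServe):
    def __init__(self, model_path: str, n_threads: int = 8):
        # llama.cpp loads the ggml weights directly; no HF model is involved here.
        self.llm = Llama(model_path=model_path, n_threads=n_threads)

    def generate(self, input_ids: list[int], max_tokens: int) -> list[int]:
        out: list[int] = []
        for token in self.llm.generate(input_ids):
            out.append(token)
            if token == self.llm.token_eos() or len(out) >= max_tokens:
                break
        return out
```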
The current problem I have not gotten around is that lmql.runtime.tokenizer requires a tokenizer to be loaded alongside the serve, and I'm not sure what the best way would be to build a tokenizer from the ggml alone that is compatible with LMQLTokenizer. So, for now, I have bootstrapped the model arg to require a HF tokenizer that can be paired with the llama.cpp ggml.
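A minimal sketch of that workaround, assuming the HF tokenizer's vocabulary matches the ggml model's (the checkpoint and file names are just the ones from the example command above):

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

# HF tokenizer supplied via the model arg; it only handles (de)tokenization.
tokenizer = AutoTokenizer.from_pretrained("AlekseyKorshuk/vicuna-7b")

# llama.cpp handles the actual generation from the ggml weights.
llm = Llama(model_path="../ggml-vicuna-7b-1.0-uncensored-q4_0.bin", n_threads=8)

prompt_ids = tokenizer("Hello, my name is", add_special_tokens=False)["input_ids"]

generated = []
for token in llm.generate(prompt_ids):
    generated.append(token)
    if token == llm.token_eos() or len(generated) >= 16:
        break

print(tokenizer.decode(generated))
```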