
Llama cpp #18

Closed
wants to merge 10 commits

Conversation

@CircArgs commented Apr 14, 2023

Description:

This is a rough attempt at adding support for llama.cpp. It adds a separate model serve and breaks out a couple of shared interfaces between the llama.cpp serve and the HF serve. It is far from perfect, but with some suggestions I can hopefully get it into a proper state if desired.

What the changes amount to:

An example command to run with llama.cpp looks like the following:
lmql serve-model AlekseyKorshuk/vicuna-7b --llama.cpp --model_path=../ggml-vicuna-7b-1.0-uncensored-q4_0.bin --n_threads=8

The current problem I have not gotten around yet is that lmql.runtime.tokenizer requires some tokenizer to be loaded alongside the serve, and I'm not sure what the best way would be to build a tokenizer compatible with LMQLTokenizer from the ggml file alone. So, for now, I have bootstrapped the model argument to require an HF tokenizer that pairs with the llama.cpp ggml.
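
To make the workaround concrete, here is a rough sketch of what it amounts to (illustrative only, not the code in this PR; it assumes the llama-cpp-python package and that the HF checkpoint from the example command above ships a usable tokenizer):

    # Sketch of the workaround described above: the llama.cpp weights come from the
    # local ggml file, while tokenization is delegated to the paired HF checkpoint.
    from llama_cpp import Llama
    from transformers import AutoTokenizer

    # HF tokenizer standing in for a tokenizer derived from the ggml file
    tokenizer = AutoTokenizer.from_pretrained("AlekseyKorshuk/vicuna-7b")

    # llama.cpp model loaded from the local ggml weights
    llm = Llama(model_path="../ggml-vicuna-7b-1.0-uncensored-q4_0.bin", n_threads=8)

    # Tokenization happens on the HF side, generation on the llama.cpp side.
    prompt = "Say hello."
    ids = tokenizer(prompt)["input_ids"]
    print(tokenizer.decode(ids))
    print(llm(prompt, max_tokens=16)["choices"][0]["text"])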

@lbeurerkellner (Collaborator)

Thank you, this is so awesome! I have been meaning to look into llama support myself, but many other things are important right now, too. I will look into this more closely tomorrow, hopefully. I think we can find a good fix for the tokenizer, but the HF workaround is also good to get things running for now.

Thank you so much.

@lbeurerkellner (Collaborator) commented Apr 18, 2023

Just a heads up: I just bumped main to transformers==4.28.1, so this should bring in LlamaTokenizer, etc. But given that you were already using it, I assume you are already running a later version of transformers. Not providing clean_up_tokenization_spaces seems to cause a bug for now, at least with GPT models. I'm not sure whether True or False is the right choice here; I will have to investigate more.

@CircArgs (Author)

@lbeurerkellner I did bump it to transformers==4.28.1 and set clean_up_tokenization_spaces to True, which seemed to fix it; this is where the error originates. It looks for the argument or falls back to the config, both of which are False by default, I think.
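
For reference, a rough sketch of what the flag changes during decoding (illustrative; GPT-2 is used only because the bug shows up with GPT models):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    ids = tokenizer("Hello , world !")["input_ids"]

    # Passing the flag explicitly avoids relying on the tokenizer-config default,
    # which is what the fix above amounts to.
    print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))   # "Hello, world!"
    print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # "Hello , world !"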

For this PR though, does this matter? It looked like, with 3736fcb, you were moving towards a different design where this "proxy" tokenizer would no longer be needed.

@gottlike

@lbeurerkellner Since ggml is making great progress every week and already provides stellar performance on Apple Silicon and GPUs: is there any update on finalizing and merging this PR?

@lbeurerkellner (Collaborator) commented Jun 18, 2023

Yes. We have recently simplified the backend infrastructure greatly, which makes an integration of llama.cpp much easier. Before that, however, we first need to fix the integration of a Llama tokenizer. The HF one seems to behave in a non-standard way that needs special handling, e.g. see #95.
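
Roughly, the kind of behaviour I mean (a sketch based on my assumptions about the issue; the tokenizer-only checkpoint used here is just an illustrative stand-in):

    from transformers import AutoTokenizer

    # Any Llama tokenizer checkpoint would do; this tokenizer-only repo is just a stand-in.
    tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

    ids = tok("Hello world", add_special_tokens=False)["input_ids"]

    # Decoding the full sequence keeps the space between words ...
    print(tok.decode(ids))                           # typically "Hello world"
    # ... but decoding token by token drops it, because the sentencepiece
    # leading-space marker is stripped from each individually decoded token.
    print("".join(tok.decode([i]) for i in ids))     # typically "Helloworld"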

In general, though, I am actually quite excited about adding support, and I will definitely try to push this more.

@lbeurerkellner (Collaborator)

I just merged llama.cpp support in #111. It leverages the new backend infrastructure, which makes it rather easy to support now. This gives us an initial implementation; however, I think work could definitely continue here to improve support with respect to caching and to enable batch_size > 1 on the backend.

Closing this now. Nevertheless, thanks so much for the proposal and the initial implementation. It was a great help in doing this port to the updated infrastructure.
