
Local models #105

Open · mudler opened this issue Apr 10, 2023 · 5 comments

mudler commented Apr 10, 2023

Hey 👋 !

Awesome project!

I'm trying to run chatgpt-web with llama.cpp. I've created a project using golang llama.cpp bindings, https://github.com/go-skynet/llama-cli, which mimics the OpenAI API to be 1:1 compatible, but serves multiple models that can run locally instead.

It all seems to work so far, and I'd like to document how to get them working together, so chatgpt-web can be used with local models. However, I'm struggling because chatgpt-web seems to filter the models returned by the API against the models available from OpenAI: llama-cli returns a list of models, but the filtering chatgpt-web does prevents selecting models from that list (e.g. alpaca can't be run unless I do some hardwiring on the API).

If you want to test it, you need to run llama-cli from the latest image built from master, like so:

./llama-cli api --address 0.0.0.0:8080 --models-path models-path-here --threads 14

And set the VITE_API_BASE accordingly in the .env file.
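
For example, assuming llama-cli is listening on localhost port 8080 as in the command above, the entry would be something like (adjust the host/port to your setup):

VITE_API_BASE=http://localhost:8080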

It would be super-cool if we could work together to add the capability to load local models, maybe directly adding options to run them side by side with docker-compose (that's what I'm currently doing!). WDYT?

Niek (Owner) commented Apr 11, 2023

Thanks! llama-cli with the API addition sounds like a great match with ChatGPT-web!
The models don't work because we hard-code an explicit list of supported models:

export const supportedModels = [ // See: https://platform.openai.com/docs/models/model-endpoint-compatibility
  'gpt-4',
  'gpt-4-0314',
  'gpt-4-32k',
  'gpt-4-32k-0314',
  'gpt-3.5-turbo',
  'gpt-3.5-turbo-0301'
]

This can be quite easily fixed though. I guess we should support everything with ggml and assume a $0 cost for these models. The model selection needs some work in any case. I tested with ggml-vicuna-7b-4bit and it worked well, although the output was gibberish.
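
Roughly the direction I have in mind for the model check (just a sketch with illustrative names, not the actual ChatGPT-web code):

// Illustrative sketch only: keep the known OpenAI list, but also accept
// local ggml models and price them at $0.
const openAiModels = [
  'gpt-4', 'gpt-4-0314', 'gpt-4-32k', 'gpt-4-32k-0314',
  'gpt-3.5-turbo', 'gpt-3.5-turbo-0301'
]

// Assumption: local models are reported with a 'ggml-' prefix (e.g. ggml-vicuna-7b-4bit).
export const isSupportedModel = (id: string): boolean =>
  openAiModels.includes(id) || id.startsWith('ggml-')

// Placeholder pricing lookup; the real OpenAI price table lives elsewhere.
const openAiCostPerToken = (id: string): number => 0

export const getCostPerToken = (id: string): number =>
  openAiModels.includes(id) ? openAiCostPerToken(id) : 0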

Are you planning on adding streaming support to the API as well (using EventSource/SSE)?
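
For reference, on the client side the consumption would look roughly like this, assuming an OpenAI-compatible /v1/chat/completions endpoint that emits SSE data chunks and a final data: [DONE] event (just a sketch, not ChatGPT-web's actual code):

// Sketch: read an OpenAI-style SSE stream with fetch and surface tokens as they arrive.
async function streamChat (apiBase: string, body: Record<string, unknown>, onToken: (t: string) => void): Promise<void> {
  const res = await fetch(`${apiBase}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ ...body, stream: true })
  })
  const reader = res.body!.getReader()
  const decoder = new TextDecoder()
  let buffer = ''
  for (;;) {
    const { done, value } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })
    let sep: number
    while ((sep = buffer.indexOf('\n\n')) !== -1) { // SSE events are separated by a blank line
      const event = buffer.slice(0, sep).trim()
      buffer = buffer.slice(sep + 2)
      if (!event.startsWith('data:')) continue
      const data = event.slice('data:'.length).trim()
      if (data === '[DONE]') return // end-of-stream marker used by the OpenAI streaming API
      const chunk = JSON.parse(data)
      const token = chunk.choices?.[0]?.delta?.content // partial content of a chat completion
      if (token) onToken(token)
    }
  }
}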

mudler (Author) commented Apr 11, 2023

> Thanks! llama-cli with the API addition sounds like a great match with ChatGPT-web! The models don't work because we hard-code an explicit list of supported models:
>
> export const supportedModels = [ // See: https://platform.openai.com/docs/models/model-endpoint-compatibility
>   'gpt-4',
>   'gpt-4-0314',
>   'gpt-4-32k',
>   'gpt-4-32k-0314',
>   'gpt-3.5-turbo',
>   'gpt-3.5-turbo-0301'
> ]
>
> This can be quite easily fixed though. I guess we should support everything with ggml and assume a $0 cost for these models. The model selection needs some work in any case.

Yup, I managed to find that bit, so I was wondering what direction to take (I don't like forking!), but that sounds good to me! I'd then be more than happy to provide a docker-compose file in llama-cli as well, to point users directly at chatgpt-web!

> I tested with ggml-vicuna-7b-4bit and it worked well, although the output was gibberish.

It needs a prompt to be injected in each call; I've just updated the API docs to cover that:
https://github.com/go-skynet/llama-cli#web-interface. TL;DR: just add a corresponding "model-file-name.bin.tmpl" file with the default prompt, for instance:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:

(but for vicuna/chat I think it would be slightly different)
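
For a vicuna/chat-style model it would presumably be something along these lines (untested, just to give the idea; the exact wording depends on the fine-tune):

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human:
{{.Input}}

### Assistant: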

> Are you planning on adding streaming support to the API as well (using EventSource/SSE)?

This comes with a high computational cost, so I'm not really going in that direction for now: CGO calls are really expensive, and if we want to stream token-by-token by calling the underlying C functions directly from Go, that will likely bump response time by quite a lot.

mkellerman commented

Guys, I just wanna say thanks! This is a beautiful collaboration between two amazing projects!

mkellerman commented

Regarding the models, I think we need to let the user add endpoints, instead of a single 'openai' URL.

Say you want to use openai/gpt-4: you select the model from the dropdown, and you hit [+] to add a custom endpoint and a custom return object.

And just give enough info in the docs on how to POST/GET from the custom endpoints.
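
Conceptually something like this (just a sketch of the shape; the field names are made up, not ChatGPT-web's API):

// Made-up field names, only meant to illustrate per-model custom endpoints.
interface CustomEndpoint {
  label: string                               // what shows up in the model dropdown
  baseUrl: string                             // e.g. 'https://api.openai.com' or 'http://localhost:8080'
  model: string                               // model id sent in the request body
  extractText?: (response: unknown) => string // optional mapping for a custom return object
}

const endpoints: CustomEndpoint[] = [
  { label: 'openai/gpt-4', baseUrl: 'https://api.openai.com', model: 'gpt-4' },
  { label: 'local/alpaca', baseUrl: 'http://localhost:8080', model: 'alpaca' }
]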

mudler (Author) commented Apr 12, 2023

Re: token streaming, JFYI it's being tracked in go-skynet/go-llama.cpp#4. However, I still think it would incur a high computational cost and decrease overall performance, but I'll be glad to take a stab at it next.
