Introduce a new foundation for Dev Services like experience #557
Conversation
Force-pushed from 007f9e1 to e8deb32
I am going to mark this as ready for review, as I believe it's good to go as is, and the OpenAI compatibility (which is very limited in Ollama currently anyway) can be introduced in a follow-up.
Review thread on ...a/deployment/src/main/java/io/quarkiverse/langchain4j/ollama/deployment/OllamaProcessor.java (outdated, resolved)
So the idea is to allow using Ollama, Podman AI Lab, AI Studio, etc. to "host" the same or closest-approximation model? And not necessarily tied to a container runtime (Docker/Podman)? Definitely +1 :)
The basic idea behind this is that the LLM model being run and the API being exposed are not always tightly coupled; for example, the Mistral models can be run in Ollama, not just in Mistral's SaaS offering. This PR sets the foundation for having an inference server (for now only Ollama) run inference for the LLM models the user configured. In the Ollama implementation we are able to instruct Ollama to pull the selected model name (if it exists and is not already present) and apply the necessary configuration behind the scenes, so that each configured LangChain4j chat model uses the inference server (a minimal configuration sketch follows after this list). The following remains to be done:
* Use the same idea for the rest of the LangChain4j models (like embedding models)
* Use Ollama's OpenAI compatibility mode to support the OpenAI extension
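For illustration, the intended experience is that a single property drives everything. The property name below follows the extension's existing Ollama config naming, and the model name is just an example:

```properties
# Assumed single-property setup: the dev service sees the configured model
# and instructs Ollama to pull it if it is not already available locally.
quarkus.langchain4j.ollama.chat-model.model-id=llama3
```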
Another interesting fact is that InstructLab supports serving the trained models using an OpenAI-compatible API.
@cescoffier have you had a chance to test this one?
I'm not sure how I should be using it. I tried with Ollama (which works OOTB for me, nothing to do); the dev service was ignored. Also, it seems that there is some API mismatch: Ollama is using […]
You need to update the version if you are using the samples.
For the time being, I have not used the OpenAI compatibility stuff, for reasons I explained above.
If I use:
[…]
it works, but it was working already (the model is already pulled and the server is running). What change should I expect?
You can try to use something like […]
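The suggestion here is presumably a model that is not already present locally, so the pull kicks in. For illustration only (both the property name and the model name are assumptions):

```properties
# Hypothetical example: pointing the chat model at a model that has not
# been pulled yet, so the dev service has to fetch it first.
quarkus.langchain4j.ollama.chat-model.model-id=mistral
```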
Ah ok, it pulls the model, that's what I was missing!
Yeah, and it's the dev experience one expects, where you only need to configure what is necessary (unlike the existing dev service, where you need multiple things configured).
Hum, something seems to be broken. It pulls the model, but then the application was not calling it; actually nothing was called:

```
Listening for transport dt_socket at address: 5005
[Quarkus startup banner]
```

Tried to invoke the model ...

```
2024-05-13 16:00:59,858 INFO [io.quarkus] (Shutdown thread) quarkus-langchain4j-sample-review-triage stopped in 0.008s
```

Worked with a restart.
BTW, when I said it worked, it is a bit weird. The response was the following:
[…]
That's an issue with the model though, no?
@geoand Removed llama3 and it re-pulled it, and then it worked perfectly!
If things work as expected for you, I would like to get this in so I can proceed to improve on it later without having the PR go stale (as it has a lot of small changes).
The basic idea behind this is that the LLM model being run and the API being exposed are not always tightly coupled; for example, the Mistral models can be run in Ollama, not just in Mistral's SaaS offering.
This PR sets the foundation for having an inference server (for now only Ollama) run inference for the LLM models the user configured.
In the Ollama implementation we are able to instruct Ollama to pull the selected model name (if it exists and is not already present) and apply the necessary configuration behind the scenes, so that each configured LangChain4j chat model uses the inference server.
It is important to note that for this to work, the configuration of the model to be used must now become a build-time property (which probably makes sense regardless).
The following remains to be done:
* Use the same idea for the rest of the LangChain4j models (like embedding models)
* Update the documentation
The way the PR has been done, this would allow other inference servers to be added in the future with minimal changes (the most significant of which would be a way to resolve the conflict where multiple servers can serve a model). A rough sketch of the check-and-pull logic is shown below.
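To make the check-and-pull mechanism concrete, here is a minimal standalone sketch of the idea. It is not the PR's actual implementation: it talks directly to Ollama's public REST API (/api/tags to list local models, /api/pull to download one) and assumes the default endpoint on localhost:11434:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Standalone sketch of "pull the configured model if it is missing".
// This mirrors the idea described above, not the extension's real code.
public class OllamaModelPuller {

    private static final String OLLAMA = "http://localhost:11434";

    public static void main(String[] args) throws IOException, InterruptedException {
        String model = args.length > 0 ? args[0] : "llama3";
        HttpClient client = HttpClient.newHttpClient();

        // 1. List the locally available models and check (crudely) for ours.
        HttpRequest tags = HttpRequest.newBuilder(URI.create(OLLAMA + "/api/tags"))
                .GET().build();
        String local = client.send(tags, HttpResponse.BodyHandlers.ofString()).body();
        if (local.contains("\"" + model)) {
            System.out.println(model + " is already present, nothing to do");
            return;
        }

        // 2. Ask Ollama to pull the model; the response streams JSON status lines.
        HttpRequest pull = HttpRequest.newBuilder(URI.create(OLLAMA + "/api/pull"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"name\":\"" + model + "\"}"))
                .build();
        client.send(pull, HttpResponse.BodyHandlers.ofLines())
                .body()
                .forEach(System.out::println);
    }
}
```

A build-time processor such as OllamaProcessor can apply the same logic during augmentation, which is why the model name has to be a build-time property.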