
Introduce a new foundation for Dev Services like experience #557

Merged
merged 1 commit into main from #505 on May 13, 2024

Conversation

geoand
Collaborator

@geoand geoand commented May 9, 2024

The basic idea behind this is that the LLM model being run and the API being exposed are not always tightly coupled - for example, the Mistral models can be run in Ollama, not just in Mistral's SaaS offering.
This PR sets the foundation for having an inference server (for now only Ollama) be able to run inference for the LLM models the user configured.
In the Ollama implementation we are able to instruct Ollama to pull the selected model name (if it exists and if it is not already present) and make the necessary configuration behind the scenes so that each configured LangChain4j chat model will use the inference server.

It is important to note that for this to work, the configuration of the model to be used must now become a build-time property (which probably makes sense regardless).
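
To make this concrete, a minimal sketch of what the user-facing configuration might look like, using the model-id property discussed later in this thread (the model name here is only an example):

# build-time property selecting the model; behind the scenes the extension asks
# Ollama to pull it (if not already present) and points the chat model at the
# local Ollama server
quarkus.langchain4j.ollama.chat-model.model-id=phi3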

The following remains to be done:

  • Use the same idea for the rest of the LangChain4j models (like embedding models)
  • Use Ollama's OpenAI compatibility mode to support the OpenAI extension
  • Update the documentation

The way the PR has been structured, other inference servers could be added in the future with minimal changes (the most significant of which would be a way to resolve the conflict where multiple servers can serve the same model).
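
One possible shape for that conflict resolution, purely as an illustration and not something this PR implements (the provider-selection property below is an assumption, not verified against this change):

# hypothetical: if more than one inference server could serve the configured model,
# an explicit provider selection could disambiguate
quarkus.langchain4j.chat-model.provider=ollama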

@geoand geoand requested a review from a team as a code owner May 9, 2024 12:42
@geoand geoand marked this pull request as draft May 9, 2024 12:42
@geoand geoand force-pushed the #505 branch 8 times, most recently from 007f9e1 to e8deb32 on May 9, 2024 15:10
@geoand geoand marked this pull request as ready for review May 9, 2024 15:11
@geoand
Collaborator Author

geoand commented May 9, 2024

I am going to mark this as ready for review as I believe that it's good to go as is and the OpenAI compatibility (which is very limited in Ollama currently anyway) can be introduced in a follow up.

@maxandersen
Member

So the idea is to allow using Ollama, Podman AI Lab, AI Studio, etc. to "host" the same (or closest approximation) model?

And not necessarily tied to a container runtime (docker/podman) ?

Definitely +1 :)

@geoand
Collaborator Author

geoand commented May 10, 2024

Another interesting fact is that InstructLab supports serving the trained models using an OpenAI compatible API.

@geoand
Collaborator Author

geoand commented May 13, 2024

@cescoffier have you had a chance to test this one?
I'm really interested in your experience with it as well (not least because you opened the original issue).

@cescoffier
Collaborator

I'm not sure how I should be using it.

I tried with Ollama (which works OOTB for me - nothing to do), but the dev service was ignored.

Also, it seems that there is some API mismatch:

Ollama is using http://localhost:53977/v1/api/chat while I would have expected: http://localhost:53977/v1/chat/completions

@geoand
Collaborator Author

geoand commented May 13, 2024

You need to update the version if you are using the samples

@geoand
Collaborator Author

geoand commented May 13, 2024

Ollama is using http://localhost:53977/v1/api/chat while I would have expected: http://localhost:53977/v1/chat/completions

For the time being, I have not used the OpenAI compatibility mode, for the reasons I explained above.
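
For context, a rough sketch of what that follow-up could look like once Ollama's OpenAI compatibility mode is wired in: pointing the OpenAI extension at the local Ollama server. The property values below are assumptions for illustration, not what this PR currently does:

# hypothetical follow-up (not part of this PR): route the OpenAI extension
# through Ollama's OpenAI-compatible endpoint
quarkus.langchain4j.openai.base-url=http://localhost:11434/v1
quarkus.langchain4j.openai.chat-model.model-name=phi3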

@cescoffier
Collaborator

If I use:

quarkus.langchain4j.ollama.chat-model.temperature=0.8
quarkus.langchain4j.openai.timeout=60s

it works, but it was working already (the model is already pulled and the server is running). What change should I expect?

@geoand
Collaborator Author

geoand commented May 13, 2024

You can try to use something like quarkus.langchain4j.ollama.chat-model.model-id=phi3 in order to work with a new model (the default is llama3 which I guess you already have locally)

@cescoffier
Collaborator

Ah ok, it pulls the model, that's what I was missing!

@geoand
Collaborator Author

geoand commented May 13, 2024

Yeah, and it's the dev experience one expects where you only need to configure what is necessary (unlike the existing dev service where you need multiple things configured)

@cescoffier
Collaborator

Hum, something seems to be broken.

It pulls the model, but then the application was not calling it - actually nothing was called:

Listening for transport dt_socket at address: 5005


[Quarkus ASCII banner]
2024-05-13 15:58:38,675 WARN [io.qua.config] (Quarkus Main Thread) Unrecognized configuration key "quarkus.langchain4j.openai.timeout" was provided; it will be ignored; verify that the dependency extension for this configuration is set or that you did not make a typo
2024-05-13 15:58:39,043 INFO [io.quarkus] (Quarkus Main Thread) quarkus-langchain4j-sample-review-triage 1.0-SNAPSHOT on JVM (powered by Quarkus 3.8.2) started in 72.372s. Listening on: http://localhost:8080
2024-05-13 15:58:39,044 INFO [io.quarkus] (Quarkus Main Thread) Profile dev activated. Live Coding activated.
2024-05-13 15:58:39,044 INFO [io.quarkus] (Quarkus Main Thread) Installed features: [cdi, langchain4j, langchain4j-ollama, qute, rest-client-reactive, rest-client-reactive-jackson, resteasy-reactive, resteasy-reactive-jackson, smallrye-context-propagation, vertex]

Tried to invoke the model ....

2024-05-13 16:00:59,858 INFO [io.quarkus] (Shutdown thread) quarkus-langchain4j-sample-review-triage stopped in 0.008s

Worked with a restart.

@cescoffier
Collaborator

BTW, when I said it worked, it is a bit weird. The response was the following:

- body: {"model":"phi3","created_at":"2024-05-13T14:01:17.342015Z","message":{"role":"assistant","content":"{\n  \"evaluation\": \"POSITIVE\",\n  \"message\": \"Thank you for your kind words!\"\n}\n\nTo determine the sentiment analysis and language identification programmatically, one would typically use a natural language processing library capable of multi-language support, such as `spaCy`, with additional preprocessing to handle transliterations or direct translation equivalents. However, since this is an example response format without actual code execution, I've manually categorized the sentiment and provided an appropriate message in English for clarity."},"done_reason":"stop","done":true,"total_duration":6125478875,"load_duration":2639789458,"prompt_eval_count":348,"prompt_eval_duration":881733000,"eval_count":112,"eval_duration":2595494000}

@geoand
Collaborator Author

geoand commented May 13, 2024

That's an issue with the model though, no?

@cescoffier
Collaborator

@geoand Removed llama3 and it re-pulled it and then it worked perfectly!

@geoand
Collaborator Author

geoand commented May 13, 2024

If things work as expected for you, I would like to get this in so I can proceed to improve on it later without having the PR go stale (as it has a lot of small changes).

@geoand geoand merged commit edce572 into main May 13, 2024
12 checks passed
@geoand geoand deleted the #505 branch May 13, 2024 18:12