In this repo we show how to use Ollama to easily run LLMs like Llama 3 and Qwen, how to build a chat interface for your model with Gradio, and how to give your model inference or training workloads access to the GPUs on your Kubernetes cluster.
We use k3s with the default configuration; you can install it with a one-liner:
curl -sfL https://get.k3s.io | sh -
Install the NVIDIA Container Toolkit.
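On an Ubuntu or Debian node this is roughly a package install followed by a k3s restart so its bundled containerd picks up the NVIDIA runtime; a minimal sketch, assuming NVIDIA's apt repository is already configured (see NVIDIA's documentation for the repository setup on your distribution):
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart k3s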
Make sure that the nvidia RuntimeClass exists:
➜ root git:(main) ✗ kubectl get runtimeclass | grep nvidia
nvidia nvidia 15d
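If the RuntimeClass is missing, you can create it yourself. A minimal manifest, assuming k3s has already registered the NVIDIA runtime in its containerd configuration:
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF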
We are going to use the ollama/ollama image from Docker Hub and deploy it as an LLM manager.
Deploy it to the cluster with onechart as follows:
helm repo add onechart https://chart.onechart.dev && helm repo update
helm install llm-manager onechart/onechart \
--set image.repository=ollama/ollama \
--set image.tag=latest \
--set containerPort=11434 \
--set podSpec.runtimeClassName=nvidia
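Once the release is installed, you can check that the pod came up and that Ollama detected the GPU; the deployment name below assumes onechart names it after the llm-manager release:
kubectl rollout status deployment/llm-manager
kubectl logs deployment/llm-manager | grep -i gpu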
Now we have an llm-manager that can serve any open-source model we want, like Llama 3, Phi, and more! You can find all the available models provided by Ollama on their official website.
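For a quick smoke test you can talk to the Ollama HTTP API directly, for example by port-forwarding the pod and pulling a model; the deployment name is assumed from the helm release above, and 11434 is Ollama's default port:
kubectl port-forward deployment/llm-manager 11434:11434 &
curl http://localhost:11434/api/pull -d '{"name": "llama3"}'
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'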
Ollama is optimized for both CPU and GPU, and detection happens automatically. If your GPU does not have enough memory for a given model, Ollama falls back to the CPU and system memory.
We have set up a small Python script, chatbot.py, that gives you a web chat interface and lets you choose which LLM model you want to experiment with.
Deploy the chatbot to Kubernetes as follows:
helm install chatbot onechart/onechart \
--set image.repository=ghcr.io/biznesbees/chatbot-v0.1.0 \
--set image.tag=latest \
--set vars.OLLAMA_HOST=http://llm-manager
- Queries are sent to the llm-manager.
- If the requested model is not available, it is downloaded from the Ollama library.
- Inference then proceeds.
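To try the chat UI from your machine, you can port-forward the chatbot pod and open it in a browser; this assumes onechart names the deployment after the release and that the Gradio app listens on its default port 7860, so adjust the port if your setup differs:
kubectl port-forward deployment/chatbot 7860:7860
Then open http://localhost:7860.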
The following diagram shows all the components and how they interact.