In this repo we show how to use Ollama to easily run LLMs like Llama 3 and Qwen, how to build a chat interface for your model with Gradio, and how to give your model inference or training workloads access to the GPUs on your Kubernetes cluster.
We use k3s with the default configuration; you can install it with a one-liner:
curl -sfL https://get.k3s.io | sh -
Install the NVIDIA Container Toolkit.
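On an Ubuntu or Debian node this is roughly a package install followed by a k3s restart so its bundled containerd picks up the NVIDIA runtime; a minimal sketch, assuming NVIDIA's apt repository is already configured (see NVIDIA's documentation for the repository setup on your distribution):
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart k3s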
Make sure that the nvidia RuntimeClass exists:
➜ root git:(main) ✗ kubectl get runtimeclass | grep nvidia
nvidia nvidia 15d
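If the RuntimeClass is missing, you can create it yourself. A minimal manifest, assuming k3s has already registered the NVIDIA runtime in its containerd configuration:
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF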
We are going to use the ollama/ollama image from Docker Hub and deploy it as an LLM manager.
Deploy it to the cluster with onechart as follows:
helm repo add onechart https://chart.onechart.dev && helm repo update
helm install llm-manager onechart/onechart \
--set image.repository=ollama/ollama \
--set image.tag=latest \
--set containerPort=11434 \
--set podSpec.runtimeClassName=nvidia
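Once the release is installed, you can check that the pod came up and that Ollama detected the GPU; the deployment name below assumes onechart names it after the llm-manager release:
kubectl rollout status deployment/llm-manager
kubectl logs deployment/llm-manager | grep -i gpu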
Now we have an llm-manager that can serve any open-source model we want, like Llama 3, Phi, and more! You can find all the available models provided by Ollama on their official website.
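For a quick smoke test you can talk to the Ollama HTTP API directly, for example by port-forwarding the pod and pulling a model; the deployment name is assumed from the helm release above, and 11434 is Ollama's default port:
kubectl port-forward deployment/llm-manager 11434:11434 &
curl http://localhost:11434/api/pull -d '{"name": "llama3"}'
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'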
Ollama is optimized for both CPU and GPU, and detection happens automatically. If your GPU does not have enough memory for a given model, Ollama falls back to the CPU and system memory.
We have set up a small Python script, chatbot.py, that gives you a web chat interface and lets you choose which LLM model you want to experiment with.
Deploy the chatbot to Kubernetes as follows:
helm install chatbot onechart/onechart \
--set image.repository=ghcr.io/biznesbees/chatbot-v0.1.0 \
--set image.tag=latest \
--set vars.OLLAMA_HOST=http://llm-manager
- Queries are sent to the llm-manager.
- If the requested model is not available, it is downloaded from the Ollama library.
- Inference then proceeds.
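To try the chat UI from your machine, you can port-forward the chatbot pod and open it in a browser; this assumes onechart names the deployment after the release and that the Gradio app listens on its default port 7860, so adjust the port if your setup differs:
kubectl port-forward deployment/chatbot 7860:7860
Then open http://localhost:7860.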
The following diagram shows all the components and how they interact.