
kubernetes example #6546

Open
phymbert opened this issue Apr 8, 2024 · 16 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), kubernetes (Helm & Kubernetes), server/webui

Comments

phymbert (Collaborator) commented Apr 8, 2024

Motivation

Kubernetes is widely used in industry to deploy products and applications at scale.

It would be useful for the community to have a llama.cpp Helm chart for the server.

I started one several weeks ago and will continue when I have more time; meanwhile, any help is welcome:

https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes

phymbert added the enhancement (New feature or request), server/webui, kubernetes (Helm & Kubernetes), and help wanted (Extra attention is needed) labels on Apr 8, 2024
OmegAshEnr01n commented

Hi! I will take this up!

phymbert (Collaborator, Author) commented Apr 10, 2024

Great @OmegAshEnr01n, a few notes:

  • I think we need two subcharts: one for embeddings, one for generation/completions
  • the schema in my branch probably needs updating, since the model will now be downloaded by the server directly, and the related Job should be removed
  • we need to support both HF URL parameters and raw URLs for internal model repositories like Artifactory
  • metrics scraping must work for Prometheus community (with the PodMonitoring resource), enterprise, and ideally Dynatrace
  • the PVC must remain after the Helm release is uninstalled (see the sketch below)
  • auto-scaling can be done later, but it is a must-have
  • ideally the chart should be built by the CI and installable from gh-pages

Ping here if you have questions. Good luck! Excited to use it.
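
For the model-source and PVC-retention points, a minimal sketch of what the chart could expose. The values keys, helper name, and flag mapping here are illustrative assumptions, not taken from the branch; the helm.sh/resource-policy annotation is the standard Helm mechanism for keeping a resource after uninstall:

# values.yaml (hypothetical layout)
model:
  # Hugging Face repo/file pair, mapped to the server's --hf-repo/--hf-file options...
  hfRepo: ""
  hfFile: ""
  # ...or a raw URL (e.g. Artifactory), mapped to --model-url
  url: ""
persistence:
  size: 20Gi

# templates/pvc.yaml -- the annotation tells Helm to leave the PVC
# in place when the release is uninstalled
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ include "llama-cpp.fullname" . }}-models   # hypothetical named helper
  annotations:
    "helm.sh/resource-policy": keep
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: {{ .Values.persistence.size }}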

phymbert (Collaborator, Author) commented

Hi @OmegAshEnr01n, are you still working on this issue?

OmegAshEnr01n commented

Yes, still am. Will share a pull request over the weekend when completed.

OmegAshEnr01n commented Apr 25, 2024

Hi @phymbert

What is the architectural reason for having embeddings live in a separate deployment from the model? Requiring that would mean making changes to the HTTP server. Instead, we could have an architecture where the model and embeddings are tightly coupled. Something like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      # one container per entry in .Values.containers (e.g. generation, embeddings)
      {{- range $i, $container := .Values.containers }}
      - name: my-container-{{ $i }}
        image: {{ $container.image }}
        volumeMounts:
        - name: data-volume-{{ $i }}
          mountPath: /data
      {{- end }}
      volumes:
      # a matching PVC-backed volume per container
      {{- range $i, $container := .Values.containers }}
      - name: data-volume-{{ $i }}
        persistentVolumeClaim:
          claimName: pvc-{{ $i }}
      {{- end }}
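
Here, .Values.containers would be fed from values along these lines (the image reference is an assumption; a server image is published under ghcr.io/ggerganov/llama.cpp):

containers:
  - image: ghcr.io/ggerganov/llama.cpp:server   # generation/completions
  - image: ghcr.io/ggerganov/llama.cpp:server   # embeddings (started with --embedding)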

On another note, what is the intended use of Prometheus? Do you need it to live alongside the Helm chart or within it as a subchart? I don't see the value in adding Prometheus as a subchart. Perhaps you can share your view on it as well.

phymbert (Collaborator, Author) commented

Embeddings models are different from generative ones. In a RAG setup you need two models.

Prometheus is not required, but if it is present, metrics are exported.
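
For reference, if the prometheus-operator CRDs are installed, scraping could be wired up with something like the following. This is a sketch assuming the server runs with --metrics and the chart's Service carries the app: my-app label and a named http port:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llama-cpp
  labels:
    release: prometheus    # whatever label your Prometheus instance selects
spec:
  selector:
    matchLabels:
      app: my-app          # must match the chart's Service labels
  endpoints:
    - port: http           # named port on the Service
      path: /metrics       # exposed by the server when --metrics is enabled
      interval: 15s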

OmegAshEnr01n commented

OK, just to clarify: server.cpp has a route for requesting embeddings, but the existing server code doesn't include the option to send embeddings for completions. That would need to be written before the Helm chart can be completed. Kindly correct me if I'm wrong.

phymbert (Collaborator, Author) commented

Embeddings are meant to be stored in a vector DB for search; there is nothing related to completions except RAG later on. There is nothing to change in the server code.

ceddybi commented May 3, 2024

@OmegAshEnr01n Sir, is the chart ready for production? 🚀🚀🚀🚀

OmegAshEnr01n commented

Not yet. Currently testing it on a personal kube cluster with separate node selectors.

Perdjesk commented Jun 20, 2024

@phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load balancing is not suitable for llama.cpp:

Typical strategies like round robin or least connections are not effective for llama.cpp servers, which need slots for continuous batching and concurrent requests. ... Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution.

From your experience with your k8s example, is the k8s Service load balancing enough, or would you find it necessary to use a "slot-aware" load balancer?

/cc @mcharytoniuk

mcharytoniuk (Contributor) commented

@phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load balancing is not suitable for llama.cpp:

Typical strategies like round robin or least connections are not effective for llama.cpp servers, which need slots for continuous batching and concurrent requests. ... Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution.

Thanks for the mention. I maintain that point. Of course round robin will work, and "least connections" will be better (though it does not necessarily reflect how many slots are in use), but the issue is that prompts can take a long and varying time to finish. With round robin it is very possible to distribute the load unevenly (for example, if one of the servers was unlucky and is still processing a few huge prompts). To me the ideal is balancing based on slots, with a request queue on top of that (which I plan to add to Paddler, by the way :)). I love the slots idea because it makes the infra really predictable.
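
A partial approximation is possible even with a stock k8s Service: at the time of this thread, the server's /health endpoint accepted a fail_on_no_slot query parameter that returns 503 when all slots are busy, so a readiness probe against it takes a saturated pod out of the Service endpoints until a slot frees up. A sketch, with the image and port as assumptions:

containers:
  - name: llama-cpp
    image: ghcr.io/ggerganov/llama.cpp:server
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        # 503 while every slot is busy -> pod is temporarily removed
        # from the Service endpoints; existing connections keep streaming
        path: /health?fail_on_no_slot=1
        port: 8080
      periodSeconds: 2
      failureThreshold: 1

This is still not a queue: requests that arrive while every pod is saturated fail rather than wait, which is where a stateful balancer like Paddler goes further.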

phymbert (Collaborator, Author) commented Jul 7, 2024

@phymbert From your experience with your k8s example, is the k8s Service load balancing enough, or would you find it necessary to use a "slot-aware" load balancer?

Firstly, it's better to use the native llama.cpp KV cache: if you have k8s nodes with 2-4 A100/H100 GPUs, having one pod per node that uses all the VRAM, with as many slots and as much cache as possible for the server, will give you maximum performance, but not HA.
Then, regarding load balancing, I tested IP affinity, round robin, and least connections, and found no significant differences. I think it depends on the dataset/use case or the client distribution.

Maybe an interesting approach would be to prioritize upfront based on input token count; nonetheless, you cannot predict the output token count.

I mainly faced issues with long-lived HTTP connections; IMHO we need a better architecture for this than SSE.
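
The IP-affinity variant mentioned above needs no extra component; it maps onto a plain Service field. A minimal sketch, with names assumed:

apiVersion: v1
kind: Service
metadata:
  name: llama-cpp
spec:
  selector:
    app: my-app
  sessionAffinity: ClientIP        # pin each client IP to one pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600         # default affinity window is 10800s (3h)
  ports:
    - name: http
      port: 8080
      targetPort: 8080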

OmegAshEnr01n commented

@phymbert I've made a pull request.

phymbert (Collaborator, Author) commented Jul 22, 2024

I've made a pull request.

The PR is on my fork:

phymbert#7

We need to bring it here somehow.

anencore94 commented

Hope to meet soon
