HPA improvements (#386)
* HPA: Support scaling each ChatQnA subcomponent separately

By moving HPA settings from Chart global section to subchart specific sections.

Fix also links in README table.

* HPA: reduce scaling of TGI CPU pod replicas

If new pods end up on a node which already has a busy TGI
instance, that will (very) significantly slow the startup of the new
pod(s). Additional load (TGI instances) will just make the issue worse.

Slowing startup helps a bit with that, along with smaller default
maxReplicas value (which is easy for user to increase, once TGI pod
has appropriate resource requests for the selected model and
underlying node HW).

Scale down being also slowed can help with smaller request fluctuations.

* Support different Prometheus installations

Fixes mismatch between README (Prometheus Helm install) and
configMap name used in adapter config (manifest install).

* Move HPA instructions to its own document

And document need for specifying resources.

* Use separate hpa-values.yaml for enabling HPA

* Add separate cpu-values.yaml for CPU timings & resources

* Rename HPA template files based on Helm best practices

https://helm.sh/docs/chart_best_practices/templates/

* Use chart specific prefix for custom metrics

So that HPA scaling rules use custom metrics for correct set of TGI/TEI instances.

* Fixes for Helm installed Prometheus version

* Use custom metrics configMap name that does NOT match one installed
  by Prometheus, because Helm does not allow overwriting object
  created by another Helm chart (like using manifests would).

* Add Prometheus release name to serviceMonitors. Otherwise Helm
  installed Prometheus does not find serviceMonitors. Alternative
  would be using:
  prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

Reported by Theresa.

* Better variables for enabling timeouts + replica than HPA one

Timeouts are a question of Pod slowness (i.e. whether it is accelerated), not of HPA.

Replicas need to be set only when count is set to something else than
the default value (1).  This will work also for HPA, while making
GMC manifest generation easier.

* Document how to replace existing custom metrics

How to install Prometheus with Helm and fix doc issues.

Having a manual step that replaces existing PrometheusAdapter config
with the generated one should work in all cases (it's a workaround
for Helm refusing to overwrite object created by another chart).

* GMC manifests: fix deployment directly, not in HPA/

Instead of replacing deployments with HPA/ ones, apply fixes
directly to the normal deployment manifests. K8s default is
1 replica, so that can be dropped, which works also better
with HPA.

Because resource requests are model specific, and GMC is used to
change model, HPA/ manifests won't help with that, GMC needs to take
care of that (eventually).

* GMC manifests: update HPA ones to match deployment/service ones

And current Helm charts contents.

* Pre-CI fixes + drop obsolete HPA variable from ChatQnA Chart

Variable was left-over from "Fixes for Helm installed Prometheus version".

* HPA doc example command fixes

---------

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
eero-t authored Sep 11, 2024
1 parent a18404e commit 8d86fff
Showing 36 changed files with 509 additions and 577 deletions.
196 changes: 196 additions & 0 deletions helm-charts/HPA.md
@@ -0,0 +1,196 @@
# HorizontalPodAutoscaler (HPA) support

## Table of Contents

- [Introduction](#introduction)
- [Pre-conditions](#pre-conditions)
  - [Resource requests](#resource-requests)
  - [Prometheus](#prometheus)
- [Gotchas](#gotchas)
- [Enable HPA](#enable-hpa)
  - [Install](#install)
  - [Post-install](#post-install)
- [Verify](#verify)

## Introduction

The `horizontalPodAutoscaler` option enables HPA scaling for the TGI and TEI inferencing deployments:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

Autoscaling is based on custom application metrics provided through [Prometheus](https://prometheus.io/).

## Pre-conditions

Read [post-install](#post-install) steps before installation!

### Resource requests

HPA controlled CPU pods SHOULD have appropriate resource requests or affinity rules (enabled in their
subcharts and tested to work) so that the k8s scheduler does not schedule too many of them on the same
node(s). Otherwise they may never reach ready state.

If you use different models than the default ones, update TGI and TEI resource requests to match
model requirements.

Too large requests are not a problem as long as pods still fit on the available nodes. However,
unless rules have been added to pods preventing them from being scheduled on the same nodes, too
small requests are an issue:

- Multiple inferencing instances interfere with / slow down each other, especially if there are no
  [NRI policies](https://github.com/opea-project/GenAIEval/tree/main/doc/platform-optimization)
  that provide further isolation
- Containers can become non-functional when their actual resource usage crosses the specified limits
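
As a rough illustration of why requests matter for scheduling, the sketch below estimates how many TGI pods could land on one node based on requests alone. The TGI request values are the ones from `chatqna/cpu-values.yaml`; the node's allocatable CPU and memory are hypothetical, and other pods and system reservations are ignored, so treat the result only as an upper bound.

```shell
#!/bin/sh
# Hypothetical node capacity (assumption: check your own nodes instead)
node_cpu=32       # allocatable CPU cores
node_mem_gi=256   # allocatable memory in GiB

# TGI resource requests from chatqna/cpu-values.yaml
req_cpu=6
req_mem_gi=65

# A pod fits only while both its CPU and memory requests still fit on the node
by_cpu=$((node_cpu / req_cpu))
by_mem=$((node_mem_gi / req_mem_gi))
if [ "$by_cpu" -lt "$by_mem" ]; then fit=$by_cpu; else fit=$by_mem; fi

echo "TGI pods fitting on node by requests: $fit (CPU: $by_cpu, memory: $by_mem)"
```

If the number is larger than what the node can actually serve at acceptable speed, the requests are too small for the selected model.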

### Prometheus

If the cluster does not run [Prometheus operator](https://github.com/prometheus-operator/kube-prometheus)
yet, it SHOULD be installed before enabling HPA, e.g. by using a Helm chart for it:
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack

Prometheus-adapter is also needed, to provide k8s custom metrics based on collected TGI / TEI metrics:
https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-adapter

To install (older versions of) them:

```console
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ prom_ns=monitoring # namespace for Prometheus/-adapter
$ kubectl create ns $prom_ns
$ helm install prometheus-stack prometheus-community/kube-prometheus-stack --version 55.5.2 -n $prom_ns
$ kubectl get services -n $prom_ns
$ helm install prometheus-adapter prometheus-community/prometheus-adapter --version 4.10.0 -n $prom_ns \
--set prometheus.url=http://prometheus-stack-kube-prom-prometheus.$prom_ns.svc \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```

NOTE: the service name given above in `prometheus.url` must match the listed Prometheus service name,
otherwise the adapter cannot access it!

(An alternative to setting the above `prometheusSpec` variable to `false` is making sure that the
`prometheusRelease` value in the top-level chart matches the release name given to the Prometheus
install, i.e. when it differs from `prometheus-stack` used above. That value is used to annotate
created serviceMonitors with a label Prometheus requires when the above option is `true`.)

## Gotchas

Why HPA is opt-in:

- Installing custom metrics for HPA requires manual post-install steps, as
  Prometheus-operator and -adapter are missing support needed to automate that
- Top level chart name needs to conform to Prometheus metric naming conventions,
  as it is also used as a metric name prefix (with dashes converted to underscores)
- By default Prometheus adds [k8s RBAC rules](https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-roleBindingSpecificNamespaces.yaml)
  for accessing metrics from `default`, `kube-system` and `monitoring` namespaces. If Helm is
  asked to install OPEA services to some other namespace, those rules need to be updated accordingly
- Unless pod resource requests, affinity rules, scheduling topology constraints and/or cluster NRI
  policies are used to better isolate service inferencing pods from each other, instances
  scaled up on same node may never get to ready state
- Current HPA rules are just examples, for efficient scaling they need to be fine-tuned for given setup
  performance (underlying HW, used models and data types, OPEA version etc.)
- Debugging missing custom metric issues is hard as logs rarely include anything helpful
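
The naming constraint in the list above can be checked before installing. This is a minimal sketch, assuming the prefix is derived from the name simply by converting dashes to underscores as described; the `my-chatqna` name is hypothetical, and the pattern is the character set Prometheus documents for metric names.

```shell
#!/bin/sh
chart="my-chatqna"   # hypothetical name, replace with your own

# Convert dashes to underscores, as done for the custom metric prefix
prefix=$(printf '%s' "$chart" | tr '-' '_')

# Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*
if printf '%s\n' "$prefix" | grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*$'; then
  valid=yes
else
  valid=no
fi
echo "metric prefix: $prefix (valid: $valid)"
```

A name starting with a digit, for example, would fail this check and is better avoided.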

## Enable HPA

### Install

ChatQnA includes pre-configured values files for scaling the services.

To enable HPA, add `-f chatqna/hpa-values.yaml` option to your `helm install` command line.

If **CPU** versions of TGI (and TEI) services are being scaled, resource requests and probe timings
suitable for CPU usage need to be used. Add `-f chatqna/cpu-values.yaml` option to your `helm install`
line. If you need to change model specified there, update the resource requests accordingly.
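
The following sketch assembles such a command line and prints it for review instead of running it. The release name `chatqna` matches the one used later in this document; the chart directory and the `cpu` switch are assumptions for illustration, and a real install will likely need further options of your own.

```shell
#!/bin/sh
chart=chatqna       # release name, as used in the post-install steps
chart_dir=chatqna   # assumed path to the ChatQnA chart directory
cpu=yes             # set to "no" when TGI/TEI run on accelerators

# Start with HPA enabled, then add CPU timings and resources when needed
set -- helm install "$chart" "$chart_dir" -f "$chart_dir/hpa-values.yaml"
if [ "$cpu" = yes ]; then
  set -- "$@" -f "$chart_dir/cpu-values.yaml"
fi

cmd="$*"
echo "$cmd"
```

When several `-f` files set the same value, Helm gives precedence to the last one, so any values file of your own that overrides the model or resources should come after these two.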

### Post-install

The above step created a custom metrics config for Prometheus-adapter, suitable for HPA use.

Take a backup of the existing custom metrics config before replacing it:

```console
$ prom_ns=monitoring # Prometheus/-adapter namespace
$ name=$(kubectl -n $prom_ns get cm --selector app.kubernetes.io/name=prometheus-adapter -o name | cut -d/ -f2)
$ kubectl -n $prom_ns get cm/$name -o yaml > adapter-config.yaml.bak
```

Save generated config with values matching current adapter config:

```console
$ chart=chatqna # OPEA chart release name
$ kubectl get cm/$chart-custom-metrics -o yaml | sed \
-e "s/name:.*custom-metrics$/name: $name/" \
-e "s/namespace: default$/namespace: $prom_ns/" \
> adapter-config.yaml
```
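
To see what the two `sed` expressions above change, the sketch below applies them to a hand-written configMap header instead of live cluster output. The `adapter-config` name is a hypothetical stand-in for whatever `$name` resolved to in the backup step.

```shell
#!/bin/sh
name=adapter-config   # hypothetical name of the existing adapter configMap
prom_ns=monitoring    # Prometheus/-adapter namespace

# Same substitutions as above, applied to a sample configMap header
result=$(sed \
  -e "s/name:.*custom-metrics$/name: $name/" \
  -e "s/namespace: default$/namespace: $prom_ns/" <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: chatqna-custom-metrics
  namespace: default
EOF
)
echo "$result"
```

Note that the second expression only matches the `default` namespace, so it needs adjusting if the OPEA chart was installed elsewhere.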

NOTE: if there are existing custom metric rules you need to retain, add them from saved
`adapter-config.yaml.bak` to `adapter-config.yaml` file now!

Overwrite current Prometheus-adapter configMap with generated one:

```console
$ kubectl delete -n $prom_ns cm/$name
$ kubectl apply -f adapter-config.yaml
```

Then restart the adapter, so that it will use the new config:

```console
$ selector=app.kubernetes.io/name=prometheus-adapter
$ kubectl -n $prom_ns delete $(kubectl -n $prom_ns get pod --selector $selector -o name)
```

## Verify

To verify that the `horizontalPodAutoscaler` options work, check both that metrics from the
inferencing services are available, and that HPA rules using custom metrics generated from them work.

(Object names depend on whether Prometheus was installed from manifests or with Helm,
and on the release name given for its Helm install.)

Check installed Prometheus service names:

```console
$ prom_ns=monitoring # Prometheus/-adapter namespace
$ kubectl -n $prom_ns get svc
```

Use service name matching your Prometheus installation:

```console
$ prom_svc=prometheus-stack-kube-prom-prometheus # Metrics service
```

Verify Prometheus found metric endpoints for chart services, i.e. last number on `curl` output is non-zero:

```console
$ chart=chatqna # OPEA chart release name
$ prom_url=http://$(kubectl -n $prom_ns get -o jsonpath="{.spec.clusterIP}:{.spec.ports[0].port}" svc/$prom_svc)
$ curl --no-progress-meter $prom_url/metrics | grep scrape_pool_targets.*$chart
```

**NOTE**: TGI and TEI inferencing services provide metrics endpoint only after they've processed
their first request, and reranking service will be used only after context data has been uploaded!
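
The "last number is non-zero" check on the `scrape_pool_targets` lines can also be scripted. This sketch runs it on illustrative sample lines (not captured from a real cluster, and the label values are assumptions); in practice the `curl` output from the command above would be piped in instead.

```shell
#!/bin/sh
chart=chatqna   # OPEA chart release name

# Illustrative sample of Prometheus /metrics output (assumed label values)
sample='prometheus_target_scrape_pool_targets{scrape_job="serviceMonitor/default/chatqna-tgi/0"} 1
prometheus_target_scrape_pool_targets{scrape_job="serviceMonitor/default/chatqna-tei/0"} 0'

# Count scrape pools of this chart whose target count (last field) is zero
missing=$(printf '%s\n' "$sample" \
  | grep "scrape_pool_targets.*$chart" \
  | awk '$NF == 0 { n++ } END { print n + 0 }')

echo "scrape pools without targets: $missing"
```

A non-zero count here means Prometheus knows about the serviceMonitor but has not found any endpoint for it yet.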

Check that both Prometheus metrics required from TGI are available:

```console
$ for m in sum count; do
    curl --no-progress-meter $prom_url/api/v1/query? \
      --data-urlencode query=tgi_request_inference_duration_$m'{service="'$chart'-tgi"}' | jq;
  done | grep __name__
```

Check that PrometheusAdapter lists the corresponding TGI and/or TEI custom metrics (prefixed with chart name):

```console
$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .resources[].name
```

And that HPA rules have TARGET values for HPA controlled service deployments (instead of `<unknown>`):

```console
$ ns=default # OPEA namespace
$ kubectl -n $ns get hpa
```
68 changes: 1 addition & 67 deletions helm-charts/README.md
@@ -9,10 +9,6 @@ This directory contains helm charts for [GenAIComps](https://github.com/opea-pro
- [Components](#components)
- [How to deploy with helm charts](#deploy-with-helm-charts)
- [Helm Charts Options](#helm-charts-options)
- [HorizontalPodAutoscaler (HPA) support](#horizontalpodautoscaler-hpa-support)
- [Pre-conditions](#pre-conditions)
- [Gotchas](#gotchas)
- [Verify HPA metrics](#verify-hpa-metrics)
- [Using Persistent Volume](#using-persistent-volume)
- [Using Private Docker Hub](#using-private-docker-hub)
- [Helm Charts repository](#helm-chart-repository)
@@ -66,71 +62,9 @@ There are global options(which should be shared across all components of a workl
| global | http_proxy https_proxy no_proxy | Proxy settings. If you are running the workloads behind the proxy, you'll have to add your proxy settings here. |
| global | modelUsePVC | The PersistentVolumeClaim you want to use as huggingface hub cache. Default "" means not using PVC. Only one of modelUsePVC/modelUseHostPath can be set. |
| global | modelUseHostPath | If you don't have Persistent Volume in your k8s cluster and want to use local directory as huggingface hub cache, set modelUseHostPath to your local directory name. Note that this can't share across nodes. Default "". Only one of modelUsePVC/modelUseHostPath can be set. |
| global | horizontalPodAutoscaler.enabled | Enable HPA autoscaling for TGI and TEI service deployments based on metrics they provide. See #pre-conditions and #gotchas before enabling! |
| chatqna | horizontalPodAutoscaler.enabled | Enable HPA autoscaling for TGI and TEI service deployments based on metrics they provide. See [Pre-conditions](HPA.md#pre-conditions) and [Gotchas](HPA.md#gotchas) before enabling! |
| tgi | LLM_MODEL_ID | The model id you want to use for tgi server. Default "Intel/neural-chat-7b-v3-3". |

## HorizontalPodAutoscaler (HPA) support

`horizontalPodAutoscaler` option enables HPA scaling for the TGI and TEI inferencing deployments:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

Autoscaling is based on custom application metrics provided through [Prometheus](https://prometheus.io/).

### Pre-conditions

If cluster does not run [Prometheus operator](https://github.com/prometheus-operator/kube-prometheus)
yet, it SHOULD be be installed before enabling HPA, e.g. by using:
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack

Enabling HPA in top-level Helm chart (e.g. `chatqna`), overwrites cluster's current _PrometheusAdapter_
configuration with relevant custom metric queries. If that has queries you wish to retain, _or_ HPA is
otherwise enabled only in TGI or TEI subchart(s), you need add relevat queries to _PrometheusAdapter_
configuration _manually_ (e.g. from `chatqna` custom metrics Helm template).

### Gotchas

Why HPA is opt-in:

- Enabling (top level) chart `horizontalPodAutoscaler` option will _overwrite_ cluster's current
`PrometheusAdapter` configuration with its own custom metrics configuration.
Take copy of the existing one before install, if that matters:
`kubectl -n monitoring get cm/adapter-config -o yaml > adapter-config.yaml`
- `PrometheusAdapter` needs to be restarted after install, for it to read the new configuration:
`ns=monitoring; kubectl -n $ns delete $(kubectl -n $ns get pod --selector app.kubernetes.io/name=prometheus-adapter -o name)`
- By default Prometheus adds [k8s RBAC rules](https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/prometheus-roleBindingSpecificNamespaces.yaml)
for accessing metrics from `default`, `kube-system` and `monitoring` namespaces. If Helm is
asked to install OPEA services to some other namespace, those rules need to be updated accordingly
- Current HPA rules are examples for Xeon, for efficient scaling they need to be fine-tuned for given setup
performance (underlying HW, used models and data types, OPEA version etc)

### Verify HPA metrics

To verify that metrics required by horizontalPodAutoscaler option work, check following...

Prometheus has found the metric endpoints, i.e. last number on `curl` output is non-zero:

```console
chart=chatqna; # OPEA services prefix
ns=monitoring; # Prometheus namespace
prom_url=http://$(kubectl -n $ns get -o jsonpath="{.spec.clusterIP}:{.spec.ports[0].port}" svc/prometheus-k8s);
curl --no-progress-meter $prom_url/metrics | grep scrape_pool_targets.*$chart
```

**NOTE**: TGI and TEI inferencing services provide metrics endpoint only after they've processed their first request!

PrometheusAdapter lists TGI and/or TGI custom metrics (`te_*` / `tgi_*`):

```console
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .resources[].name
```

HPA rules list valid (not `<unknown>`) TARGET values for service deployments:

```console
ns=default; # OPEA namespace
kubectl -n $ns get hpa
```

## Using Persistent Volume

It's common to use Persistent Volume(PV) for model caches(huggingface hub cache) in a production k8s cluster. We support to pass the PersistentVolumeClaim(PVC) to containers, but it's the user's responsibility to create the PVC depending on your k8s cluster's capability.
109 changes: 109 additions & 0 deletions helm-charts/chatqna/cpu-values.yaml
@@ -0,0 +1,109 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Override CPU resource request and probe timing values in specific subcharts
#
# RESOURCES
#
# Resource request matching actual resource usage (with enough slack)
# is important when service is scaled up, so that right amount of pods
# get scheduled to right nodes.
#
# Because resource usage depends on the used devices, model, data type
# and SW versions, and this top-level chart has overrides for them,
# resource requests need to be specified here too.
#
# To test service without resource request, use "resources: {}".
#
# PROBES
#
# Inferencing pods startup / warmup takes *much* longer on CPUs than
# with acceleration devices, and their responses are also slower,
# especially when node is running several instances of these services.
#
# Kubernetes restarting pod before its startup finishes, or not
# sending it queries because it's not in ready state due to slow
# readiness responses, does really NOT help in getting faster responses.
#
# => probe timings need to be increased when running on CPU.

tgi:
  # TODO: add Helm value also for TGI data type option:
  # https://github.com/opea-project/GenAIExamples/issues/330
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

  # Potentially suitable values for scaling CPU TGI 2.2 with Intel/neural-chat-7b-v3-3 @ 32-bit:
  resources:
    limits:
      cpu: 8
      memory: 70Gi
    requests:
      cpu: 6
      memory: 65Gi

  livenessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    failureThreshold: 24
    timeoutSeconds: 4
  readinessProbe:
    initialDelaySeconds: 16
    periodSeconds: 8
    timeoutSeconds: 4
  startupProbe:
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 180
    timeoutSeconds: 2

teirerank:
  RERANK_MODEL_ID: "BAAI/bge-reranker-base"

  # Potentially suitable values for scaling CPU TEI v1.5 with BAAI/bge-reranker-base model:
  resources:
    limits:
      cpu: 4
      memory: 30Gi
    requests:
      cpu: 2
      memory: 25Gi

  livenessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    failureThreshold: 24
    timeoutSeconds: 4
  readinessProbe:
    initialDelaySeconds: 8
    periodSeconds: 8
    timeoutSeconds: 4
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 120

tei:
  EMBEDDING_MODEL_ID: "BAAI/bge-base-en-v1.5"

  # Potentially suitable values for scaling CPU TEI 1.5 with BAAI/bge-base-en-v1.5 model:
  resources:
    limits:
      cpu: 4
      memory: 4Gi
    requests:
      cpu: 2
      memory: 3Gi

  livenessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 24
    timeoutSeconds: 2
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 2
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 120
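
The `startupProbe` values in this file bound how long a pod may take to start before Kubernetes restarts it. As a rough sketch, assuming the usual approximation of `initialDelaySeconds + periodSeconds * failureThreshold` for that budget (the `startup_budget` helper is hypothetical), the values work out as follows:

```shell
#!/bin/sh
# Approximate startup budget in seconds:
#   initialDelaySeconds + periodSeconds * failureThreshold
startup_budget() { echo $(( $1 + $2 * $3 )); }

# startupProbe values from cpu-values.yaml
tgi=$(startup_budget 10 5 180)
teirerank=$(startup_budget 5 5 120)
tei=$(startup_budget 5 5 120)

echo "tgi=${tgi}s teirerank=${teirerank}s tei=${tei}s"
```

So TGI gets roughly 15 minutes and the TEI services roughly 10 minutes to start, which may still need increasing for larger models or busier nodes.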