[ChatQnA] Support the replica tuning for ChatQnA (#116)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
opea-project · Sep 10, 2024 · 484b69a
1 parent cf8bd83

Showing 22 changed files with 3,117 additions and 0 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -10,6 +10,7 @@ repos:
         files: (.*\.(py|md|rst|yaml|yml|json|ts|js|html|svelte|sh))$
       - id: check-json
       - id: check-yaml
+        args: [--allow-multiple-documents]
       - id: debug-statements
       - id: requirements-txt-fixer
       - id: trailing-whitespace

diff --git a/evals/auto_tuning/README.md b/evals/auto_tuning/README.md
@@ -0,0 +1,151 @@
+# Auto-Tuning for ChatQnA: Optimizing Resource Allocation in Kubernetes
+
+This document describes the Auto-Tuning framework, a tool designed to streamline deployment strategies for resource-intensive services, particularly in ChatQnA environments. It leverages Kubernetes for container orchestration and combines experimental data with our prior knowledge to fine-tune deployments for optimal performance.
+
+## Key Features
+* Hardware Efficiency: Focuses on adjusting replica counts and maximizing the utilization of CPU and HPU (Habana Processing Unit) resources.
+
+* Theoretical and Experimental Optimization: Integrates theoretical best practices with our prior knowledge to ensure optimal resource allocation for services.
+
+## Usage
+
+To generate the strategy.json configuration file for deployment, use the following command:
+
+
+```bash
+# Kubernetes Deployment
+python3 tuning.py --tuning_config replica_tuning_config.json --hardware_info hardware_info_gaudi.json --service_info chatqna_neuralchat_rerank_latest.yaml
+
+# Note: Add --config_only to output deployment configs only.
+```
+
+## Configuration Files
+1. hardware_info_gaudi.json: Specifies the hardware details (CPU, HPU, etc.).
+
+2. chatqna_neuralchat_rerank_latest.yaml: Contains service deployment information.
+
+3. replica_tuning_config.json: Customizes tuning parameters for replica counts and granularity.
+
+### hardware_info_gaudi.json
+This file lists only the hardware devices to be used in deployment.
+
+```json
+{
+    "device_0": {
+        "ip": ["10.239.1.5", "10.239.10.6"],
+        "type": "hpu",
+        "sockets": 2,
+        "cores_per_socket": 64,
+        "num_cards": 8
+    }
+}
+```
+Please refer to `hardware_info_gaudi.json` for more details.
+
+### chatqna_neuralchat_rerank_latest.yaml
+This file includes all services that will be deployed.
+```yaml
+opea_micro_services:
+    data_prep:
+        ... ...
+    embedding:
+        ... ...
+
+    reranking:
+        ... ...
+
+    llm:
+        opea/llm-tgi:
+            tag: latest
+            type: cpu
+            dependency:
+                ghcr.io/huggingface/tgi-gaudi:
+                    tag: 2.0.4
+                    type: hpu
+                    requirements:
+                        model_id: "Intel/neural-chat-7b-v3-3"
+
+opea_mega_service:
+    opea/chatqna:
+        tag: latest
+        type: cpu
+```
+Please refer to `chatqna_neuralchat_rerank_latest.yaml` for more details.
+
+### Tuning Config Parameters
+
+`embedding_replicas_granularity = 1`: This defines the step size for scaling the number of replicas for the embedding server.
+* Value (1): Each scaling operation increases or decreases the number of replicas by 1 at a time.
+
+`embedding_replicas_min = 1`: This sets the minimum number of replicas allowed for the embedding server.
+* Value (1): The service will always have at least 1 replica running, ensuring that it is available for deployment.
+
+`embedding_replicas_max = 4`: This defines the maximum number of replicas allowed for the embedding server.
+* Value (4): The service can be scaled up to a maximum of 4 replicas, limiting resource consumption and avoiding over-provisioning.
+
+`microservice_replicas_granularity = 1`: This specifies the scaling step size for the other microservices (such as retrieval and dataprep).
+* Value (1): As with `embedding_replicas_granularity`, the replica count for these microservices changes by 1 at a time.
+
+`microservice_replicas_min = 1`: This parameter sets the minimum number of replicas for these microservices.
+* Value (1): Ensures that each microservice always has at least 1 replica running.
+
+`microservice_replicas_max = 4`: This defines the upper limit for scaling replicas for these microservices.
+* Value (4): The maximum number of replicas allowed for the microservices is 4.
+
+
+To override the default tuning parameters, create a replica_tuning_config.json file and pass it with `--tuning_config`. For example:
+
+```json
+{
+    "embedding_replicas_granularity": 1,
+    "embedding_replicas_min": 1,
+    "embedding_replicas_max": 4,
+
+    "microservice_replicas_granularity": 1,
+    "microservice_replicas_min": 1,
+    "microservice_replicas_max": 4
+}
+```
+Please refer to `replica_tuning_config.json` for more details.
+
+## Output
+
+The output of the auto-tuning process includes two key components:
+1. strategy_files: Contains optimized configurations for deploying services, such as replica counts and hardware resource allocations.
+
+2. K8S manifests: Provides the Kubernetes deployment specifications, including pod definitions and resource limits, ready for deployment.
+
+Example of a strategy file:
+```json
+{
+    "embedding-dependency": {
+        "type": "cpu",
+        "image": "ghcr.io/huggingface/text-embeddings-inference:cpu-1.5",
+        "model_id": "BAAI/bge-base-en-v1.5",
+        "replica": 1
+    },
+    "llm-microservice": {
+        "type": "cpu",
+        "image": "opea/llm-tgi:latest",
+        "replica": 4
+    },
+
+    ... ...
+    "reranking-dependency": {
+        "type": "hpu",
+        "image": "opea/tei-gaudi:latest",
+        "model_id": "BAAI/bge-reranker-base",
+        "replica": 1,
+        "cards": 1
+    },
+    "chatqna_mega_service": {
+        "image": "opea/chatqna:latest",
+        "type": "cpu",
+        "replica": 4
+    }
+}
+```
+
+Both the K8S manifests and strategy files are generated in the current directory, providing everything needed for deployment.
+
+To deploy, run `kubectl apply -f` on each newly generated `*_run.yaml` file and on `chatqna_config_map.yaml`.
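
Note on the README above: the `*_replicas_min`, `*_replicas_max`, and `*_replicas_granularity` parameters read most naturally as an inclusive range walked in fixed steps. `tuning.py` itself is not shown in this excerpt, so the following is only a minimal sketch of that reading, not the tool's actual search logic. The helper names `replica_candidates` and `candidate_grid` are hypothetical, and pairing the two ranges as a full grid is an assumption.

```python
import itertools
import json

# Example values copied from the README's replica_tuning_config.json.
CONFIG_JSON = """
{
    "embedding_replicas_granularity": 1,
    "embedding_replicas_min": 1,
    "embedding_replicas_max": 4,
    "microservice_replicas_granularity": 1,
    "microservice_replicas_min": 1,
    "microservice_replicas_max": 4
}
"""


def replica_candidates(config, prefix):
    """Return replica counts from min to max (inclusive) in steps of granularity."""
    lo = config[f"{prefix}_replicas_min"]
    hi = config[f"{prefix}_replicas_max"]
    step = config[f"{prefix}_replicas_granularity"]
    if step < 1 or lo < 1 or hi < lo:
        raise ValueError(f"invalid replica range for {prefix!r}")
    return list(range(lo, hi + 1, step))


def candidate_grid(config):
    """Pair every embedding replica count with every microservice replica count.

    Assumption: the two ranges are explored as a full grid.
    """
    return list(itertools.product(
        replica_candidates(config, "embedding"),
        replica_candidates(config, "microservice"),
    ))


config = json.loads(CONFIG_JSON)
grid = candidate_grid(config)
print(len(grid), grid[0], grid[-1])
```

With the default values this yields 16 candidate pairs, from `(1, 1)` to `(4, 4)`. Raising a granularity to 2 would thin the corresponding range to `[1, 3]`, which is the trade-off the granularity parameters appear to control: fewer candidates to evaluate at the cost of a coarser search.
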
diff --git a/evals/auto_tuning/baseline/chatqna_config_map.yaml b/evals/auto_tuning/baseline/chatqna_config_map.yaml
@@ -0,0 +1,23 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: qna-config
+  namespace: default
+data:
+  EMBEDDING_MODEL_ID: BAAI/bge-base-en-v1.5
+  RERANK_MODEL_ID: BAAI/bge-reranker-base
+  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
+  TEI_EMBEDDING_ENDPOINT: http://embedding-dependency-svc.default.svc.cluster.local:6006
+  TEI_RERANKING_ENDPOINT: http://reranking-dependency-svc.default.svc.cluster.local:8808
+  TGI_LLM_ENDPOINT: http://llm-dependency-svc.default.svc.cluster.local:9009
+  REDIS_URL: redis://vector-db.default.svc.cluster.local:6379
+  INDEX_NAME: rag-redis
+  HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
+  EMBEDDING_SERVICE_HOST_IP: embedding-svc
+  RETRIEVER_SERVICE_HOST_IP: retriever-svc
+  RERANK_SERVICE_HOST_IP: reranking-svc
+  NODE_SELECTOR: chatqna-opea
+  LLM_SERVICE_HOST_IP: llm-svc
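
Note on the ConfigMap above: each endpoint value encodes a Service name, a namespace, and a port, and these have to agree with the `Service` objects defined in the other baseline manifests (for example, `embedding-dependency-svc` exposes port 6006 in `embedding-dependency.yaml`). A small sketch of how those pieces could be pulled out for a manual cross-check follows. The helper `service_and_port` is a hypothetical name, and the sketch assumes the standard `<service>.<namespace>.svc.cluster.local` DNS form used in the values above.

```python
from urllib.parse import urlparse

# Endpoint values copied from chatqna_config_map.yaml above.
ENDPOINTS = {
    "TEI_EMBEDDING_ENDPOINT": "http://embedding-dependency-svc.default.svc.cluster.local:6006",
    "TEI_RERANKING_ENDPOINT": "http://reranking-dependency-svc.default.svc.cluster.local:8808",
    "TGI_LLM_ENDPOINT": "http://llm-dependency-svc.default.svc.cluster.local:9009",
    "REDIS_URL": "redis://vector-db.default.svc.cluster.local:6379",
}


def service_and_port(url):
    """Split a cluster-local URL into (service name, namespace, port)."""
    parsed = urlparse(url)
    # Assumes a hostname of the form <service>.<namespace>.svc.cluster.local
    service, namespace = parsed.hostname.split(".")[:2]
    return service, namespace, parsed.port


targets = {key: service_and_port(url) for key, url in ENDPOINTS.items()}
for key, target in targets.items():
    print(key, target)
```

Each resulting tuple can then be compared by eye against the `metadata.name` and `ports[].port` fields of the matching `Service` manifest.
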
diff --git a/evals/auto_tuning/baseline/chatqna_mega_service.yaml b/evals/auto_tuning/baseline/chatqna_mega_service.yaml
@@ -0,0 +1,55 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: chatqna-backend-server-deploy
+  namespace: default
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: chatqna-backend-server-deploy
+  template:
+    metadata:
+      annotations:
+        sidecar.istio.io/rewriteAppHTTPProbers: 'true'
+      labels:
+        app: chatqna-backend-server-deploy
+    spec:
+      nodeSelector:
+        node-type: chatqna-opea
+      topologySpreadConstraints:
+      - maxSkew: 1
+        topologyKey: kubernetes.io/hostname
+        whenUnsatisfiable: ScheduleAnyway
+        labelSelector:
+          matchLabels:
+            app: chatqna-backend-server-deploy
+      hostIPC: true
+      containers:
+      - envFrom:
+        - configMapRef:
+            name: qna-config
+        image: opea/chatqna:latest
+        imagePullPolicy: IfNotPresent
+        name: chatqna-backend-server-deploy
+        args: null
+        ports:
+        - containerPort: 8888
+      serviceAccountName: default
+---
+kind: Service
+apiVersion: v1
+metadata:
+  name: chatqna-backend-server-svc
+spec:
+  type: NodePort
+  selector:
+    app: chatqna-backend-server-deploy
+  ports:
+  - name: service
+    port: 8888
+    targetPort: 8888
+    nodePort: 30888
diff --git a/evals/auto_tuning/baseline/dataprep-microservice.yaml b/evals/auto_tuning/baseline/dataprep-microservice.yaml
@@ -0,0 +1,76 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: dataprep-deploy
+  namespace: default
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: dataprep-deploy
+  template:
+    metadata:
+      annotations:
+        sidecar.istio.io/rewriteAppHTTPProbers: 'true'
+      labels:
+        app: dataprep-deploy
+    spec:
+      nodeSelector:
+        node-type: chatqna-opea
+      topologySpreadConstraints:
+      - maxSkew: 1
+        topologyKey: kubernetes.io/hostname
+        whenUnsatisfiable: ScheduleAnyway
+        labelSelector:
+          matchLabels:
+            app: dataprep-deploy
+      hostIPC: true
+      containers:
+      - env:
+        - name: REDIS_URL
+          valueFrom:
+            configMapKeyRef:
+              name: qna-config
+              key: REDIS_URL
+        - name: TEI_ENDPOINT
+          valueFrom:
+            configMapKeyRef:
+              name: qna-config
+              key: TEI_EMBEDDING_ENDPOINT
+        - name: INDEX_NAME
+          valueFrom:
+            configMapKeyRef:
+              name: qna-config
+              key: INDEX_NAME
+        image: opea/dataprep-redis:latest
+        imagePullPolicy: IfNotPresent
+        name: dataprep-deploy
+        args: null
+        ports:
+        - containerPort: 6007
+        - containerPort: 6008
+        - containerPort: 6009
+      serviceAccountName: default
+---
+kind: Service
+apiVersion: v1
+metadata:
+  name: dataprep-svc
+spec:
+  type: ClusterIP
+  selector:
+    app: dataprep-deploy
+  ports:
+  - name: port1
+    port: 6007
+    targetPort: 6007
+  - name: port2
+    port: 6008
+    targetPort: 6008
+  - name: port3
+    port: 6009
+    targetPort: 6009
diff --git a/evals/auto_tuning/baseline/embedding-dependency.yaml b/evals/auto_tuning/baseline/embedding-dependency.yaml
@@ -0,0 +1,63 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: embedding-dependency-deploy
+  namespace: default
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: embedding-dependency-deploy
+  template:
+    metadata:
+      annotations:
+        sidecar.istio.io/rewriteAppHTTPProbers: 'true'
+      labels:
+        app: embedding-dependency-deploy
+    spec:
+      nodeSelector:
+        node-type: chatqna-opea
+      containers:
+      - envFrom:
+        - configMapRef:
+            name: qna-config
+        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
+        name: embedding-dependency-deploy
+        args:
+        - --model-id
+        - $(EMBEDDING_MODEL_ID)
+        - --auto-truncate
+        volumeMounts:
+        - mountPath: /data
+          name: model-volume
+        - mountPath: /dev/shm
+          name: shm
+        ports:
+        - containerPort: 80
+      serviceAccountName: default
+      volumes:
+      - name: model-volume
+        hostPath:
+          path: /mnt/models
+          type: Directory
+      - name: shm
+        emptyDir:
+          medium: Memory
+          sizeLimit: 1Gi
+---
+kind: Service
+apiVersion: v1
+metadata:
+  name: embedding-dependency-svc
+spec:
+  type: ClusterIP
+  selector:
+    app: embedding-dependency-deploy
+  ports:
+  - name: service
+    port: 6006
+    targetPort: 80