adding dataprep support for CLIP based models for VideoRAGQnA example for v1.0 (#621)

* dataprep service

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* dataprep updates

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* rearranged dirs

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added readme

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* removed checks

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added features

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added get method

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add dim at init, rm unused

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add wait after connect DB

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* Update comps/dataprep/vdms/README.md

Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add test script for mm case

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add return value and update readme

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check bug

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* fix mm-script

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add into dataprep workflow

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* rm whitespace

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* updated readme and added test script

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* removed unused file

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move test script

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* restructured repo

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates path in test script

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* add name for build

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

---------

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
Signed-off-by: BaoHuiling <huiling.bao@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: BaoHuiling <huiling.bao@intel.com>
Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
4 people authored Sep 11, 2024
1 parent 4165c7d commit f84d91a
Showing 20 changed files with 1,475 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/docker/compose/dataprep-compose-cd.yaml
@@ -23,3 +23,7 @@ services:
build:
dockerfile: comps/dataprep/pinecone/langchain/Dockerfile
image: ${REGISTRY:-opea}/dataprep-pinecone:${TAG:-latest}
dataprep-vdms:
build:
dockerfile: comps/dataprep/vdms/multimodal_langchain/docker/Dockerfile
image: ${REGISTRY:-opea}/dataprep-vdms:${TAG:-latest}
189 changes: 189 additions & 0 deletions comps/dataprep/vdms/README.md
@@ -0,0 +1,189 @@
# Dataprep Microservice with VDMS

For the dataprep microservice, we currently provide one framework: `Langchain`.

<!-- We also provide `Langchain_ray` which uses ray to parallel the data prep for multi-file performance improvement(observed 5x - 15x speedup by processing 1000 files/links.). -->

The folders are organized consistently, so you can use either framework for the dataprep microservice by following the instructions below.

# 🚀1. Start Microservice with Python (Option 1)

## 1.1 Install Requirements

Install the single-process version (for processing 1-10 files):

```bash
apt-get update
apt-get install -y default-jre tesseract-ocr libtesseract-dev poppler-utils
cd langchain
pip install -r requirements.txt
```

<!-- - option 2: Install multi-process version (for >10 files processing)
```bash
cd langchain_ray; pip install -r requirements_ray.txt
``` -->

## 1.2 Start VDMS Server

Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).

## 1.3 Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export COLLECTION_NAME=${your_collection_name}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="opea/gen-ai-comps:dataprep"
export PYTHONPATH=${path_to_comps}
```

## 1.4 Start Document Preparation Microservice for VDMS with Python Script

Start the document preparation microservice for VDMS with the command below.

Start the single-process version (for processing 1-10 files):

```bash
python prepare_doc_vdms.py
```

<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
python prepare_doc_redis_on_ray.py
``` -->

# 🚀2. Start Microservice with Docker (Option 2)

## 2.1 Start VDMS Server

Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).

## 2.2 Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export TEI_ENDPOINT=${your_tei_endpoint}
export COLLECTION_NAME=${your_collection_name}
export SEARCH_ENGINE="FaissFlat"
export DISTANCE_STRATEGY="L2"
export PYTHONPATH=${path_to_comps}
```

## 2.3 Build Docker Image

- Build the docker image with Langchain

Build the single-process version (for processing 1-10 files):

```bash
cd ../../../
docker build -t opea/dataprep-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain/Dockerfile .
```

<!-- - option 2: Build multi-process version (for >10 files processing)
```bash
cd ../../../../
docker build -t opea/dataprep-on-ray-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain_ray/Dockerfile .
``` -->

## 2.4 Run Docker with CLI

Start the single-process version (for processing 1-10 files):

```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TEI_ENDPOINT=$TEI_ENDPOINT \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
opea/dataprep-vdms:latest
```

<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
-e TIMEOUT_SECONDS=600 opea/dataprep-on-ray-vdms:latest
``` -->

# 🚀3. Check Microservice Status

```bash
docker container logs -f dataprep-vdms-server
```

# 🚀4. Consume Microservice

Once the document preparation microservice for VDMS is started, you can use the commands below to invoke the microservice, which converts documents to embeddings and saves them to the database.

Make sure the file path after `files=@` is correct.

- Single file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
http://localhost:6007/v1/dataprep
```

You can specify `chunk_size` and `chunk_overlap` with the following command.

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./LLAMA2_page6.pdf" \
-F "chunk_size=1500" \
-F "chunk_overlap=100" \
http://localhost:6007/v1/dataprep
```

- Multiple file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
-F "files=@./file2.txt" \
-F "files=@./file3.txt" \
http://localhost:6007/v1/dataprep
```

- Links upload (not currently supported for llama_index)

```bash
curl -X POST \
-F 'link_list=["https://www.ces.tech/"]' \
http://localhost:6007/v1/dataprep
```

or

```python
import requests
import json

proxies = {"http": ""}
url = "http://localhost:6007/v1/dataprep"
urls = [
"https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
payload = {"link_list": json.dumps(urls)}

try:
resp = requests.post(url=url, data=payload, proxies=proxies)
print(resp.text)
resp.raise_for_status() # Raise an exception for unsuccessful HTTP status codes
print("Request successful!")
except requests.exceptions.RequestException as e:
print("An error occurred:", e)
```
39 changes: 39 additions & 0 deletions comps/dataprep/vdms/langchain/Dockerfile
@@ -0,0 +1,39 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ENV LANG=C.UTF-8

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libcairo2-dev \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/vdms/langchain/requirements.txt

ENV PYTHONPATH=/home/user

USER root

RUN mkdir -p /home/user/comps/dataprep/vdms/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/vdms/langchain

USER user

WORKDIR /home/user/comps/dataprep/vdms/langchain

ENTRYPOINT ["python", "prepare_doc_vdms.py"]
2 changes: 2 additions & 0 deletions comps/dataprep/vdms/langchain/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
33 changes: 33 additions & 0 deletions comps/dataprep/vdms/langchain/config.py
@@ -0,0 +1,33 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os


def getEnv(key, default_value=None):
env_value = os.getenv(key, default=default_value)
print(f"{key}: {env_value}")
return env_value


# Embedding model
EMBED_MODEL = getEnv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

# VDMS configuration
VDMS_HOST = getEnv("VDMS_HOST", "localhost")
VDMS_PORT = int(getEnv("VDMS_PORT", 55555))
COLLECTION_NAME = getEnv("COLLECTION_NAME", "rag-vdms")
SEARCH_ENGINE = getEnv("SEARCH_ENGINE", "FaissFlat")
DISTANCE_STRATEGY = getEnv("DISTANCE_STRATEGY", "L2")

# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = getEnv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = getEnv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = getEnv("TEI_ENDPOINT")

# chunk parameters
CHUNK_SIZE = getEnv("CHUNK_SIZE", 1500)
CHUNK_OVERLAP = getEnv("CHUNK_OVERLAP", 100)

current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)
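The `getEnv` helper above reads an environment variable and falls back to a default when it is unset. A quick self-contained sketch of that behavior (the helper is reproduced so the snippet runs on its own):

```python
import os

def getEnv(key, default_value=None):
    # Same logic as config.py: os.getenv with a fallback default, logged on read.
    env_value = os.getenv(key, default=default_value)
    print(f"{key}: {env_value}")
    return env_value

os.environ["VDMS_PORT"] = "55556"        # an explicit setting overrides the default
os.environ.pop("COLLECTION_NAME", None)  # unset, so the default is returned
port = int(getEnv("VDMS_PORT", 55555))
name = getEnv("COLLECTION_NAME", "rag-vdms")
```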
28 changes: 28 additions & 0 deletions comps/dataprep/vdms/langchain/docker-compose-dataprep-vdms.yaml
@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3"
services:
vdms-vector-db:
image: intellabs/vdms:latest
container_name: vdms-vector-db
ports:
- "55555:55555"
dataprep-vdms:
image: opea/dataprep-vdms:latest
container_name: dataprep-vdms-server
ports:
- "6007:6007"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
VDMS_HOST: ${VDMS_HOST}
VDMS_PORT: ${VDMS_PORT}
COLLECTION_NAME: ${COLLECTION_NAME}
restart: unless-stopped

networks:
default:
driver: bridge