adding dataprep support for CLIP based models for VideoRAGQnA example for v1.0 (#621)

* dataprep service

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* dataprep updates

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* rearranged dirs

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added readme

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* removed checks

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added features

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added get method

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add dim at init, rm unused

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add wait after connect DB

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* Update comps/dataprep/vdms/README.md

Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add test script for mm case

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add return value and update readme

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check bug

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* fix mm-script

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add into dataprep workflow

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* rm whitespace

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* updated readme and added test script

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* removed unused file

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move test script

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* restructured repo

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates path in test script

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* add name for build

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

---------

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
Signed-off-by: BaoHuiling <huiling.bao@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: BaoHuiling <huiling.bao@intel.com>
Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
4 people authored Sep 11, 2024
1 parent 4165c7d commit f84d91a
Showing 20 changed files with 1,475 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/docker/compose/dataprep-compose-cd.yaml
@@ -23,3 +23,7 @@ services:
build:
dockerfile: comps/dataprep/pinecone/langchain/Dockerfile
image: ${REGISTRY:-opea}/dataprep-pinecone:${TAG:-latest}
dataprep-vdms:
build:
dockerfile: comps/dataprep/vdms/multimodal_langchain/docker/Dockerfile
image: ${REGISTRY:-opea}/dataprep-vdms:${TAG:-latest}
189 changes: 189 additions & 0 deletions comps/dataprep/vdms/README.md
@@ -0,0 +1,189 @@
# Dataprep Microservice with VDMS

For the dataprep microservice, we currently provide one framework: `Langchain`.

<!-- We also provide `Langchain_ray` which uses ray to parallel the data prep for multi-file performance improvement(observed 5x - 15x speedup by processing 1000 files/links.). -->

The folders are organized consistently, so you can use either framework for the dataprep microservice by following the instructions below.

# 🚀1. Start Microservice with Python (Option 1)

## 1.1 Install Requirements

Install the single-process version (for processing 1-10 files):

```bash
apt-get update
apt-get install -y default-jre tesseract-ocr libtesseract-dev poppler-utils
cd langchain
pip install -r requirements.txt
```

<!-- - option 2: Install multi-process version (for >10 files processing)
```bash
cd langchain_ray; pip install -r requirements_ray.txt
``` -->

## 1.2 Start VDMS Server

Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).

## 1.3 Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export COLLECTION_NAME=${your_collection_name}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="opea/gen-ai-comps:dataprep"
export PYTHONPATH=${path_to_comps}
```

## 1.4 Start Document Preparation Microservice for VDMS with Python Script

Start the document preparation microservice for VDMS with the command below.

Start the single-process version (for processing 1-10 files):

```bash
python prepare_doc_vdms.py
```

<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
python prepare_doc_redis_on_ray.py
``` -->

# 🚀2. Start Microservice with Docker (Option 2)

## 2.1 Start VDMS Server

Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).

## 2.2 Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export TEI_ENDPOINT=${your_tei_endpoint}
export COLLECTION_NAME=${your_collection_name}
export SEARCH_ENGINE="FaissFlat"
export DISTANCE_STRATEGY="L2"
export PYTHONPATH=${path_to_comps}
```

## 2.3 Build Docker Image

- Build the docker image with Langchain

Build the single-process version (for processing 1-10 files):

```bash
cd ../../../
docker build -t opea/dataprep-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain/Dockerfile .
```

<!-- - option 2: Build multi-process version (for >10 files processing)
```bash
cd ../../../../
docker build -t opea/dataprep-on-ray-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain_ray/Dockerfile .
``` -->

## 2.4 Run Docker with CLI

Start the single-process version (for processing 1-10 files):

```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TEI_ENDPOINT=$TEI_ENDPOINT \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
opea/dataprep-vdms:latest
```

<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
-e TIMEOUT_SECONDS=600 opea/dataprep-on-ray-vdms:latest
``` -->

# 🚀3. Check Microservice Status

```bash
docker container logs -f dataprep-vdms-server
```

# 🚀4. Consume Microservice

Once the document preparation microservice for VDMS is started, you can use the commands below to invoke the microservice, which converts documents to embeddings and saves them to the database.

Make sure the file path after `files=@` is correct.

- Single file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
http://localhost:6007/v1/dataprep
```

You can specify `chunk_size` and `chunk_overlap` with the following command.

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./LLAMA2_page6.pdf" \
-F "chunk_size=1500" \
-F "chunk_overlap=100" \
http://localhost:6007/v1/dataprep
```

- Multiple file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
-F "files=@./file2.txt" \
-F "files=@./file3.txt" \
http://localhost:6007/v1/dataprep
```

- Links upload (not currently supported for llama_index)

```bash
curl -X POST \
-F 'link_list=["https://www.ces.tech/"]' \
http://localhost:6007/v1/dataprep
```

or

```python
import requests
import json

proxies = {"http": ""}
url = "http://localhost:6007/v1/dataprep"
urls = [
"https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
payload = {"link_list": json.dumps(urls)}

try:
resp = requests.post(url=url, data=payload, proxies=proxies)
print(resp.text)
resp.raise_for_status() # Raise an exception for unsuccessful HTTP status codes
print("Request successful!")
except requests.exceptions.RequestException as e:
print("An error occurred:", e)
```
39 changes: 39 additions & 0 deletions comps/dataprep/vdms/langchain/Dockerfile
@@ -0,0 +1,39 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ENV LANG=C.UTF-8

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libcairo2-dev \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/vdms/langchain/requirements.txt

ENV PYTHONPATH=/home/user

USER root

RUN mkdir -p /home/user/comps/dataprep/vdms/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/vdms/langchain

USER user

WORKDIR /home/user/comps/dataprep/vdms/langchain

ENTRYPOINT ["python", "prepare_doc_vdms.py"]
2 changes: 2 additions & 0 deletions comps/dataprep/vdms/langchain/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
33 changes: 33 additions & 0 deletions comps/dataprep/vdms/langchain/config.py
@@ -0,0 +1,33 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os


def getEnv(key, default_value=None):
env_value = os.getenv(key, default=default_value)
print(f"{key}: {env_value}")
return env_value


# Embedding model
EMBED_MODEL = getEnv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

# VDMS configuration
VDMS_HOST = getEnv("VDMS_HOST", "localhost")
VDMS_PORT = int(getEnv("VDMS_PORT", 55555))
COLLECTION_NAME = getEnv("COLLECTION_NAME", "rag-vdms")
SEARCH_ENGINE = getEnv("SEARCH_ENGINE", "FaissFlat")
DISTANCE_STRATEGY = getEnv("DISTANCE_STRATEGY", "L2")

# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = getEnv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = getEnv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = getEnv("TEI_ENDPOINT")

# chunk parameters
CHUNK_SIZE = getEnv("CHUNK_SIZE", 1500)
CHUNK_OVERLAP = getEnv("CHUNK_OVERLAP", 100)

current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)
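The `getEnv` helper above reads an environment variable and falls back to a default when it is unset. A quick self-contained sketch of that behavior (the helper is reproduced so the snippet runs on its own):

```python
import os

def getEnv(key, default_value=None):
    # Same logic as config.py: os.getenv with a fallback default, logged on read.
    env_value = os.getenv(key, default=default_value)
    print(f"{key}: {env_value}")
    return env_value

os.environ["VDMS_PORT"] = "55556"        # an explicit setting overrides the default
os.environ.pop("COLLECTION_NAME", None)  # unset, so the default is returned
port = int(getEnv("VDMS_PORT", 55555))
name = getEnv("COLLECTION_NAME", "rag-vdms")
```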
28 changes: 28 additions & 0 deletions comps/dataprep/vdms/langchain/docker-compose-dataprep-vdms.yaml
@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3"
services:
vdms-vector-db:
image: intellabs/vdms:latest
container_name: vdms-vector-db
ports:
- "55555:55555"
dataprep-vdms:
image: opea/dataprep-vdms:latest
container_name: dataprep-vdms-server
ports:
- "6007:6007"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
VDMS_HOST: ${VDMS_HOST}
VDMS_PORT: ${VDMS_PORT}
COLLECTION_NAME: ${COLLECTION_NAME}
restart: unless-stopped

networks:
default:
driver: bridge