Skip to content

Commit

Permalink
Refactor milvus dataprep and retriever (#728)
Browse files Browse the repository at this point in the history
* milvus: Refactor embedding settings for mivlus dataprep and retriever

Milvus dataprep and retriever leverage the same embedding enpoints, but
the embedding-related code is somewhat messed up, unify the namings and
logic to improve code readability and facilitate user-friendly
configuration.

Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>

* MOSEC: Rename EMB_MODEL env as MOSEC_EMBEDDING_MODEL

Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>

* milvus/dataprep: Update README for milvus dataprep

Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>

* Add OCR package for Milvus dataprep

Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>

* Update Milvus dataprep test script

This is to fix the CI issue for MILVUS environment variable name is
update.

Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  • Loading branch information
bjzhjing and pre-commit-ci[bot] authored Oct 10, 2024
1 parent 2bbee5d commit 84374a5
Show file tree
Hide file tree
Showing 10 changed files with 72 additions and 100 deletions.
3 changes: 2 additions & 1 deletion comps/dataprep/milvus/langchain/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@ RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missin
build-essential \
default-jre \
libgl1-mesa-glx \
libjemalloc-dev
libjemalloc-dev \
tesseract-ocr

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
Expand Down
15 changes: 10 additions & 5 deletions comps/dataprep/milvus/langchain/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ Please refer to this [readme](../../../vectorstores/milvus/README.md).
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export MILVUS=${your_milvus_host_ip}
export MILVUS_HOST=${your_milvus_host_ip}
export MILVUS_PORT=19530
export COLLECTION_NAME=${your_collection_name}
export MOSEC_EMBEDDING_ENDPOINT=${your_embedding_endpoint}
Expand All @@ -47,7 +47,7 @@ Setup environment variables:

```bash
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
export MILVUS=${your_host_ip}
export MILVUS_HOST=${your_host_ip}
```

### 1.5 Start Document Preparation Microservice for Milvus with Python Script
Expand Down Expand Up @@ -78,19 +78,24 @@ docker build -t opea/dataprep-milvus:latest --build-arg https_proxy=$https_proxy

```bash
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
export MILVUS=${your_host_ip}
export MILVUS_HOST=${your_host_ip}
```

### 2.3 Run Docker with CLI (Option A)

```bash
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS=${MILVUS} opea/dataprep-milvus:latest
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS_HOST=${MILVUS_HOST} opea/dataprep-milvus:latest
```

### 2.4 Run with Docker Compose (Option B)

```bash
cd docker
mkdir model
cd model
git clone https://huggingface.co/BAAI/bge-base-en-v1.5
cd ../
# Update `host_ip` and `HUGGINGFACEHUB_API_TOKEN` in set_env.sh
. set_env.sh
docker compose -f docker-compose-dataprep-milvus.yaml up -d
```

Expand Down
12 changes: 6 additions & 6 deletions comps/dataprep/milvus/langchain/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,16 +3,16 @@

import os

# Embedding model
TEI_EMBEDDING_MODEL = os.getenv("EMBED_MODEL", "maidalun1020/bce-embedding-base_v1")
# Embedding endpoints
# Local Embedding model
LOCAL_EMBEDDING_MODEL = os.getenv("LOCAL_EMBEDDING_MODEL", "maidalun1020/bce-embedding-base_v1")
# TEI Embedding endpoints
TEI_EMBEDDING_ENDPOINT = os.getenv("TEI_EMBEDDING_ENDPOINT", "")
# MILVUS configuration
MILVUS_HOST = os.getenv("MILVUS", "localhost")
MILVUS_HOST = os.getenv("MILVUS_HOST", "localhost")
MILVUS_PORT = int(os.getenv("MILVUS_PORT", 19530))
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag_milvus")

MOSEC_EMBEDDING_MODEL = os.environ.get("MOSEC_EMBEDDING_MODEL", "/home/user/bce-embedding-base_v1")
# MOSEC configuration
MOSEC_EMBEDDING_MODEL = os.environ.get("MOSEC_EMBEDDING_MODEL", "/home/user/bge-large-zh-v1.5")
MOSEC_EMBEDDING_ENDPOINT = os.environ.get("MOSEC_EMBEDDING_ENDPOINT", "")
os.environ["OPENAI_API_BASE"] = MOSEC_EMBEDDING_ENDPOINT
os.environ["OPENAI_API_KEY"] = "Dummy key"
95 changes: 27 additions & 68 deletions comps/dataprep/milvus/langchain/prepare_doc_milvus.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,12 @@

from config import (
COLLECTION_NAME,
LOCAL_EMBEDDING_MODEL,
MILVUS_HOST,
MILVUS_PORT,
MOSEC_EMBEDDING_ENDPOINT,
MOSEC_EMBEDDING_MODEL,
TEI_EMBEDDING_ENDPOINT,
TEI_EMBEDDING_MODEL,
)
from fastapi import Body, File, Form, HTTPException, UploadFile
from langchain.text_splitter import RecursiveCharacterTextSplitter
Expand Down Expand Up @@ -73,7 +73,7 @@ def empty_embedding() -> List[float]:
return [e if e is not None else empty_embedding() for e in batched_embeddings]


def ingest_chunks_to_milvus(file_name: str, chunks: List, embedder):
def ingest_chunks_to_milvus(file_name: str, chunks: List):
if logflag:
logger.info(f"[ ingest chunks ] file name: {file_name}")

Expand All @@ -94,7 +94,7 @@ def ingest_chunks_to_milvus(file_name: str, chunks: List, embedder):
try:
_ = Milvus.from_documents(
batch_docs,
embedder,
embeddings,
collection_name=COLLECTION_NAME,
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
partition_key_field=partition_field_name,
Expand All @@ -110,7 +110,7 @@ def ingest_chunks_to_milvus(file_name: str, chunks: List, embedder):
return True


def ingest_data_to_milvus(doc_path: DocPath, embedder):
def ingest_data_to_milvus(doc_path: DocPath):
"""Ingest document to Milvus."""
path = doc_path.path
file_name = path.split("/")[-1]
Expand Down Expand Up @@ -151,7 +151,7 @@ def ingest_data_to_milvus(doc_path: DocPath, embedder):
if logflag:
logger.info(f"[ ingest data ] Done preprocessing. Created {len(chunks)} chunks of the original file.")

return ingest_chunks_to_milvus(file_name, chunks, embedder)
return ingest_chunks_to_milvus(file_name, chunks)


def search_by_file(collection, file_name):
Expand Down Expand Up @@ -210,28 +210,9 @@ async def ingest_documents(
if files and link_list:
raise HTTPException(status_code=400, detail="Provide either a file or a string list, not both.")

# Create vectorstore
if MOSEC_EMBEDDING_ENDPOINT:
# create embeddings using MOSEC endpoint service
if logflag:
logger.info(
f"[ upload ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}, MOSEC_EMBEDDING_MODEL:{MOSEC_EMBEDDING_MODEL}"
)
embedder = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
elif TEI_EMBEDDING_ENDPOINT:
# create embeddings using TEI endpoint service
if logflag:
logger.info(f"[ upload ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
embedder = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
else:
# create embeddings using local embedding model
if logflag:
logger.info(f"[ upload ] Local TEI_EMBEDDING_MODEL:{TEI_EMBEDDING_MODEL}")
embedder = HuggingFaceBgeEmbeddings(model_name=TEI_EMBEDDING_MODEL)

# define Milvus obj
my_milvus = Milvus(
embedding_function=embedder,
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
index_params=index_params,
Expand Down Expand Up @@ -274,7 +255,6 @@ async def ingest_documents(
process_table=process_table,
table_strategy=table_strategy,
),
embedder,
)
uploaded_files.append(save_path)
if logflag:
Expand All @@ -294,7 +274,6 @@ async def ingest_documents(
# process_table=process_table,
# table_strategy=table_strategy,
# ),
# embedder
# )

# try:
Expand Down Expand Up @@ -352,7 +331,6 @@ async def ingest_documents(
process_table=process_table,
table_strategy=table_strategy,
),
embedder,
)
if logflag:
logger.info(f"[ upload ] Successfully saved link list {link_list}")
Expand All @@ -368,28 +346,9 @@ async def rag_get_file_structure():
if logflag:
logger.info("[ get ] start to get file structure")

# Create vectorstore
if MOSEC_EMBEDDING_ENDPOINT:
# create embeddings using MOSEC endpoint service
if logflag:
logger.info(
f"[ get ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}, MOSEC_EMBEDDING_MODEL:{MOSEC_EMBEDDING_MODEL}"
)
embedder = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
elif TEI_EMBEDDING_ENDPOINT:
# create embeddings using TEI endpoint service
if logflag:
logger.info(f"[ get ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
embedder = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
else:
# create embeddings using local embedding model
if logflag:
logger.info(f"[ get ] Local TEI_EMBEDDING_MODEL:{TEI_EMBEDDING_MODEL}")
embedder = HuggingFaceBgeEmbeddings(model_name=TEI_EMBEDDING_MODEL)

# define Milvus obj
my_milvus = Milvus(
embedding_function=embedder,
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
index_params=index_params,
Expand Down Expand Up @@ -445,28 +404,9 @@ async def delete_single_file(file_path: str = Body(..., embed=True)):
if logflag:
logger.info(file_path)

# Create vectorstore
if MOSEC_EMBEDDING_ENDPOINT:
# create embeddings using MOSEC endpoint service
if logflag:
logger.info(
f"[ delete ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}, MOSEC_EMBEDDING_MODEL:{MOSEC_EMBEDDING_MODEL}"
)
embedder = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
elif TEI_EMBEDDING_ENDPOINT:
# create embeddings using TEI endpoint service
if logflag:
logger.info(f"[ delete ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
embedder = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
else:
# create embeddings using local embedding model
if logflag:
logger.info(f"[ delete ] Local TEI_EMBEDDING_MODEL:{TEI_EMBEDDING_MODEL}")
embedder = HuggingFaceBgeEmbeddings(model_name=TEI_EMBEDDING_MODEL)

# define Milvus obj
my_milvus = Milvus(
embedding_function=embedder,
embedding_function=embeddings,
collection_name=COLLECTION_NAME,
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
index_params=index_params,
Expand Down Expand Up @@ -533,4 +473,23 @@ async def delete_single_file(file_path: str = Body(..., embed=True)):
if __name__ == "__main__":
create_upload_folder(upload_folder)

# Create vectorstore
if MOSEC_EMBEDDING_ENDPOINT:
# create embeddings using MOSEC endpoint service
if logflag:
logger.info(
f"[ prepare_doc_milvus ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}, MOSEC_EMBEDDING_MODEL:{MOSEC_EMBEDDING_MODEL}"
)
embeddings = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
elif TEI_EMBEDDING_ENDPOINT:
# create embeddings using TEI endpoint service
if logflag:
logger.info(f"[ prepare_doc_milvus ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
embeddings = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
else:
# create embeddings using local embedding model
if logflag:
logger.info(f"[ prepare_doc_milvus ] LOCAL_EMBEDDING_MODEL:{LOCAL_EMBEDDING_MODEL}")
embeddings = HuggingFaceBgeEmbeddings(model_name=LOCAL_EMBEDDING_MODEL)

opea_microservices["opea_service@prepare_doc_milvus"].start()
2 changes: 1 addition & 1 deletion comps/embeddings/mosec/langchain/dependency/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ RUN pip3 install llmspec mosec

RUN cd /home/user/ && export HF_ENDPOINT=https://hf-mirror.com && huggingface-cli download --resume-download BAAI/bge-large-zh-v1.5 --local-dir /home/user/bge-large-zh-v1.5
USER user
ENV EMB_MODEL="/home/user/bge-large-zh-v1.5/"
ENV MOSEC_EMBEDDING_MODEL="/home/user/bge-large-zh-v1.5/"

WORKDIR /home/user/comps/embeddings/mosec/langchain/dependency

Expand Down
2 changes: 1 addition & 1 deletion comps/embeddings/mosec/langchain/dependency/server-ipex.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@

class Embedding(Worker):
def __init__(self):
self.model_name = os.environ.get("EMB_MODEL", DEFAULT_MODEL)
self.model_name = os.environ.get("MOSEC_EMBEDDING_MODEL", DEFAULT_MODEL)
self.tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_name)
self.model = transformers.AutoModel.from_pretrained(self.model_name)
self.device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
Expand Down
2 changes: 1 addition & 1 deletion comps/reranks/mosec/langchain/dependency/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ RUN pip3 install llmspec mosec

RUN cd /home/user/ && export HF_ENDPOINT=https://hf-mirror.com && huggingface-cli download --resume-download BAAI/bge-reranker-base --local-dir /home/user/bge-reranker-large
USER user
ENV EMB_MODEL="/home/user/bge-reranker-large/"
ENV MOSEC_EMBEDDING_MODEL="/home/user/bge-reranker-large/"

WORKDIR /home/user/comps/reranks/mosec/langchain/dependency

Expand Down
15 changes: 7 additions & 8 deletions comps/retrievers/milvus/langchain/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,17 +3,16 @@

import os

# Embedding model
EMBED_MODEL = os.getenv("EMBED_MODEL", "maidalun1020/bce-embedding-base_v1")
# Embedding endpoints
EMBED_ENDPOINT = os.getenv("TEI_EMBEDDING_ENDPOINT", "")
# Local Embedding model
LOCAL_EMBEDDING_MODEL = os.getenv("LOCAL_EMBEDDING_MODEL", "maidalun1020/bce-embedding-base_v1")
# TEI Embedding endpoints
TEI_EMBEDDING_ENDPOINT = os.getenv("TEI_EMBEDDING_ENDPOINT", "")
# MILVUS configuration
MILVUS_HOST = os.getenv("MILVUS", "localhost")
MILVUS_HOST = os.getenv("MILVUS_HOST", "localhost")
MILVUS_PORT = int(os.getenv("MILVUS_PORT", 19530))
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag_milvus")


# MOSEC configuration
MOSEC_EMBEDDING_MODEL = os.environ.get("MOSEC_EMBEDDING_MODEL", "/home/user/bce-embedding-base_v1")
MOSEC_EMBEDDING_ENDPOINT = os.environ.get("MOSEC_EMBEDDING_ENDPOINT", "")
os.environ["OPENAI_API_BASE"] = MOSEC_EMBEDDING_ENDPOINT
os.environ["OPENAI_API_KEY"] = "Dummy key"
MODEL_ID = "/home/user/bce-embedding-base_v1"
22 changes: 15 additions & 7 deletions comps/retrievers/milvus/langchain/retriever_milvus.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,14 @@

from config import (
COLLECTION_NAME,
EMBED_ENDPOINT,
EMBED_MODEL,
LOCAL_EMBEDDING_MODEL,
MILVUS_HOST,
MILVUS_PORT,
MODEL_ID,
MOSEC_EMBEDDING_ENDPOINT,
MOSEC_EMBEDDING_MODEL,
TEI_EMBEDDING_ENDPOINT,
)
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceHubEmbeddings, OpenAIEmbeddings
from langchain_milvus.vectorstores import Milvus

from comps import (
Expand Down Expand Up @@ -106,11 +106,19 @@ async def retrieve(input: EmbedDoc) -> SearchedDoc:
if __name__ == "__main__":
# Create vectorstore
if MOSEC_EMBEDDING_ENDPOINT:
# create embeddings using Mosec endpoint service
if logflag:
logger.info(f"[ retriever_milvus ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}")
embeddings = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
elif TEI_EMBEDDING_ENDPOINT:
# create embeddings using TEI endpoint service
# embeddings = HuggingFaceHubEmbeddings(model=EMBED_ENDPOINT)
embeddings = MosecEmbeddings(model=MODEL_ID)
if logflag:
logger.info(f"[ retriever_milvus ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
embeddings = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
else:
# create embeddings using local embedding model
embeddings = HuggingFaceBgeEmbeddings(model_name=EMBED_MODEL)
if logflag:
logger.info(f"[ retriever_milvus ] LOCAL_EMBEDDING_MODEL:{LOCAL_EMBEDDING_MODEL}")
embeddings = HuggingFaceBgeEmbeddings(model_name=LOCAL_EMBEDDING_MODEL)

opea_microservices["opea_service@retriever_milvus"].start()
4 changes: 2 additions & 2 deletions tests/dataprep/test_dataprep_milvus_langchain.sh
Original file line number Diff line number Diff line change
Expand Up @@ -46,8 +46,8 @@ function start_service() {

# start dataprep service
MOSEC_EMBEDDING_ENDPOINT="http://${ip_address}:${mosec_embedding_port}"
MILVUS=${ip_address}
docker run -d --name="test-comps-dataprep-milvus-server" -p ${dataprep_service_port}:6010 -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS=${MILVUS} -e LOGFLAG=true --ipc=host opea/dataprep-milvus:comps
MILVUS_HOST=${ip_address}
docker run -d --name="test-comps-dataprep-milvus-server" -p ${dataprep_service_port}:6010 -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS_HOST=${MILVUS_HOST} -e LOGFLAG=true --ipc=host opea/dataprep-milvus:comps
sleep 1m
}

Expand Down

0 comments on commit 84374a5

Please sign in to comment.