support top 10 embedding models on the huggingface leaderboard (#571)
* add supported models

Signed-off-by: XuhuiRen <xuhui.ren@intel.com>

* add doc

Signed-off-by: XuhuiRen <xuhui.ren@intel.com>

* polish doc

Signed-off-by: XuhuiRen <xuhui.ren@intel.com>

---------

Signed-off-by: XuhuiRen <xuhui.ren@intel.com>
Signed-off-by: XuhuiRen <44249229+XuhuiRen@users.noreply.github.com>
XuhuiRen authored Oct 29, 2023
1 parent 1447e6f commit 3b52e7b
Showing 3 changed files with 40 additions and 9 deletions.
@@ -11,6 +11,20 @@ The Neural Chat API offers an easy way to create and utilize chatbot models whil
1. Dense Retrieval: This method is based on document embeddings, enhancing the accuracy of retrieval. Learn more about it [here](https://medium.com/@aikho/deep-learning-in-information-retrieval-part-ii-dense-retrieval-1f9fecb47de9).
2. Sparse Retrieval: Using TF-IDF, this method efficiently retrieves relevant information. Explore this approach in detail [here](https://medium.com/itnext/deep-learning-in-information-retrieval-part-i-introduction-and-sparse-retrieval-12de0423a0b9). A minimal TF-IDF sketch follows this list.
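
To make the sparse style concrete, here is a minimal, self-contained sketch using scikit-learn's TF-IDF; it is illustrative only and is not the plugin's internal implementation:

```python
# Illustrative only: sparse retrieval scored with scikit-learn's TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Dense retrieval compares learned document embeddings.",
    "Sparse retrieval scores TF-IDF term overlap between query and documents.",
]
query = "How does TF-IDF retrieval work?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # one sparse vector per document
query_vector = vectorizer.transform([query])   # project the query into the same space
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(docs[scores.argmax()])                   # highest-scoring document is retrieved
```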

We provide support for a wide range of pre-trained embedding models featured on the [HuggingFace text embedding leaderboard](https://huggingface.co/spaces/mteb/leaderboard). Users can conveniently choose an embedding model in two ways: either specify the model by its HuggingFace name, or download a model and save it under the default name. Below is a list of some supported embedding models available in our plugin; users can select their preferred model based on factors such as model size, embedding dimensions, maximum sequence length, and average ranking score. A short selection sketch follows the table.
| Model | Model Size (GB) | Embedding Dimensions | Max Sequence Length | Average Ranking Score |
| :----: | :----: | :----: | :----: | :----: |
| [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 1.34 | 1024 | 512 | 64.23 |
| [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 0.44 | 768 | 512 | 63.55 |
| [gte-large](https://huggingface.co/thenlper/gte-large) | 0.67 | 1024 | 512 | 63.13 |
| [stella-base-en-v2](https://huggingface.co/infgrad/stella-base-en-v2) | 0.22 | 768 | 512 | 62.61 |
| [gte-base](https://huggingface.co/thenlper/gte-base) | 0.44 | 768 | 512 | 62.39 |
| [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) | 1.34 | 1024 | 512 | 62.25 |
| [instructor-xl](https://huggingface.co/hkunlp/instructor-xl) | 4.96 | 768 | 512 | 61.79 |
| [instructor-large](https://huggingface.co/hkunlp/instructor-large) | 1.34 | 768 | 512 | 61.59 |
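
For instance, a minimal sketch of loading a `bge` model by its HuggingFace name, mirroring the dense-retrieval setup added in the code changes below:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

# Same arguments as the plugin's "bge" branch in the diff below.
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
    query_instruction="Represent this sentence for searching relevant passages:",
)
vector = embeddings.embed_query("What is dense retrieval?")
print(len(vector))  # 768, matching the table's embedding dimensions
```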

In addition, our plugin seamlessly integrates Google's online PaLM 2 embedding model. To set up this feature, please follow the [Google official guideline](https://developers.generativeai.google/tutorials/embeddings_quickstart) to obtain your API key. Once you have your API key, you can activate the PaLM 2 embedding service by setting the `embedding_model` parameter to 'Google', as sketched below.
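
A minimal sketch, assuming the key is supplied through the `GOOGLE_API_KEY` environment variable that langchain's `GooglePalmEmbeddings` falls back to when no key is passed explicitly:

```python
import os
from langchain.embeddings import GooglePalmEmbeddings

# Assumption: the key obtained from the Google guideline is exported as GOOGLE_API_KEY.
os.environ["GOOGLE_API_KEY"] = "<your-api-key>"

# Inside the plugin, embedding_model="Google" routes to this same class
# (see the "Google" branch in the code changes below).
embeddings = GooglePalmEmbeddings()
vector = embeddings.embed_query("What is dense retrieval?")
```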

The workflow of this plugin consists of three main operations: document indexing, intent detection, and retrieval. The `Agent_QA` initializes itself using the provided `input_path` to construct a local database. During a conversation, the user's query is first passed to the `IntentDetector` to determine whether the user intends to engage in chitchat or seek answers to specific questions. If the `IntentDetector` determines that the user's query requires an answer, the retriever is activated to search the database using the user's query. The documents retrieved from the database serve as reference context in the input prompt, assisting in generating responses using the Large Language Models (LLMs).
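
Putting the workflow together, a hedged usage sketch based on the `Agent_QA` signature shown in the last diff hunk below (the import path is hypothetical and will differ in the actual package):

```python
# Hypothetical import path; adjust to the plugin's real module layout.
from retrieval_plugin import Agent_QA

agent = Agent_QA(
    input_path="./docs",                      # local folder (or single file) to index
    persist_dir="./output",                   # where the local database is persisted
    embedding_model="BAAI/bge-base-en-v1.5",  # the new default from this commit
    retrieval_type="dense",
    search_type="mmr",
    search_kwargs={"k": 1, "fetch_k": 5},
)
```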

# Usage
@@ -51,7 +65,7 @@
process [bool]: Select whether to split overly long documents into small chunks. Default to True.

input_path [str]: The user's local path to a file folder or a specific file. The code checks whether the path is a folder or a file: if it is a folder, all files in the folder are processed; if it is a file, only that single file is processed.

-embedding_model [str]: The user-specified document embedding model for dense retrieval. The user can select a specific embedding model from "https://huggingface.co/spaces/mteb/leaderboard". Default to "hkunlp/instructor-large".
+embedding_model [str]: The user-specified document embedding model for dense retrieval. The user can select a specific embedding model from "https://huggingface.co/spaces/mteb/leaderboard". Default to "BAAI/bge-base-en-v1.5".

max_length [int]: The max context length in the processed chunks. Should be combined with "process". Default to "512".

@@ -20,14 +20,15 @@
from haystack.document_stores import InMemoryDocumentStore, ElasticsearchDocumentStore
from langchain.vectorstores.chroma import Chroma
from langchain.docstore.document import Document
-from langchain.embeddings import HuggingFaceInstructEmbeddings
+from langchain.embeddings import HuggingFaceEmbeddings, HuggingFaceInstructEmbeddings, \
+    HuggingFaceBgeEmbeddings, GooglePalmEmbeddings
from haystack.schema import Document as SDocument
from .context_utils import load_unstructured_data, laod_structured_data, get_chuck_data


class DocumentIndexing:
    def __init__(self, retrieval_type="dense", document_store=None, persist_dir="./output",
-                process=True, embedding_model="hkunlp/instructor-large", max_length=512,
+                process=True, embedding_model="BAAI/bge-base-en-v1.5", max_length=512,
                 index_name=None):
"""
Wrapper for document indexing. Support dense and sparse indexing method.
Expand All @@ -36,10 +37,28 @@ def __init__(self, retrieval_type="dense", document_store=None, persist_dir="./o
        self.document_store = document_store
        self.process = process
        self.persist_dir = persist_dir
        self.embedding_model = embedding_model
        self.max_length = max_length
        self.index_name = index_name

+        # Choose the embedding backend from the model name; "Google" selects
+        # the online PaLM 2 service, everything else loads a local HF model.
+        try:
+            if "instruct" in embedding_model:
+                self.embeddings = HuggingFaceInstructEmbeddings(model_name=embedding_model)
+            elif "bge" in embedding_model:
+                self.embeddings = HuggingFaceBgeEmbeddings(
+                    model_name=embedding_model,
+                    encode_kwargs={"normalize_embeddings": True},
+                    query_instruction="Represent this sentence for searching relevant passages:")
+            elif embedding_model == "Google":
+                self.embeddings = GooglePalmEmbeddings()
+            else:
+                self.embeddings = HuggingFaceEmbeddings(
+                    model_name=embedding_model,
+                    encode_kwargs={"normalize_embeddings": True},
+                )
+        except Exception:
+            print("Please select a proper embedding model")

    def parse_document(self, input):
        """
@@ -83,8 +102,7 @@ def batch_parse_document(self, input):

    def load(self, input):
        if self.retrieval_type == "dense":
-            embedding = HuggingFaceInstructEmbeddings(model_name=self.embedding_model)
-            vectordb = Chroma(persist_directory=self.persist_dir, embedding_function=embedding)
+            vectordb = Chroma(persist_directory=self.persist_dir, embedding_function=self.embeddings)
        else:
            if self.document_store == "inmemory":
                vectordb = self.KB_construct(input)
Expand Down Expand Up @@ -114,8 +132,7 @@ def KB_construct(self, input):
            new_doc = Document(page_content=data, metadata=metadata)
            documents.append(new_doc)
        assert documents != [], "The given file/files cannot be loaded."
-        embedding = HuggingFaceInstructEmbeddings(model_name=self.embedding_model)
-        vectordb = Chroma.from_documents(documents=documents, embedding=embedding,
+        vectordb = Chroma.from_documents(documents=documents, embedding=self.embeddings,
                                         persist_directory=self.persist_dir)
        vectordb.persist()
        print("The local knowledge base has been successfully built!")
@@ -24,7 +24,7 @@

class Agent_QA():
    def __init__(self, persist_dir="./output", process=True, input_path=None,
-                embedding_model="hkunlp/instructor-large", max_length=2048, retrieval_type="dense",
+                embedding_model="BAAI/bge-base-en-v1.5", max_length=2048, retrieval_type="dense",
                 document_store=None, top_k=1, search_type="mmr", search_kwargs={"k": 1, "fetch_k": 5},
                 append=True, index_name="elastic_index_1", append_path=None,
                 response_template="Please reformat your query to regenerate the answer.",
