Add a vectordb module #2263

Merged
merged 42 commits into from
Apr 10, 2024
Changes from 30 commits
Commits
e3e6783
Added vectordb base and chromadb
thinkall Apr 2, 2024
5c0770a
Remove timer and unused functions
thinkall Apr 3, 2024
901ac1a
Added filter by distance
thinkall Apr 3, 2024
5af4a0c
Added test utils
thinkall Apr 3, 2024
1d9984e
Fix format
thinkall Apr 3, 2024
df6c549
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 3, 2024
7793a06
Fix type hint of dict
thinkall Apr 3, 2024
99e913a
Merge branch 'refactor_abstract_vectordb' of github.com:microsoft/aut…
thinkall Apr 3, 2024
cac4804
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 3, 2024
cf7fcda
Rename test
thinkall Apr 3, 2024
7d32490
Add test chromadb
thinkall Apr 3, 2024
b2a80b5
Fix test no chromadb
thinkall Apr 3, 2024
c2dce1d
Add coverage
thinkall Apr 3, 2024
dddf1ec
Don't skip test vectordb utils
thinkall Apr 3, 2024
cfc504d
Merge remote-tracking branch 'origin/main' into refactor_abstract_vec…
thinkall Apr 4, 2024
95df200
Add types
thinkall Apr 4, 2024
83ed9d0
Fix tests
thinkall Apr 4, 2024
41c4c44
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 4, 2024
b50d93c
Fix docs build error
thinkall Apr 4, 2024
5e37f4a
Add types to base
thinkall Apr 4, 2024
9bd612a
Update base
thinkall Apr 4, 2024
bc61162
Update utils
thinkall Apr 4, 2024
e9ece8a
Update chromadb
thinkall Apr 5, 2024
1db36d6
Add get_docs_by_ids
thinkall Apr 5, 2024
9609636
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 5, 2024
5c0cb43
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 5, 2024
ccb9494
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 5, 2024
30e2a98
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 5, 2024
402d755
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 5, 2024
5a44f8b
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 6, 2024
411a1db
Improve docstring
thinkall Apr 6, 2024
a78d572
Add get all docs
thinkall Apr 7, 2024
4b644c3
Move chroma_results_to_query_results to utils
thinkall Apr 7, 2024
aed255e
Improve type hints
thinkall Apr 7, 2024
9cdc97b
Update logger
thinkall Apr 7, 2024
05d34a4
Update init, add embedding func
thinkall Apr 7, 2024
283a16c
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 8, 2024
100febb
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 8, 2024
ac5224d
Improve docstring of vectordb, add two attributes
thinkall Apr 9, 2024
11100ec
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 9, 2024
2518f3d
Merge branch 'main' into refactor_abstract_vectordb
thinkall Apr 10, 2024
6a126c8
Improve test workflow
thinkall Apr 10, 2024
2 changes: 1 addition & 1 deletion .github/workflows/contrib-openai.yml
@@ -53,7 +53,7 @@ jobs:
           AZURE_OPENAI_API_BASE: ${{ secrets.AZURE_OPENAI_API_BASE }}
           OAI_CONFIG_LIST: ${{ secrets.OAI_CONFIG_LIST }}
         run: |
-          coverage run -a -m pytest test/agentchat/contrib/test_retrievechat.py test/agentchat/contrib/test_qdrant_retrievechat.py
+          coverage run -a -m pytest test/agentchat/contrib/test_retrievechat.py test/agentchat/contrib/test_qdrant_retrievechat.py test/agentchat/contrib/vectordb
           coverage xml
       - name: Upload coverage to Codecov
         uses: codecov/codecov-action@v3
4 changes: 2 additions & 2 deletions .github/workflows/contrib-tests.yml
@@ -60,11 +60,11 @@ jobs:
           fi
       - name: Test RetrieveChat
         run: |
-          pytest test/test_retrieve_utils.py test/agentchat/contrib/test_retrievechat.py test/agentchat/contrib/test_qdrant_retrievechat.py --skip-openai
+          pytest test/test_retrieve_utils.py test/agentchat/contrib/test_retrievechat.py test/agentchat/contrib/test_qdrant_retrievechat.py test/agentchat/contrib/vectordb --skip-openai
       - name: Coverage
         run: |
           pip install coverage>=5.3
-          coverage run -a -m pytest test/test_retrieve_utils.py test/agentchat/contrib/test_retrievechat.py test/agentchat/contrib/test_qdrant_retrievechat.py --skip-openai
+          coverage run -a -m pytest test/test_retrieve_utils.py test/agentchat/contrib/test_retrievechat.py test/agentchat/contrib/test_qdrant_retrievechat.py test/agentchat/contrib/vectordb --skip-openai
           coverage xml
       - name: Upload coverage to Codecov
         uses: codecov/codecov-action@v3
Empty file.
190 changes: 190 additions & 0 deletions autogen/agentchat/contrib/vectordb/base.py
@@ -0,0 +1,190 @@
from typing import Any, List, Mapping, Optional, Protocol, Sequence, Tuple, TypedDict, Union, runtime_checkable

Metadata = Union[Mapping[str, Any], None]
Vector = Union[Sequence[float], Sequence[int]]
ItemID = Union[str, int] # chromadb doesn't support int ids, VikingDB does


class Document(TypedDict):
    """A Document is a record in the vector database.

    id: ItemID | the unique identifier of the document.
    content: str | the text content of the chunk.
    metadata: Metadata, Optional | contains additional information about the document such as source, date, etc.
    embedding: Vector, Optional | the vector representation of the content.
    """

    id: ItemID
    content: str
    metadata: Optional[Metadata]
    embedding: Optional[Vector]


"""QueryResults is the response from the vector database for one or more queries.
A single query is passed as a list containing one string; multiple queries are passed as a list of strings.
The response is a list with one entry per query; each entry is a list of tuples containing the document and the distance.
"""
QueryResults = List[List[Tuple[Document, float]]]


@runtime_checkable
class VectorDB(Protocol):
    """
    Abstract class for vector database. A vector database is responsible for storing and retrieving documents.
    """

    def create_collection(self, collection_name: str, overwrite: bool = False, get_or_create: bool = True) -> Any:
        """
        Create a collection in the vector database.
        Case 1. if the collection does not exist, create the collection.
        Case 2. the collection exists, if overwrite is True, it will overwrite the collection.
        Case 3. the collection exists and overwrite is False, if get_or_create is True, it will get the collection,
            otherwise it raises a ValueError.

        Args:
            collection_name: str | The name of the collection.
            overwrite: bool | Whether to overwrite the collection if it exists. Default is False.
            get_or_create: bool | Whether to get the collection if it exists. Default is True.

        Returns:
            Any | The collection object.
        """
        ...

    def get_collection(self, collection_name: str = None) -> Any:
        """
        Get the collection from the vector database.

        Args:
            collection_name: str | The name of the collection. Default is None. If None, return the
                current active collection.

        Returns:
            Any | The collection object.
        """
        ...

    def delete_collection(self, collection_name: str) -> Any:
        """
        Delete the collection from the vector database.

        Args:
            collection_name: str | The name of the collection.

        Returns:
            Any
        """
        ...

    def insert_docs(self, docs: List[Document], collection_name: str = None, upsert: bool = False, **kwargs) -> None:
        """
        Insert documents into the collection of the vector database.

        Args:
            docs: List[Document] | A list of documents. Each document is a TypedDict `Document`.
            collection_name: str | The name of the collection. Default is None.
            upsert: bool | Whether to update the document if it exists. Default is False.
            kwargs: Dict | Additional keyword arguments.

        Returns:
            None
        """
        ...

    def update_docs(self, docs: List[Document], collection_name: str = None, **kwargs) -> None:
        """
        Update documents in the collection of the vector database.

        Args:
            docs: List[Document] | A list of documents.
            collection_name: str | The name of the collection. Default is None.
            kwargs: Dict | Additional keyword arguments.

        Returns:
            None
        """
        ...

    def delete_docs(self, ids: List[ItemID], collection_name: str = None, **kwargs) -> None:
        """
        Delete documents from the collection of the vector database.

        Args:
            ids: List[ItemID] | A list of document ids. Each id is of type `ItemID`.
            collection_name: str | The name of the collection. Default is None.
            kwargs: Dict | Additional keyword arguments.

        Returns:
            None
        """
        ...

    def retrieve_docs(
        self,
        queries: List[str],
        collection_name: str = None,
        n_results: int = 10,
        distance_threshold: float = -1,
        **kwargs,
    ) -> QueryResults:
        """
        Retrieve documents from the collection of the vector database based on the queries.

        Args:
            queries: List[str] | A list of queries. Each query is a string.
            collection_name: str | The name of the collection. Default is None.
            n_results: int | The number of relevant documents to return. Default is 10.
            distance_threshold: float | The threshold for the distance score. Only documents with a distance
                smaller than the threshold are returned. No filtering is applied if it is < 0. Default is -1.
            kwargs: Dict | Additional keyword arguments.

        Returns:
            QueryResults | The query results, with one entry per query. Each entry is a list of tuples
                containing the document and the distance.
        """
        ...

    def get_docs_by_ids(self, ids: List[ItemID], collection_name: str = None, include=None, **kwargs) -> List[Document]:
        """
        Retrieve documents from the collection of the vector database based on the ids.

        Args:
            ids: List[ItemID] | A list of document ids.
            collection_name: str | The name of the collection. Default is None.
            include: List[str] | The fields to include. Default is None.
                If None, ["metadatas", "documents"] are included. Ids are always included.
            kwargs: Dict | Additional keyword arguments.

        Returns:
            List[Document] | The results.
        """
        ...


class VectorDBFactory:
    """
    Factory class for creating vector databases.
    """

    PREDEFINED_VECTOR_DB = ["chroma"]

    @staticmethod
    def create_vector_db(db_type: str, **kwargs) -> VectorDB:
        """
        Create a vector database.

        Args:
            db_type: str | The type of the vector database.
            kwargs: Dict | The keyword arguments for initializing the vector database.

        Returns:
            VectorDB | The vector database.
        """
        if db_type.lower() in ["chroma", "chromadb"]:
            from .chromadb import ChromaVectorDB

            return ChromaVectorDB(**kwargs)
        else:
            raise ValueError(
                f"Unsupported vector database type: {db_type}. Valid types are {VectorDBFactory.PREDEFINED_VECTOR_DB}."
            )
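
The three cases in the `create_collection` docstring may be easier to follow as code. The following is a minimal sketch, not the `ChromaVectorDB` implementation from this PR: `InMemoryDB` is a hypothetical toy backend that keeps each collection as a plain list in a dict, and the error message is made up for illustration.

```python
from typing import Any, Dict, List


class InMemoryDB:
    """Hypothetical toy backend; each collection is a plain list kept in a dict."""

    def __init__(self) -> None:
        self.collections: Dict[str, List[Any]] = {}

    def create_collection(self, collection_name: str, overwrite: bool = False, get_or_create: bool = True) -> Any:
        if collection_name not in self.collections:
            # Case 1: the collection does not exist, so create it.
            self.collections[collection_name] = []
        elif overwrite:
            # Case 2: the collection exists and overwrite is True, so replace it.
            self.collections[collection_name] = []
        elif not get_or_create:
            # Case 3, second half: it exists, overwrite is False, get_or_create is False.
            raise ValueError(f"Collection {collection_name} already exists.")
        # Case 3, first half (and the end of cases 1 and 2): return the collection.
        return self.collections[collection_name]


db = InMemoryDB()
first = db.create_collection("docs")                  # case 1: created
first.append("hello")
same = db.create_collection("docs")                   # case 3: existing collection returned
fresh = db.create_collection("docs", overwrite=True)  # case 2: replaced with an empty one
```

With the defaults (`overwrite=False`, `get_or_create=True`), a repeated call is idempotent and returns the existing collection. A caller that wants a hard failure on name clashes would pass `get_or_create=False`.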
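
To illustrate the `QueryResults` shape and the `distance_threshold` semantics documented for `retrieve_docs`, here is a small sketch. It assumes local stand-ins for the type aliases so it runs without autogen installed, and `filter_by_distance` is a hypothetical helper name chosen for this example; it is not taken from the diff shown above.

```python
from typing import Any, Dict, List, Tuple

# Local stand-ins for the aliases in base.py, so the sketch runs without autogen.
Document = Dict[str, Any]
QueryResults = List[List[Tuple[Document, float]]]


def filter_by_distance(results: QueryResults, distance_threshold: float = -1) -> QueryResults:
    """Hypothetical helper: keep only hits whose distance is below the threshold."""
    if distance_threshold < 0:
        # A negative threshold means "do not filter", matching the docstring.
        return results
    return [[(doc, dist) for doc, dist in hits if dist < distance_threshold] for hits in results]


# One inner list per query; each hit is a (document, distance) tuple.
results: QueryResults = [
    [({"id": "1", "content": "apple"}, 0.1), ({"id": "2", "content": "pear"}, 0.7)],
    [({"id": "3", "content": "car"}, 0.4)],
]
filtered = filter_by_distance(results, distance_threshold=0.5)
unfiltered = filter_by_distance(results)
```

The outer list always keeps one entry per query, even if every hit for a query is filtered out, so callers can still match results to queries by position.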