Update retrieve_utils.py added lancedb as vectordb #25

akashAD98 · 2023-09-27T09:32:12Z

Why are these changes needed?

I want to use LanceDB as the vector database, so I have added code for it. I know it can be improved, for example by passing arguments or making it more user-friendly, so I'm looking for suggestions on how we can add different vector stores. Thanks.

Related issue number

Checks

I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

thinkall · 2023-09-28T13:45:59Z

Thank you @akashAD98 for the PR. I suggest we keep the current APIs and add a parameter vector_database to choose between different vector databases.

def create_vector_db_from_dir(
    dir_path: str,
    max_tokens: int = 4000,
    client: API = None,
    db_path: str = "/tmp/chromadb.db",
    collection_name: str = "all-my-documents",
    get_or_create: bool = False,
    chunk_mode: str = "multi_lines",
    must_break_at_empty_line: bool = True,
    embedding_model: str = "all-MiniLM-L6-v2",
    vector_database: str = "chromadb",
):

def query_vector_db(
    query_texts: List[str],
    n_results: int = 10,
    client: API = None,
    db_path: str = "/tmp/chromadb.db",
    collection_name: str = "all-my-documents",
    search_string: str = "",
    embedding_model: str = "all-MiniLM-L6-v2",
    vector_database: str = "chromadb",
) -> Dict[str, List[str]]:

We can define different functions to use different vector databases and call them in the current APIs.

def create_chromadb_from_dir():

def query_chromadb():

def create_lancedb_from_dir():

def query_lancedb():
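The dispatch described above could look roughly like the following. This is only a minimal sketch: the backend functions are stand-in stubs that return placeholder data rather than calling chromadb or lancedb, and the `_QUERY_BACKENDS` registry is a hypothetical name that does not come from the PR.

```python
from typing import Callable, Dict, List

QueryResult = Dict[str, List[List[str]]]


def query_chromadb(query_texts: List[str], n_results: int = 10) -> QueryResult:
    # Stub: a real implementation would query a chromadb collection here.
    return {
        "ids": [["chroma-stub"] for _ in query_texts],
        "documents": [["chromadb placeholder"] for _ in query_texts],
    }


def query_lancedb(query_texts: List[str], n_results: int = 10) -> QueryResult:
    # Stub: a real implementation would search a lancedb table here.
    return {
        "ids": [["lance-stub"] for _ in query_texts],
        "documents": [["lancedb placeholder"] for _ in query_texts],
    }


# Hypothetical registry mapping the vector_database value to a backend function.
_QUERY_BACKENDS: Dict[str, Callable[..., QueryResult]] = {
    "chromadb": query_chromadb,
    "lancedb": query_lancedb,
}


def query_vector_db(
    query_texts: List[str],
    n_results: int = 10,
    vector_database: str = "chromadb",
) -> QueryResult:
    # Keep the public API unchanged and pick the backend by name.
    try:
        backend = _QUERY_BACKENDS[vector_database]
    except KeyError:
        supported = ", ".join(sorted(_QUERY_BACKENDS))
        raise ValueError(
            f"Unsupported vector_database {vector_database!r}; expected one of: {supported}"
        ) from None
    return backend(query_texts, n_results=n_results)
```

With this layout, adding another store only means writing one more backend function and registering it; callers that never pass vector_database keep the chromadb behaviour.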

Could you please also update the tests for the new util functions?

Thank you very much again for your contribution.

thinkall

Comments left in the last reply.

codecov-commenter · 2023-09-28T13:50:11Z

Codecov Report

Merging #25 (9978f60) into main (dc70b80) will decrease coverage by 3.49%.
The diff coverage is 17.94%.

@@            Coverage Diff             @@
##             main      #25      +/-   ##
==========================================
- Coverage   39.98%   36.49%   -3.49%     
==========================================
  Files          17       16       -1     
  Lines        2036     2066      +30     
  Branches      453      458       +5     
==========================================
- Hits          814      754      -60     
- Misses       1149     1242      +93     
+ Partials       73       70       -3

Flag	Coverage Δ
unittests	`36.49% <17.94%> (-3.39%)`	⬇️

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
...gen/agentchat/contrib/retrieve_user_proxy_agent.py	`4.39% <0.00%> (-0.03%)`	⬇️
autogen/retrieve_utils.py	`48.52% <18.42%> (-15.17%)`	⬇️

... and 4 files with indirect coverage changes

akashAD98 · 2023-09-28T17:29:46Z

@thinkall Yes, thanks for the reply. I'll add this.

akashAD98 · 2023-10-03T09:31:57Z

@thinkall As per your suggestion, I made the changes. Can you please review them? Let me know if there is anything I need to do or modify. Thanks.

thinkall

Thank you so much, @akashAD98 , nice job! I've left some comments, could you please address them?

Could you also add some tests in autogen/test/test_retrieve_utils.py to cover the new functions?

Thank you again for your contribution! Let me know if you need any help.

autogen/retrieve_utils.py

thinkall · 2023-10-03T13:52:10Z

The code format check failed. Could you please run pre-commit install in the root folder of your local repo? That will enable automatic formatting for your code changes.

akashAD98 · 2023-10-03T16:33:45Z

> The code format check failed. Could you please run pre-commit install in the root folder of your local repo? That will enable automatic formatting for your code changes.

Yes, I'm working on it. Thank you so much for your guidance.

akashAD98 · 2023-10-09T17:12:19Z

> Hi @akashAD98, thank you very much for the updates. I left some comments for the code.
>
> The tests are failing. Could you update your code to make sure the tests pass, and also add a test for lancedb?

Sorry, I missed your message. Yes, I'm working on it.

thinkall · 2023-10-10T01:05:51Z

> > Hi @akashAD98, thank you very much for the updates. I left some comments for the code.
> > The tests are failing. Could you update your code to make sure the tests pass, and also add a test for lancedb?
>
> Sorry, I missed your message. Yes, I'm working on it.

Hi @akashAD98, I'm thinking about a more general way of supporting different vector dbs in PR #161. Could you check if this works for your use case? Thank you very much!

thinkall · 2023-10-10T01:09:25Z

I see your reply in #161. Let me try adding your case as a test in my PR.

akashAD98 · 2023-10-10T06:02:18Z

> I see your reply in #161. Let me try adding your case as a test in my PR.

Hi, have you tested with LanceDB as the vector database for the test use case?

Here it is, in 8cb1bcd, but it is failing?

thinkall · 2023-10-10T13:17:46Z

> > I see your reply in #161. Let me try adding your case as a test in my PR.
>
> Hi, have you tested with LanceDB as the vector database for the test use case?
>
> Here it is, in 8cb1bcd, but it is failing?

Hi @akashAD98, check the example here:

autogen/test/test_retrieve_utils.py

Line 103 in fa6e2a5

def test_custom_vector_db(self):

akashAD98 · 2023-10-24T08:44:04Z

@thinkall I'm running the code in Colab. I have defined the class

LancedbRetrieveUserProxyAgent

from typing import Callable, Dict, List, Optional

from overrides import override
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
from autogen.retrieve_utils import get_files_from_dir, split_files_to_chunks
import logging

logger = logging.getLogger(__name__)

try:
    import lancedb
except ImportError as e:
    logging.fatal("lancedb is not installed. Try running 'pip install lancedb'")
    raise e        



db_path = "/tmp/lancedb"

def create_lancedb():

    db = lancedb.connect(db_path)
    data = [
                {"vector": [1.1, 1.2], "id": 11, "documents": "This is a test document spark"},
                {"vector": [0.2, 1.8], "id": 22, "documents": "This is another test document"},
                {"vector": [0.1, 0.3], "id": 3, "documents": "This is a third test document spark"},
                {"vector": [0.5, 0.7], "id": 44, "documents": "This is a fourth test document"},
                {"vector": [2.1, 1.3], "id": 55, "documents": "This is a fifth test document spark"},
                {"vector": [5.1, 8.3], "id": 66, "documents": "This is a sixth test document"},
            ]
    try:
        db.create_table("my_table", data)
    except OSError:
        pass

class LancedbRetrieveUserProxyAgent(RetrieveUserProxyAgent):
    def query_vector_db(
        self,
        query_texts,
        n_results=10,
        search_string="",):

        
        if query_texts:
            vector = [0.1, 0.3]
        db = lancedb.connect(db_path)
        table = db.open_table("my_table")
        query = table.search(vector).where(f"documents LIKE '%{search_string}%'").limit(n_results).to_df()
        data ={"ids": query["id"].tolist(), "documents": query["documents"].tolist()}
        return data




    def retrieve_docs(self, problem: str, n_results: int = 20, search_string: str = ""):
        results = self.query_vector_db(
            query_texts=[problem],
            n_results=n_results,
            search_string=search_string,
        )

        self._results = results
        print("doc_ids: ", results["ids"])

from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent
from autogen.agentchat.contrib.retrive_lancedb import LancedbRetrieveUserProxyAgent

from autogen.agentchat.contrib.retrive_lancedb import create_lancedb
autogen.ChatCompletion.start_logging()

# # 1. create an RetrieveAssistantAgent instance named "assistant"
assistant = RetrieveAssistantAgent(
    name="assistant",
    system_message="You are a helpful assistant.",
    llm_config={
        "request_timeout": 600,
        "seed": 42,
        "config_list": config_list,
    },
)


ragragproxyagent = LancedbRetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=2,
    retrieve_config={
        "task": "qa",
        "chunk_token_size": 2000,
        "client": "__",
        "embedding_model": "all-mpnet-base-v2",
    },
)

create_lancedb()


# reset the assistant. Always reset the assistant before starting a new conversation.


code_problem = "How can I use FLAML to perform a classification task and use spark to do parallel training. Train 30 seconds and force cancel jobs if time limit is reached."
ragragproxyagent.initiate_chat(assistant, problem=code_problem)

And I'm getting the error below. What is wrong here?

doc_ids:  [3, 44, 11, 22, 55, 66]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-d39f5a89dd2c> in <cell line: 38>()
     36 
     37 code_problem = "How can I use FLAML to perform a classification task and use spark to do parallel training. Train 30 seconds and force cancel jobs if time limit is reached."
---> 38 ragragproxyagent.initiate_chat(assistant, problem=code_problem)
     39 
     40 

2 frames
/content/autogen/autogen/agentchat/contrib/retrieve_user_proxy_agent.py in _get_context(self, results)
    243             if idx <= _doc_idx:
    244                 continue
--> 245             if results["ids"][0][idx] in self._doc_ids:
    246                 continue
    247             _doc_tokens = num_tokens_from_text(doc, custom_token_count_function=self.custom_token_count_function)

TypeError: 'int' object is not subscriptable

thinkall · 2023-10-24T15:38:14Z

@akashAD98, could you try updating data = {"ids": query["id"].tolist(), "documents": query["documents"].tolist()} to data = {"ids": [query["id"].tolist()], "documents": [query["documents"].tolist()]}?
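The reason for the extra brackets, as far as the traceback shows, is that _get_context indexes results["ids"][0][idx], so it expects one inner list per query text. Below is a small sketch of the two shapes, using plain Python lists in place of the LanceDB query result. The helper name wrap_single_query is made up for illustration and is not part of AutoGen.

```python
from typing import Dict, List


def wrap_single_query(ids: List[int], documents: List[str]) -> Dict[str, list]:
    # Wrap the flat results of one query into the nested,
    # one-list-per-query layout that results["ids"][0][idx] indexes into.
    return {"ids": [ids], "documents": [documents]}


flat = {"ids": [3, 44, 11], "documents": ["doc a", "doc b", "doc c"]}
nested = wrap_single_query(flat["ids"], flat["documents"])

# Nested layout: [0] selects the first query, the second index selects a hit.
first_id = nested["ids"][0][0]

# Flat layout: [0] is already an int, so indexing it again raises TypeError.
try:
    flat["ids"][0][0]
    flat_error = None
except TypeError as exc:
    flat_error = str(exc)
```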

akashAD98 · 2023-10-25T04:57:32Z

@thinkall Yes.

It works, but it always gives the same output.

from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent
from autogen.agentchat.contrib.retrive_lancedb import LancedbRetrieveUserProxyAgent
from autogen.agentchat.contrib.retrive_lancedb import create_lancedb
autogen.ChatCompletion.start_logging()

# 1. create an RetrieveAssistantAgent instance named "assistant"
assistant = RetrieveAssistantAgent(
    name="assistant",
    system_message="You are a helpful assistant.",
    llm_config={
        "request_timeout": 600,
        "seed": 42,
        "config_list": config_list,
    },
)


ragproxyagent = LancedbRetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    retrieve_config={
        "task": "qa",
        "docs_path": "https://raw.githubusercontent.com/microsoft/autogen/main/README.md",  # change this to your own path, such as https://raw.githubusercontent.com/microsoft/autogen/main/README.md
        "chunk_token_size": 2000,
        "client": "__",
        "embedding_model": "all-mpnet-base-v2",
    },
)

Also, I'm not sure whether setting "client": "__" makes it use LanceDB, or whether I need to define it again there. It also does not overwrite any previous data, which I guess is why it always gives the same output.

I tried it this way, but it is not working:

table = db.open_table("my_table", mode="overwrite")

db = lancedb.connect('/tmp/lancedb')
table = db.create_table("pandas_docs", data=[
    {"vector": embeddings.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")
docsearch = LanceDB.from_documents(documents, embeddings, connection=table)
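One possible explanation for the repeated output, judging only from the class posted earlier in this thread, is that query_vector_db searches with the fixed vector [0.1, 0.3] whenever query_texts is non-empty, so the ranking cannot depend on the question. The sketch below reproduces this with plain Python and the six test vectors from create_lancedb. Here rank_ids, fixed_embed, and toy_embed are hypothetical stand-ins, not LanceDB or AutoGen APIs; a real fix would presumably embed the query with the same model that was used to build the table.

```python
import math
from typing import Callable, List, Sequence

# (vector, id) pairs copied from the test data in create_lancedb.
ROWS = [
    ([1.1, 1.2], 11),
    ([0.2, 1.8], 22),
    ([0.1, 0.3], 3),
    ([0.5, 0.7], 44),
    ([2.1, 1.3], 55),
    ([5.1, 8.3], 66),
]


def rank_ids(query_text: str, embed: Callable[[str], Sequence[float]]) -> List[int]:
    # Order row ids by Euclidean distance to the embedded query.
    query_vector = embed(query_text)
    ranked = sorted(ROWS, key=lambda row: math.dist(row[0], query_vector))
    return [row_id for _, row_id in ranked]


def fixed_embed(text: str) -> List[float]:
    # Mirrors the posted class: the query text is ignored.
    return [0.1, 0.3]


def toy_embed(text: str) -> List[float]:
    # Hypothetical stand-in for a real embedding model: any function
    # whose output actually depends on the text.
    return [len(text) / 10.0, float(len(text.split()))]


# With the fixed vector, two unrelated questions get the same ranking.
same_a = rank_ids("question one", fixed_embed)
same_b = rank_ids("a completely different question", fixed_embed)

# With a text-dependent embedding, the ranking can change with the question.
varied_a = rank_ids("hi", toy_embed)
varied_b = rank_ids("x" * 50, toy_embed)
```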

Update retrieve_utils.py

7723e9d

akashAD98 changed the title ~~Update retrieve_utils.py added lancdb as vectordb~~ Update retrieve_utils.py added lancedb as vectordb Sep 27, 2023

sonichi requested a review from thinkall September 27, 2023 18:18

Merge branch 'main' into feature/vectordb_Lancedb

104e67c

thinkall requested changes Sep 28, 2023

akashAD98 added 2 commits September 28, 2023 23:02

Merge branch 'microsoft:main' into feature/vectordb_Lancedb

547f855

Merge branch 'microsoft:main' into feature/vectordb_Lancedb

ab6fe00

akashAD98 had a problem deploying to openai October 3, 2023 09:28 — with GitHub Actions Failure

Update retrieve_utils.py

ff69abe

akashAD98 had a problem deploying to openai October 3, 2023 09:29 — with GitHub Actions Failure

thinkall requested changes Oct 3, 2023

thinkall self-assigned this Oct 3, 2023

Update retrieve_utils.py

99e87fa

removed duplicate code & using same embedding function instead of hugging face

akashAD98 had a problem deploying to openai October 3, 2023 16:32 — with GitHub Actions Failure

Update retrieve_user_proxy_agent.py

c5e5cf6

added vector_database parameter

akashAD98 had a problem deploying to openai October 3, 2023 16:46 — with GitHub Actions Failure

thinkall had a problem deploying to openai October 8, 2023 02:47 — with GitHub Actions Failure

thinkall mentioned this pull request Oct 9, 2023

Add support to customized vectordb and embedding functions #161

Merged

3 tasks

Merge branch 'microsoft:main' into feature/vectordb_Lancedb

fe37085

akashAD98 had a problem deploying to openai October 9, 2023 16:53 — with GitHub Actions Failure

Update retrieve_utils.py

7b4f58c

akashAD98 had a problem deploying to openai October 9, 2023 17:00 — with GitHub Actions Failure

Update retrieve_utils.py

96db23e

akashAD98 had a problem deploying to openai October 9, 2023 17:05 — with GitHub Actions Failure

qingyun-wu closed this in #161 Oct 10, 2023

akashAD98 deleted the feature/vectordb_Lancedb branch October 24, 2023 12:28

thinkall mentioned this pull request Oct 24, 2023

Fix tmp dir not exists #401

Merged

3 tasks

thinkall mentioned this pull request Oct 25, 2023

lancedb support #416

Closed

jackgerrits pushed a commit that referenced this pull request Oct 2, 2024

Knowledge abstraction to AiAgent (#25)

fda381a

* make ghClient fetch synchronous * refactor memory, extract knowledge adding to AiAgent

jackgerrits added a commit that referenced this pull request Oct 2, 2024

Remove require_response, rename broadcast to publish, remove publish …

cb55e00

…responses (#25) * rename broadcast to publish * remove require response, remove responses from publishing
