Update retrieve_utils.py added lancedb as vectordb #25

Closed
wants to merge 14 commits into from

Conversation

akashAD98

Why are these changes needed?

I want to use LanceDB as the vector DB, so I have added code for it. I know it can be improved, for example by passing arguments or making it more user-friendly, so I'm looking for suggestions on how to add different vector stores. Thanks.

Related issue number

Checks

@akashAD98 akashAD98 changed the title Update retrieve_utils.py added lancdb as vectordb Update retrieve_utils.py added lancedb as vectordb Sep 27, 2023
@sonichi sonichi requested a review from thinkall September 27, 2023 18:18
@thinkall
Collaborator

Thank you @akashAD98 for the PR. I suggest we keep the current APIs and add a parameter vector_database to choose between different vector databases.

def create_vector_db_from_dir(
    dir_path: str,
    max_tokens: int = 4000,
    client: API = None,
    db_path: str = "/tmp/chromadb.db",
    collection_name: str = "all-my-documents",
    get_or_create: bool = False,
    chunk_mode: str = "multi_lines",
    must_break_at_empty_line: bool = True,
    embedding_model: str = "all-MiniLM-L6-v2",
    vector_database: str = "chromadb",
):
    ...

def query_vector_db(
    query_texts: List[str],
    n_results: int = 10,
    client: API = None,
    db_path: str = "/tmp/chromadb.db",
    collection_name: str = "all-my-documents",
    search_string: str = "",
    embedding_model: str = "all-MiniLM-L6-v2",
    vector_database: str = "chromadb",
) -> Dict[str, List[str]]:
    ...

We can define different functions to use different vector databases and call them in the current APIs.

def create_chromadb_from_dir():
    ...

def query_chromadb():
    ...

def create_lancedb_from_dir():
    ...

def query_lancedb():
    ...
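
A minimal sketch of how the suggested vector_database parameter could select a backend-specific function. The backend bodies, the _QUERY_BACKENDS table, and the trimmed-down signature are placeholders for illustration, not the code in this PR:

```python
from typing import Callable, Dict, List

QueryResult = Dict[str, List[List[str]]]


def query_chromadb(query_texts: List[str], n_results: int = 10) -> QueryResult:
    # Placeholder standing in for the existing chromadb query logic.
    return {"ids": [["chroma-doc"]], "documents": [["from chromadb"]]}


def query_lancedb(query_texts: List[str], n_results: int = 10) -> QueryResult:
    # Placeholder standing in for a LanceDB-specific query.
    return {"ids": [["lance-doc"]], "documents": [["from lancedb"]]}


# Hypothetical lookup table mapping the parameter value to a backend function.
_QUERY_BACKENDS: Dict[str, Callable[..., QueryResult]] = {
    "chromadb": query_chromadb,
    "lancedb": query_lancedb,
}


def query_vector_db(
    query_texts: List[str],
    n_results: int = 10,
    vector_database: str = "chromadb",
) -> QueryResult:
    # Keep the public API unchanged and dispatch on vector_database.
    try:
        backend = _QUERY_BACKENDS[vector_database]
    except KeyError:
        raise ValueError(f"Unsupported vector_database: {vector_database!r}") from None
    return backend(query_texts, n_results=n_results)
```

With this shape, existing callers keep the chromadb default, and adding another backend only needs one more entry in the lookup table.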

Could you please also update the tests for the new util functions?

Thank you very much again for your contribution.

Collaborator

@thinkall thinkall left a comment

Comments left in the last reply.

@codecov-commenter

codecov-commenter commented Sep 28, 2023

Codecov Report

Merging #25 (9978f60) into main (dc70b80) will decrease coverage by 3.49%.
The diff coverage is 17.94%.

@@            Coverage Diff             @@
##             main      #25      +/-   ##
==========================================
- Coverage   39.98%   36.49%   -3.49%     
==========================================
  Files          17       16       -1     
  Lines        2036     2066      +30     
  Branches      453      458       +5     
==========================================
- Hits          814      754      -60     
- Misses       1149     1242      +93     
+ Partials       73       70       -3     
Flag Coverage Δ
unittests 36.49% <17.94%> (-3.39%) ⬇️

Files Coverage Δ
...gen/agentchat/contrib/retrieve_user_proxy_agent.py 4.39% <0.00%> (-0.03%) ⬇️
autogen/retrieve_utils.py 48.52% <18.42%> (-15.17%) ⬇️

... and 4 files with indirect coverage changes

@akashAD98
Author

@thinkall Yes, thanks for the reply. I'll add this.

@akashAD98
Author

@thinkall I made the changes as per your suggestion. Could you please review them? Let me know if there is anything I need to do or modify. Thanks.

Collaborator

@thinkall thinkall left a comment

Thank you so much, @akashAD98 , nice job! I've left some comments, could you please address them?

Could you also add some tests in autogen/test/test_retrieve_utils.py to cover the new functions?

Thank you again for your contribution! Let me know if you need any help.

@thinkall
Collaborator

thinkall commented Oct 3, 2023

The code format check has failed. Could you please run pre-commit install in your local repo root folder? That will enable auto-formatting for your code changes.

@thinkall thinkall self-assigned this Oct 3, 2023
removed duplicate code & using same embedding function instead of hugging face
@akashAD98
Author

The code format check has failed. Could you please run pre-commit install in your local repo root folder? That will enable auto-formatting for your code changes.

Yes, I'm working on it. Thank you so much for your guidance.

added  vector_database parameter
@akashAD98
Author

Hi @akashAD98 , thank you very much for the updates. I left some comments for the code.

The tests are failing. Could you update your code to make sure the tests pass, and also add a test for lancedb?

Sorry, I missed your message. Yes, I'm working on it.

@thinkall
Collaborator

Hi @akashAD98 , thank you very much for the updates. I left some comments for the code.
The tests are failing. Could you update your code to make sure the tests pass, and also add a test for lancedb?

Sorry, I missed your message. Yes, I'm working on it.

Hi @akashAD98 , I'm thinking about a more general way of supporting different vector dbs in this PR #161 . Could you check if this works for your use case? Thank you very much!

@thinkall
Collaborator

I see your reply in #161. Let me try adding your case as a test in my PR.

@akashAD98
Author

akashAD98 commented Oct 10, 2023

I see your reply in #161. Let me try adding your case as a test in my PR.

Hi, have you tested with LanceDB as the vector DB for the test use case?

8cb1bcd
Here it is, but it is failing?

@thinkall
Collaborator

I see your reply in #161. Let me try adding your case as a test in my PR.

Hi, have you tested with LanceDB as the vector DB for the test use case?

8cb1bcd Here it is, but it is failing?

Hi @akashAD98 , check the example here:

def test_custom_vector_db(self):

@akashAD98
Author

akashAD98 commented Oct 24, 2023

@thinkall I'm running the code in Colab. I have defined the class

LancedbRetrieveUserProxyAgent

import logging

from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

logger = logging.getLogger(__name__)

try:
    import lancedb
except ImportError:
    logger.critical("lancedb is not installed. Try running 'pip install lancedb'")
    raise

db_path = "/tmp/lancedb"


def create_lancedb():
    # Create a small demo table with hand-written 2-D vectors.
    db = lancedb.connect(db_path)
    data = [
        {"vector": [1.1, 1.2], "id": 11, "documents": "This is a test document spark"},
        {"vector": [0.2, 1.8], "id": 22, "documents": "This is another test document"},
        {"vector": [0.1, 0.3], "id": 3, "documents": "This is a third test document spark"},
        {"vector": [0.5, 0.7], "id": 44, "documents": "This is a fourth test document"},
        {"vector": [2.1, 1.3], "id": 55, "documents": "This is a fifth test document spark"},
        {"vector": [5.1, 8.3], "id": 66, "documents": "This is a sixth test document"},
    ]
    try:
        db.create_table("my_table", data)
    except OSError:
        # The table already exists; keep the existing data.
        pass


class LancedbRetrieveUserProxyAgent(RetrieveUserProxyAgent):
    def query_vector_db(self, query_texts, n_results=10, search_string=""):
        # Fixed placeholder query vector; query_texts is not embedded here.
        vector = [0.1, 0.3]
        db = lancedb.connect(db_path)
        table = db.open_table("my_table")
        query = table.search(vector).where(f"documents LIKE '%{search_string}%'").limit(n_results).to_df()
        data = {"ids": query["id"].tolist(), "documents": query["documents"].tolist()}
        return data

    def retrieve_docs(self, problem: str, n_results: int = 20, search_string: str = ""):
        results = self.query_vector_db(
            query_texts=[problem],
            n_results=n_results,
            search_string=search_string,
        )

        self._results = results
        print("doc_ids: ", results["ids"])


from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent
from autogen.agentchat.contrib.retrive_lancedb import LancedbRetrieveUserProxyAgent

from autogen.agentchat.contrib.retrive_lancedb import create_lancedb
autogen.ChatCompletion.start_logging()

# 1. create a RetrieveAssistantAgent instance named "assistant"
assistant = RetrieveAssistantAgent(
    name="assistant",
    system_message="You are a helpful assistant.",
    llm_config={
        "request_timeout": 600,
        "seed": 42,
        "config_list": config_list,
    },
)


ragragproxyagent = LancedbRetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=2,
    retrieve_config={
        "task": "qa",
        "chunk_token_size": 2000,
        "client": "__",
        "embedding_model": "all-mpnet-base-v2",
    },
)

create_lancedb()


# reset the assistant. Always reset the assistant before starting a new conversation.


code_problem = "How can I use FLAML to perform a classification task and use spark to do parallel training. Train 30 seconds and force cancel jobs if time limit is reached."
ragragproxyagent.initiate_chat(assistant, problem=code_problem)



I'm getting the error below. What is wrong here?

doc_ids:  [3, 44, 11, 22, 55, 66]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-d39f5a89dd2c> in <cell line: 38>()
     36 
     37 code_problem = "How can I use FLAML to perform a classification task and use spark to do parallel training. Train 30 seconds and force cancel jobs if time limit is reached."
---> 38 ragragproxyagent.initiate_chat(assistant, problem=code_problem)
     39 
     40 

2 frames
/content/autogen/autogen/agentchat/contrib/retrieve_user_proxy_agent.py in _get_context(self, results)
    243             if idx <= _doc_idx:
    244                 continue
--> 245             if results["ids"][0][idx] in self._doc_ids:
    246                 continue
    247             _doc_tokens = num_tokens_from_text(doc, custom_token_count_function=self.custom_token_count_function)

TypeError: 'int' object is not subscriptable

@akashAD98 akashAD98 deleted the feature/vectordb_Lancedb branch October 24, 2023 12:28
@thinkall
Collaborator

@akashAD98, could you try updating data ={"ids": query["id"].tolist(), "documents": query["documents"].tolist()} to data ={"ids": [query["id"].tolist()], "documents": [query["documents"].tolist()]}
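
To illustrate why the extra list level matters, here is a small sketch that mimics the indexing shown in the traceback (results["ids"][0][idx]). The helper collect_new_docs and its loop are assumptions made for illustration; only the indexing expression comes from the traceback:

```python
from typing import Dict, List, Set


def collect_new_docs(results: Dict[str, list], seen_ids: Set[int]) -> List[str]:
    # Expects one inner list per query text: results["ids"][0] is the id list
    # for the first query, and results["documents"][0] holds its documents.
    new_docs = []
    for idx, doc in enumerate(results["documents"][0]):
        if results["ids"][0][idx] in seen_ids:
            continue
        new_docs.append(doc)
    return new_docs


# Flat shape, as returned by the original query_vector_db override.
flat = {"ids": [3, 44], "documents": ["doc a", "doc b"]}
# Nested shape, as suggested above: one inner list per query.
nested = {"ids": [[3, 44]], "documents": [["doc a", "doc b"]]}

print(collect_new_docs(nested, seen_ids={3}))  # ['doc b']

try:
    collect_new_docs(flat, seen_ids=set())
except TypeError as err:
    # results["ids"][0] is the int 3, so indexing it again fails.
    print(err)  # 'int' object is not subscriptable
```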

@thinkall thinkall mentioned this pull request Oct 24, 2023
@akashAD98
Author

akashAD98 commented Oct 25, 2023

@thinkall Yes, it works, but it always gives the same output.

from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent
from autogen.agentchat.contrib.retrive_lancedb import LancedbRetrieveUserProxyAgent
from autogen.agentchat.contrib.retrive_lancedb import create_lancedb
autogen.ChatCompletion.start_logging()

# 1. create an RetrieveAssistantAgent instance named "assistant"
assistant = RetrieveAssistantAgent(
    name="assistant",
    system_message="You are a helpful assistant.",
    llm_config={
        "request_timeout": 600,
        "seed": 42,
        "config_list": config_list,
    },
)


ragproxyagent = LancedbRetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    retrieve_config={
        "task": "qa",
        "docs_path": "https://raw.githubusercontent.com/microsoft/autogen/main/README.md",  # change this to your own path, such as https://raw.githubusercontent.com/microsoft/autogen/main/README.md
        "chunk_token_size": 2000,
        "client": "__",
        "embedding_model": "all-mpnet-base-v2",
    },
)


Also, I'm not sure whether adding "client": "__" makes it use LanceDB, or whether I need to define it again there. It also does not overwrite any previous data, which I guess is why it always gives the same output.

I tried it this way:

table = db.open_table("my_table", mode="overwrite")

db = lancedb.connect('/tmp/lancedb')
table = db.create_table("pandas_docs", data=[
    {"vector": embeddings.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")
docsearch = LanceDB.from_documents(documents, embeddings, connection=table)
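
One possible explanation for the identical output, assuming the query_vector_db override posted earlier in this thread is still in use: the query vector is hard-coded to [0.1, 0.3], so the ranking cannot depend on the question. The pure-Python sketch below reproduces this without LanceDB. The squared L2 metric is an assumption about the distance in use, and toy_embed is a hypothetical stand-in for a real embedding model:

```python
from typing import List, Sequence

# The same ids and vectors as the demo table created by create_lancedb().
ROWS = [
    {"vector": [1.1, 1.2], "id": 11},
    {"vector": [0.2, 1.8], "id": 22},
    {"vector": [0.1, 0.3], "id": 3},
    {"vector": [0.5, 0.7], "id": 44},
    {"vector": [2.1, 1.3], "id": 55},
    {"vector": [5.1, 8.3], "id": 66},
]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_ids(query_vector: Sequence[float], n_results: int = 10) -> List[int]:
    # Brute-force nearest-neighbour search over the demo rows.
    ranked = sorted(ROWS, key=lambda row: squared_l2(row["vector"], query_vector))
    return [row["id"] for row in ranked[:n_results]]


def toy_embed(text: str) -> List[float]:
    # Hypothetical 2-D "embedding" that only exists to make the vector depend on the text.
    return [float(len(text) % 7), float(text.count(" ") % 5)]


# A fixed query vector gives the same ranking for every question.
print(search_ids([0.1, 0.3]))  # [3, 44, 11, 22, 55, 66]

# Deriving the vector from the text lets the ranking change with the question.
print(search_ids(toy_embed("spark")))
print(search_ids(toy_embed("hello world foo")))
```

The first ranking matches the doc_ids printed in the earlier run, which is consistent with the fixed vector being the cause rather than stale table data.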

image

@thinkall thinkall mentioned this pull request Oct 25, 2023
jackgerrits pushed a commit that referenced this pull request Oct 2, 2024
* make ghClient fetch synchronous

* refactor memory, extract knowledge adding to AiAgent
jackgerrits added a commit that referenced this pull request Oct 2, 2024
…responses (#25)

* rename broadcast to publish

* remove require response, remove responses from publishing