[Issue][RAG] Reuse an existing vector database for RetrieveUserProxyAgent #1261

Closed
Tracked by #1657
juandpinto opened this issue Jan 15, 2024 · 5 comments · Fixed by #2313
Labels
rag retrieve-augmented generative agents

Comments

@juandpinto

juandpinto commented Jan 15, 2024

Describe the issue

Is there a way to reuse a RAG vector database using a RetrieveUserProxyAgent rather than recreating it each time I rerun my code? I've saved it locally using "client": chromadb.PersistentClient(path="./chromadb") but now I get UniqueConstraintError: Collection autogen-docs already exists.

Here's the relevant part of my code:

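import chromadb
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    code_execution_config=False,
    retrieve_config={
        "task": "qa",
        "docs_path": "./doc.txt",
        "chunk_token_size": 800,
        # Persist the vector store on disk so it survives between runs.
        "client": chromadb.PersistentClient(path="./chromadb"),
    },
)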

Edit

When I set "get_or_create": True I no longer get the error message, but it still recalculates all the vectors anew each time.

@juandpinto
Author

I got it to work by removing the docs_path from retrieve_config. Seems a little counter-intuitive but it works.
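
A minimal sketch of that workaround, for anyone landing here: pass docs_path only when the persisted collection is missing or empty, and omit it on later runs. The helper names are hypothetical and not part of autogen. The sketch assumes the client exposes get_collection() and the collection exposes count(), as chromadb clients do, and that "collection_name" is accepted in retrieve_config with "autogen-docs" (the name in the error message above) as the default.

```python
def collection_has_documents(client, name="autogen-docs"):
    """Return True if `client` already holds a non-empty collection called `name`."""
    try:
        collection = client.get_collection(name)
    except Exception:
        # chromadb raises when the collection is missing; the exception
        # type has varied between releases, so catch broadly here.
        return False
    return collection.count() > 0


def build_retrieve_config(client, docs_path, name="autogen-docs"):
    """Build a retrieve_config that only passes docs_path on the first run."""
    config = {
        "task": "qa",
        "chunk_token_size": 800,
        "client": client,
        "collection_name": name,  # assumed key, see the note above
        "get_or_create": True,
    }
    if not collection_has_documents(client, name):
        # First run (or empty collection): let the agent ingest the documents.
        config["docs_path"] = docs_path
    return config
```

With the setup from the original post, client would be chromadb.PersistentClient(path="./chromadb") and the returned dict would be passed as retrieve_config. Note that a non-empty collection only shows that something was ingested, not that every current file was, so this does not pick up documents added later.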

@ekzhu added the "rag retrieve-augmented generative agents" label on Jan 15, 2024
@ekzhu
Collaborator

ekzhu commented Jan 15, 2024

@juandpinto thanks for the issue. Do you think there is a better way to handle the retrieve_config? We are currently under-staffed on RAG. You are welcome to make a PR to make it better.

@ekzhu changed the title from "Reuse existing vector database for RAG" to "[Issue][RAG] Reuse an existing vector database for RetrieveUserProxyAgent" on Jan 15, 2024
@mferris77

I'm running into this issue as well. While I don't have a solution, I must say the limited information I've been able to glean from the tutorials and documentation is counter-intuitive as mentioned.

Mostly what I've found is from one or two of the notebooks:

"get_or_create": True,  # set to False if you don't want to reuse an existing collection, but you'll need to remove the collection manually

...and...

get_or_create (Optional, bool): if True, will create/return a collection for the retrieve chat. This is the same as that used in chromadb. Default is False. Will raise ValueError if the collection already exists and get_or_create is False. Will be set to True if docs_path is None.

How I initially interpreted this based on the name of the function is that it will get a collection if it exists - otherwise it will create it. So I thought if I point it at an 'ingest' folder, it would auto-detect anything it hadn't already parsed.

set to False if you don't want to reuse an existing collection

In my mind, I do want to reuse an existing collection, so I should set this to True!

if True, will create/return a collection for the retrieve chat.

...ok, further confirmation that I want to use True, because it will 'create or return' a collection! (no, it will create AND return!)

However, after dabbling with chromadb directly and reviewing their documentation, it seems that chromadb doesn't actually keep track of the names of the files that were parsed. Or perhaps autogen isn't doing something it could be to identify which document a chunk comes from (additional metadata, perhaps?).

At the very least, a less ambiguous explanation of how document ingestion is handled would be helpful, for example: "If get_or_create = True and docs_path is provided, autogen will ingest files in docs_path even if they have already been ingested. You are responsible for controlling when new documents should be ingested." Alternatively, the docs could show how to ingest files outside of the RAG agent config while setting get_or_create = False.

Other thoughts might be to move files into a 'processed' subfolder once ingested, or depending on what's possible with chromadb, adding metadata to the database entries that indicate the file path. I see there's the option to add/update metadata, I also see that documents have a 'uri' property but in my case it's empty for all documents.
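
A rough sketch of the 'processed' subfolder idea, using only the standard library. The ingest callback is hypothetical and stands in for whatever chunks and stores a single file; nothing here is autogen or chromadb API.

```python
import shutil
from pathlib import Path


def ingest_and_archive(docs_dir, ingest):
    """Ingest each pending file in docs_dir, then move it to docs_dir/processed."""
    docs_dir = Path(docs_dir)
    processed = docs_dir / "processed"
    processed.mkdir(exist_ok=True)
    moved = []
    # sorted() materializes the listing, so moving files mid-loop is safe.
    for path in sorted(docs_dir.iterdir()):
        if not path.is_file():
            continue  # skips the processed/ folder and any other directories
        ingest(path)  # hypothetical callback: chunk and store one file
        shutil.move(str(path), str(processed / path.name))
        moved.append(path.name)
    return moved
```

Files are only moved after ingest returns, so a failure leaves the file in place to be retried on the next run. The sketch does not handle two files with the same name ending up in processed.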

...I've just spent some time digging through the code, and it seems that with some minor changes to the create_vector_db_from_dir function in retrieve_utils.py, adding a uri metadata field should be possible. Right now it looks like the function reads in all available files and immediately chunks them up for insertion into the database, which prevents us from associating a chunk with its filename/path/URI. So maybe we would want to loop through each file individually, get its URI, then check the collection metadata to see if that URI exists. If not, chunk and insert the file and add its URI to the metadata.
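
The per-file loop described above could look roughly like the sketch below. This is not the actual create_vector_db_from_dir code. It assumes a collection object with chromadb-style get(where=..., limit=...) and add(ids=..., documents=..., metadatas=...) methods, uses a "uri" metadata key chosen purely for illustration, assumes UTF-8 text files, and substitutes naive character-based chunking for autogen's token-based splitting.

```python
from pathlib import Path


def chunk_text(text, size=800):
    """Naive fixed-width character chunks; a stand-in for token-based splitting."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ingest_new_files(collection, docs_dir, size=800):
    """Add only files whose URI is not yet recorded in the collection's metadata."""
    added = []
    for path in sorted(Path(docs_dir).rglob("*")):
        if not path.is_file():
            continue
        uri = path.resolve().as_uri()
        # Skip files that already have at least one chunk tagged with this URI.
        if collection.get(where={"uri": uri}, limit=1)["ids"]:
            continue
        chunks = chunk_text(path.read_text(encoding="utf-8"), size)
        if not chunks:
            continue  # empty file: nothing to insert
        collection.add(
            ids=[f"{uri}#{i}" for i in range(len(chunks))],
            documents=chunks,
            metadatas=[{"uri": uri} for _ in chunks],
        )
        added.append(uri)
    return added
```

Keying on the URI alone means a file that is edited after ingestion would not be re-chunked; storing a modification time or content hash alongside the URI would be one way to catch that.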

You mentioned you're short staffed so I might see if I can get something working myself and will share if I can come up with a good solution.

@ekzhu
Collaborator

ekzhu commented Jan 28, 2024

@mferris77 much appreciated!

@Zhuang-Zhuang-Liu

thank you
