[Issue][RAG] Reuse an existing vector database for RetrieveUserProxyAgent #1261

juandpinto · 2024-01-15T15:32:52Z

Describe the issue

Is there a way to reuse a RAG vector database using a RetrieveUserProxyAgent rather than recreating it each time I rerun my code? I've saved it locally using "client": chromadb.PersistentClient(path="./chromadb") but now I get UniqueConstraintError: Collection autogen-docs already exists.

Here's the relevant part of my code:

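import chromadb
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

ragproxyagent = RetrieveUserProxyAgent(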
    name="ragproxyagent",
    code_execution_config=False,
    retrieve_config={
        "task": "qa",
        "docs_path": "./doc.txt",
        "chunk_token_size": 800,
        "client": chromadb.PersistentClient(path="./chromadb"),
    },
)

Edit

When I set "get_or_create": True I no longer get the error message, but it still recalculates all the vectors anew each time.

juandpinto · 2024-01-15T15:55:12Z

I got it to work by removing the docs_path from retrieve_config. Seems a little counter-intuitive but it works.
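A minimal sketch of that workaround, for anyone landing here: build the retrieve_config with docs_path on the first run and without it on later runs. The helper name build_retrieve_config and the reuse_existing flag are hypothetical (not part of autogen); the config keys are the ones used earlier in this thread, and the placeholder object stands in for chromadb.PersistentClient(path="./chromadb").

```python
def build_retrieve_config(client, docs_path=None, reuse_existing=False):
    """Return a retrieve_config dict; hypothetical helper, not an autogen API."""
    config = {
        "task": "qa",
        "chunk_token_size": 800,
        "client": client,
    }
    if reuse_existing:
        # Workaround from this thread: leave docs_path out so the existing
        # collection is queried instead of being rebuilt from the documents.
        config["get_or_create"] = True
    else:
        # First run: point at the documents so they get chunked and embedded.
        config["docs_path"] = docs_path
    return config


# Placeholder client so the sketch runs without chromadb installed.
placeholder_client = object()

first_run_config = build_retrieve_config(placeholder_client, docs_path="./doc.txt")
rerun_config = build_retrieve_config(placeholder_client, reuse_existing=True)
```

The resulting dict would be passed as retrieve_config when constructing the RetrieveUserProxyAgent; deciding when reuse_existing should be True is left to the caller.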

ekzhu · 2024-01-15T20:24:44Z

@juandpinto thanks for the issue. Do you think there is a better way to handle the retrieve_config? We are currently under-staffed on RAG. You are welcome to make a PR to make it better.

mferris77 · 2024-01-28T17:34:08Z

I'm running into this issue as well. While I don't have a solution, I must say the limited information I've been able to glean from the tutorials and documentation is counter-intuitive as mentioned.

Mostly what I've found is from one or two of the notebooks:

"get_or_create": True,  # set to False if you don't want to reuse an existing collection, but you'll need to remove the collection manually

...and...

get_or_create (Optional, bool): if True, will create/return a collection for the retrieve chat. This is the same as that used in chromadb. Default is False. Will raise ValueError if the collection already exists and get_or_create is False. Will be set to True if docs_path is None.

How I initially interpreted this based on the name of the function is that it will get a collection if it exists - otherwise it will create it. So I thought if I point it at an 'ingest' folder, it would auto-detect anything it hadn't already parsed.

"set to False if you don't want to reuse an existing collection"

In my mind, I do want to reuse an existing collection, so I should set this to True!

"if True, will create/return a collection for the retrieve chat."

...ok, further confirmation that I want to use True, because it will 'create or return' a collection! (no, it will create AND return!)
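To make the semantics concrete, here is a toy model of a get-or-create lookup. It is only a sketch: ToyClient is a made-up stand-in, not chromadb, and it raises ValueError because that is what the quoted docstring says (the original post actually shows UniqueConstraintError). The point it illustrates is that the lookup is by collection name only and says nothing about which files are already inside.

```python
class ToyClient:
    """Made-up stand-in that keeps collections in a dict keyed by name."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, name, get_or_create=False):
        if name in self._collections:
            if get_or_create:
                # Existing collection is returned as-is, whatever it contains.
                return self._collections[name]
            raise ValueError(f"Collection {name} already exists")
        self._collections[name] = []
        return self._collections[name]


client = ToyClient()
first = client.create_collection("autogen-docs")
first.append("chunk from doc.txt")

# Same name with get_or_create=True hands back the very same collection.
again = client.create_collection("autogen-docs", get_or_create=True)
same_object = again is first

# Same name without get_or_create fails because the name is taken.
try:
    client.create_collection("autogen-docs")
    raised = False
except ValueError:
    raised = True
```

Nothing in this lookup knows whether doc.txt was ingested before, which is consistent with the re-ingestion behaviour reported above when docs_path is still set.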

However after dabbling with chromadb directly and reviewing their documentation, it seems that chromadb doesn't actually keep track of the names of the files that were parsed - or perhaps autogen isn't doing something it could be to identify which document a chunk comes from (additional metadata, perhaps?).

At the very least, a less ambiguous explanation of how to handle document ingestion would be helpful, for example: "If get_or_create = True and docs_path is provided, autogen will ingest files in docs_path even if they have already been ingested. You are responsible for controlling when new documents should be ingested." Alternatively, show how to ingest files outside of the RAG agent config while setting get_or_create = False.

Other thoughts might be to move files into a 'processed' subfolder once ingested, or depending on what's possible with chromadb, adding metadata to the database entries that indicate the file path. I see there's the option to add/update metadata, I also see that documents have a 'uri' property but in my case it's empty for all documents.

...I've just spent some time digging through the code, and it seems that with some minor changes to the create_vector_db_from_dir function in retrieve_utils.py, adding a uri metadata field should be possible. Right now it looks like the function reads in all available files and immediately chunks them up for insertion into the database, which prevents us from associating each chunk with its filename/path/URI. So maybe we would want to loop through each file individually, get its URI, then check the collection metadata to see whether that URI already exists; if not, chunk and insert the file and add its URI to the metadata.
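A rough sketch of the per-file idea, under stated assumptions: InMemoryCollection is a made-up stand-in that mimics a small subset of a chromadb-style collection (add, get with a where filter, count), chunk_text is a naive character splitter rather than autogen's token-based chunking, and ingest_new_files is a hypothetical name. This is not the code in retrieve_utils.py, just the loop described above.

```python
class InMemoryCollection:
    """Made-up stand-in for a chromadb-style collection (small subset only)."""

    def __init__(self):
        self._rows = {}  # id -> (document, metadata)

    def add(self, ids, documents, metadatas):
        for id_, doc, meta in zip(ids, documents, metadatas):
            self._rows[id_] = (doc, meta)

    def get(self, where):
        # Return the ids whose metadata matches every key/value in `where`.
        ids = [
            id_
            for id_, (_, meta) in self._rows.items()
            if all(meta.get(k) == v for k, v in where.items())
        ]
        return {"ids": ids}

    def count(self):
        return len(self._rows)


def chunk_text(text, chunk_size):
    """Naive fixed-width character chunking (placeholder for token splitting)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def ingest_new_files(collection, files, chunk_size=20):
    """Ingest only files whose URI is not recorded yet; return the URIs ingested."""
    ingested = []
    for uri, text in files.items():
        if collection.get(where={"uri": uri})["ids"]:
            continue  # URI already present: skip re-chunking and re-embedding
        chunks = chunk_text(text, chunk_size)
        collection.add(
            ids=[f"{uri}#{i}" for i in range(len(chunks))],
            documents=chunks,
            metadatas=[{"uri": uri} for _ in chunks],
        )
        ingested.append(uri)
    return ingested


collection = InMemoryCollection()
files = {"./doc.txt": "some text long enough to span several chunks"}
first_run = ingest_new_files(collection, files)
second_run = ingest_new_files(collection, files)
```

Here files maps a URI to already-loaded text to keep the sketch self-contained; a real version would read each file from docs_path. Tracking by URI alone would not notice a file whose contents changed, so a content hash in the metadata might be worth considering.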

You mentioned you're short-staffed, so I might see if I can get something working myself and will share if I can come up with a good solution.

ekzhu · 2024-01-28T19:18:34Z

@mferris77 much appreciated!

Zhuang-Zhuang-Liu · 2024-06-05T10:11:36Z

thank you

ekzhu added the rag retrieve-augmented generative agents label Jan 15, 2024

ekzhu changed the title ~~Reuse existing vector database for RAG~~ [Issue][RAG] Reuse an existing vector database for RetrieveUserProxyAgent Jan 15, 2024

This was referenced Apr 6, 2024

[Roadmap] RAG #1657

Open

Support setting vector_db as a param #2313

Merged

sonichi closed this as completed in #2313 Apr 17, 2024
