[Issue][RAG] Reuse an existing vector database for RetrieveUserProxyAgent #1261
Comments
I got it to work by removing the |
@juandpinto thanks for the issue. Do you think there is a better way to handle the retrieve_config? We are currently under-staffed on RAG. You are welcome to make a PR to make it better. |
I'm running into this issue as well. While I don't have a solution, I must say the limited information I've been able to glean from the tutorials and documentation is counter-intuitive as mentioned. Mostly what I've found is from one or two of the notebooks:
...and...
How I initially interpreted this based on the name of the function is that it will get a collection if it exists - otherwise it will create it. So I thought if I point it at an 'ingest' folder, it would auto-detect anything it hadn't already parsed.
In my mind, I do want to reuse an existing collection, so I should set this to True!
...ok, further confirmation that I want to use True, because it will 'create or return' a collection! (No: it will create AND return!)

However, after dabbling with chromadb directly and reviewing its documentation, it seems that chromadb doesn't actually keep track of the names of the files that were parsed, or perhaps autogen isn't doing something it could to identify which document a chunk comes from (additional metadata, perhaps?).

At the very least, a less ambiguous explanation of how to handle document ingestion would be helpful, e.g. "If get_or_create = True and docs_path is provided, autogen will ingest files in docs_path even if they have already been ingested. You are responsible for controlling when new documents should be ingested." Alternatively, show how to ingest files outside of the RAG agent config while setting get_or_create = False. Other thoughts: move files into a 'processed' subfolder once ingested or, depending on what's possible with chromadb, add metadata to the database entries that indicates the file path. I see there's the option to add/update metadata, and I also see that documents have a 'uri' property, but in my case it's empty for all documents.

...I've just spent some time digging through the code, and it seems that with some minor changes to the create_vector_db_from_dir function in retrieve_utils.py, adding uri metadata should be possible. Right now it looks like the function reads in all available files and immediately chunks them up for insertion into the database, which prevents us from associating a chunk with its filename/path/URI. So maybe we would want to loop through each file individually, get the URI, then check the collection metadata to see if that URI exists; if not, chunk/insert the file and add its URI to the metadata.

You mentioned you're short-staffed, so I might see if I can get something working myself and will share if I can come up with a good solution. |
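The per-file idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryCollection`, `chunk_text`, and `ingest_new_files` are hypothetical names, the collection is a plain-Python stand-in rather than the chromadb API, and chunking is by character count rather than tokens. It is not the current behavior of create_vector_db_from_dir.

```python
import hashlib
from pathlib import Path


def chunk_text(text, max_chars=200):
    """Split text into fixed-size character chunks (stand-in for token-based chunking)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


class InMemoryCollection:
    """Hypothetical stand-in for a vector-store collection; NOT the chromadb API."""

    def __init__(self):
        self.records = {}  # chunk id -> (chunk text, metadata dict)

    def has_uri(self, uri):
        # True if any stored chunk was tagged with this source URI.
        return any(meta["uri"] == uri for _, meta in self.records.values())

    def add(self, ids, documents, metadatas):
        for id_, doc, meta in zip(ids, documents, metadatas):
            self.records[id_] = (doc, meta)


def ingest_new_files(collection, docs_dir, max_chars=200):
    """Ingest only files whose URI is not yet recorded; return the newly ingested URIs."""
    ingested = []
    for path in sorted(Path(docs_dir).iterdir()):
        if not path.is_file():
            continue
        uri = path.resolve().as_uri()
        if collection.has_uri(uri):
            continue  # already ingested on a previous run, skip re-chunking
        chunks = chunk_text(path.read_text(encoding="utf-8"), max_chars)
        if not chunks:
            continue  # empty file: nothing to store
        # Derive stable chunk ids from the URI so reruns produce the same ids.
        prefix = hashlib.sha1(uri.encode("utf-8")).hexdigest()
        ids = [f"{prefix}-{i}" for i in range(len(chunks))]
        collection.add(ids, chunks, [{"uri": uri} for _ in chunks])
        ingested.append(uri)
    return ingested
```

Calling `ingest_new_files` twice on the same folder should add nothing the second time, and only files added in between get chunked. With a real persistent collection, the `has_uri` check would presumably become a metadata-filtered query, but that mapping is an assumption and untested here.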
@mferris77 much appreciated! |
thank you |
Describe the issue
Is there a way to reuse a RAG vector database using a RetrieveUserProxyAgent rather than recreating it each time I rerun my code? I've saved it locally using
"client": chromadb.PersistentClient(path="./chromadb")
but now I get UniqueConstraintError: Collection autogen-docs already exists.
Here's the relevant part of my code:
Edit
When I set
"get_or_create": True
I no longer get the error message, but it still recalculates all the vectors anew each time.