ChromaDB DuplicateIDError on memgpt load
#986
Comments
Could you please check your version of chroma and also provide your Python version? I tried loading a file multiple times and didn't get an error, but I do see this printed:
This is my version of chroma:
|
I am getting the same error. I checked my ChromaDB version and it is '0.4.22'. |
@sarahwooders Happy to test it with my documents after you give me a green light if that's helpful. Thank you. |
@vinayak-revelation thank you! Could you please try the latest 0.3.2 release? |
Tried it, still getting the same error:
File "/home/ubuntu/projects/MemGPT-Prod/memgpt-prod-ve/lib/python3.11/site-packages/chromadb/api/types.py", line 240, in validate_ids
Do I need to clear out the ChromaDB data or delete some remnants from previous work before I do this? Also, memgpt version now gives me: |
Hmm ok, thanks for letting me know @vinayak-revelation - I don't have a large file to test with on hand, but it seems I need to actually repro this to fix it, so I will try to do it asap. |
If you’d like, I could upload the file I was using so you can reproduce the error. |
Yes that would be great! If you can upload here or DM me on discord that'd be very helpful. |
I will try to get you one. The file I am using right now is protected, but let me find something and attach it here. |
I am trying to find free documents online, and the biggest ones I can find are txt files (out-of-copyright books). Those are ingested with no issues despite being very large. I tried converting those txt files into PDFs and they work too. The one interesting data point I do have is that those PDFs (my personal health records) were parsing and embedding fine with the memgpt load command a couple of releases back, but the same file does not work now. I will keep looking for a good example that can help us debug this issue in the meantime. |
I think I found the issue. When a PDF is imported and chunked, a page that is just a pasted image with little text can produce a chunk identical to another page's chunk (if the headers are the same), because the parser cannot read the image and only picks up the header or whatever little text is there. My suspicion started with this one: mem0ai/mem0#64, and with knowing that anything other than text in these PDFs is not parsed by the default parser. |
The IDs are created by us (not chroma) as a hash of the text and agent ID (https://github.com/cpacker/MemGPT/blob/main/memgpt/data_types.py#L308) - we implemented this to avoid duplication in the DB, but I think you're right that it's what's causing the issue. I still need to repro, but if we filter out duplicates before running the ChromaDB insert, the issue will probably be resolved. |
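To illustrate why identical chunks collide, here is a minimal sketch of a content-derived ID. It assumes an MD5 digest of the agent ID plus the text, formatted as a UUID; the function name `deterministic_id` and the exact hashing scheme are hypothetical and may differ from what data_types.py actually does.

```python
import hashlib
import uuid


def deterministic_id(agent_id: str, text: str) -> str:
    """Hypothetical helper: derive a stable ID from the agent ID and chunk text."""
    # Assumption: MD5 over the concatenated strings; the real scheme may differ.
    digest = hashlib.md5(f"{agent_id}{text}".encode("utf-8")).hexdigest()
    # A 32-character hex digest maps directly onto the UUID text format.
    return str(uuid.UUID(hex=digest))


# Two pages whose only extractable text is the same header hash to the same ID.
page_1 = deterministic_id("agent-1", "Patient Record - Header")
page_2 = deterministic_id("agent-1", "Patient Record - Header")
page_3 = deterministic_id("agent-1", "Patient Record - Lab results")
print(page_1 == page_2)  # identical text, identical ID
print(page_1 == page_3)  # different text, different ID
```

Under this kind of scheme, any two chunks with the same text for the same agent end up with the same ID in a single insert batch.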
Hi @sarahwooders: I was able to generate an example. Notice that the first page has an image and the second page is just text. The first page and second page have the same text, so I think the parser creates two embeddings with essentially the same ID and then breaks. This one broke for me when I ran:
memgpt load directory --name vd1 --input-files ~/Downloads/Example.pdf
with the error:
chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: e87f024f-2b86-d25d-a381-8d02f038b61d
Hope this helps, and thank you for helping fix this. I agree that filtering duplicates and then inserting will work :) |
@vinayak-revelation thanks, I was able to get the same error with that example! Fix should be in #1001 |
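The "filter duplicates, then insert" idea could look roughly like the sketch below. It is not the code from the actual fix; `dedupe_by_id` is a hypothetical helper that keeps the first occurrence of each ID so the batch handed to the vector store contains only unique IDs.

```python
from typing import Dict, List, Tuple


def dedupe_by_id(ids: List[str], texts: List[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: drop records whose ID already appeared in the batch."""
    seen: Dict[str, str] = {}
    for id_, text in zip(ids, texts):
        # Keep only the first record for each ID; later repeats are skipped.
        if id_ not in seen:
            seen[id_] = text
    # Dicts preserve insertion order, so the original ordering is kept.
    return list(seen.keys()), list(seen.values())


ids = ["id-1", "id-2", "id-1"]
texts = ["Header", "Body text", "Header"]
unique_ids, unique_texts = dedupe_by_id(ids, texts)
print(unique_ids)    # ['id-1', 'id-2']
print(unique_texts)  # ['Header', 'Body text']
```

The deduplicated lists can then be passed to the insert call in place of the raw ones, which avoids the uniqueness check failing on repeated IDs within one batch.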
The fix should be in the nightly package which you can install with |
Describe the bug
When inputting my large .jsonl dataset, Chroma throws an error and the data source is unusable.
Please describe your setup
Screenshots
MemGPT Config
config.txt
Local LLM details
If you are trying to run MemGPT with local LLMs, please provide the following information: