ChromaDB DuplicateIDError on memgpt load #986

Closed
noah53866 opened this issue Feb 10, 2024 · 16 comments
noah53866 commented Feb 10, 2024

Describe the bug
When inputting my large .jsonl dataset, Chroma throws an error and the data source is unusable.

Please describe your setup

  • How did you install memgpt?
    • 'pip -U install pymemgpt' and 'pip -U install pymemgpt[local]'
  • Describe your setup
    • Running MemGPT on a WSL Ubuntu distro.
    • Running it via Powershell.

Screenshots

(screenshots of the error attached)

MemGPT Config
config.txt


Local LLM details

If you are trying to run MemGPT with local LLMs, please provide the following information:

  • Mistral-7B-Instruct-v0.2 with a homemade LORA.
  • WebUI.
  • 32 GB RAM, RTX 3060, AMD Ryzen 7 5800X.
@sarahwooders
Collaborator

Could you please check your version of Chroma, and also provide your Python version? I tried loading a file multiple times and didn't get an error, but I do see these prints:

Insert of existing embedding ID: c739a952-e133-f4c3-0f78-c05927587fcd
Insert of existing embedding ID: ba060ef7-eec5-ddf9-8213-04c6001c288f
Insert of existing embedding ID: ed543fd0-3075-7b3e-e5bf-f50daa8f2c74
Insert of existing embedding ID: 63b47f06-b9c1-6c2e-573b-610850b57cdf
Insert of existing embedding ID: 5c0f40c9-d755-b157-0e11-1efb00d6a8ae
Insert of existing embedding ID: 652710fe-d51a-3858-44d9-9acac7d54438
Insert of existing embedding ID: d25693ed-ab08-3f62-10df-1ee97970a4e0
Insert of existing embedding ID: 2e4ce943-d706-2587-7800-629130d37c6e
Insert of existing embedding ID: ee3fa4ef-817e-ea09-dbee-68e44313d329
Insert of existing embedding ID: 6773ac43-e443-90e8-27ca-dfdbce37312e
Insert of existing embedding ID: 275c9be7-6aa7-902c-c436-cfe9697592f1
Insert of existing embedding ID: f578fe6a-53e2-8b42-f098-da574d18638e
Insert of existing embedding ID: dd08b25f-7d65-88d8-6ec1-a1345f8205d1
...

This is my version of chroma:

python
>>> import chromadb
>>> chromadb.__version__
'0.4.22'

@vinayak-revelation
Contributor

I am getting the same error. I checked my ChromaDB version and it is '0.4.22'.

@vinayak-revelation
Contributor

@sarahwooders Happy to test it with my documents once you give me the green light, if that's helpful. Thank you.

@sarahwooders
Collaborator

@vinayak-revelation thank you! Could you please try the latest 0.3.2 release?

@vinayak-revelation
Contributor

vinayak-revelation commented Feb 13, 2024

I tried it and am still getting the same error:

File "/home/ubuntu/projects/MemGPT-Prod/memgpt-prod-ve/lib/python3.11/site-packages/chromadb/api/types.py", line 240, in validate_ids
raise errors.DuplicateIDError(message)
chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: c53f9798-d3ac-6e53-84a2-1ff4f6fb2a4d, e5129649-5ba1-9216-695e-8206a1f8d366, 5096539b-1dfa-8b52-2012-fd0b7e707b97, 88eb4638-2328-d70b-b2e2-879e608ef581, 3697637a-ebb6-70b6-fdf5-e935285deb2e, a5a131e2-a480-8666-ae0c-7b303b8f112a, f576b3d4-b7c8-5a65-3150-bd8198029d61, 6381db94-edaf-75c0-fa82-68ac109f0bf4

Do I need to clear out ChromaDB or delete some remnants from previous work before I try this?

Also, memgpt version now reports: 0.3.2

@sarahwooders
Collaborator

Hmm, ok, thanks for letting me know @vinayak-revelation - I don't have a large file to test with on hand, but it seems I need to actually repro this to fix it, so I will try to do so ASAP.

@sarahwooders sarahwooders changed the title DuplicateIDError Chroma DuplicateIDError on memgpt load Feb 13, 2024
@sarahwooders sarahwooders changed the title Chroma DuplicateIDError on memgpt load ChromaDB DuplicateIDError on memgpt load Feb 13, 2024
@noah53866
Author

Hmm, ok, thanks for letting me know @vinayak-revelation - I don't have a large file to test with on hand, but it seems I need to actually repro this to fix it, so I will try to do so ASAP.

If you’d like, I could upload the file I was using so you can reproduce the error.

@sarahwooders
Collaborator

Yes, that would be great! If you can upload it here or DM me on Discord, that'd be very helpful.

@vinayak-revelation
Contributor

I will try to get you one. The file I am using right now is protected, but let me find something and attach it here.
Deeply appreciate your assistance!

@noah53866
Author

Yes, that would be great! If you can upload it here or DM me on Discord, that'd be very helpful.

bigdata.csv

@vinayak-revelation
Contributor

I am trying to find free documents online, and the biggest ones I can find are txt files (out-of-copyright books). Those digest with no issues despite being very large. I also tried converting those txt files into PDFs, and they work too.

The one interesting data point I do have: those PDFs (my personal health records) were parsing and embedding fine with the memgpt load command a couple of releases back, but the same files do not work now.

I will keep looking for a good example that can help us debug this issue in the meantime.

@vinayak-revelation
Contributor

vinayak-revelation commented Feb 13, 2024

I think I found the issue. When you import a PDF, during its chunking, if the parser comes across a page that is just a pasted image with little text, that chunk can end up identical to another page's chunk (if the headers are the same), since the image cannot be parsed out of the PDF and only the header or whatever little text is there gets extracted.
For the error: chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: 7f3c6004-8008-9c9a-5655-edc5ee68164f
How are those IDs calculated? Is Chroma generating them, and if so, is there a way to skip the duplicates when they are found but still import the rest?

My suspicion started with this one: mem0ai/mem0#64, plus knowing that anything other than text in these PDFs is not parsed by the default parser.

@sarahwooders
Collaborator

The IDs are created by us (not Chroma) as a hash of the text and agent ID (https://github.com/cpacker/MemGPT/blob/main/memgpt/data_types.py#L308) - we implemented this to avoid duplication in the DB, but I think you're right that it's what's causing the issue. I still need to repro, but if we filter out duplicates before running the ChromaDB insert, the issue will probably be resolved.
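A minimal sketch of that deterministic-ID idea (the function name is hypothetical; the real implementation lives in memgpt/data_types.py at the link above): hash the chunk text together with the agent ID and fold the digest into a UUID, so identical parsed text for the same agent always maps to the same ID - which is exactly why two pages that parse to the same text collide in ChromaDB.

```python
import hashlib
import uuid

def deterministic_passage_id(text: str, agent_id: str) -> uuid.UUID:
    # Hypothetical sketch: derive a stable UUID from the chunk text
    # and agent ID. Identical (text, agent_id) pairs always produce
    # the same ID, so duplicate chunks hash to duplicate IDs.
    digest = hashlib.md5(f"{agent_id}:{text}".encode("utf-8")).digest()
    return uuid.UUID(bytes=digest)

# Two pages whose parsed text is identical (e.g. an image-only page
# that yields only the same header as a text page) collide:
a = deterministic_passage_id("Quarterly Report - Header", "agent-1")
b = deterministic_passage_id("Quarterly Report - Header", "agent-1")
assert a == b  # same text, same agent -> same ID
```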

@vinayak-revelation
Contributor

Example.pdf

Hi @sarahwooders: I was able to generate an example. Notice the first page has an image and the second page is just text, and both pages have the same text. So I think the parser creates two embeddings with essentially the same ID, and then it breaks.

This one broke for me when I did:

memgpt load directory --name vd1 --input-files ~/Downloads/Example.pdf

with the error:

chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: e87f024f-2b86-d25d-a381-8d02f038b61d

Hope this helps, and thank you for helping fix this. I agree that filtering out duplicates before inserting will work :)
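The fix the thread converges on - drop duplicate IDs before handing the batch to ChromaDB - can be sketched like this (the helper is hypothetical, not MemGPT's actual code; keeping the first occurrence is an assumption):

```python
def dedup_by_id(ids, documents):
    # Keep only the first occurrence of each ID, dropping duplicate
    # chunks instead of letting chromadb raise DuplicateIDError.
    seen = set()
    unique_ids, unique_docs = [], []
    for id_, doc in zip(ids, documents):
        if id_ not in seen:
            seen.add(id_)
            unique_ids.append(id_)
            unique_docs.append(doc)
    return unique_ids, unique_docs

# An image-only page that parses to the same text as another page
# produces a repeated ID; deduplicate before inserting:
ids = ["e87f024f", "abc12345", "e87f024f"]
docs = ["page 1 text", "page 2 text", "page 1 text"]
ids, docs = dedup_by_id(ids, docs)
# The deduplicated lists can then be passed to ChromaDB, e.g.
# collection.add(ids=ids, documents=docs, ...), without tripping
# the validate_ids check that raises DuplicateIDError.
```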

@sarahwooders
Collaborator

@vinayak-revelation thanks, I was able to get the same error with that example! The fix should be in #1001.

@sarahwooders
Collaborator

The fix should be in the nightly package, which you can install with pip install pymemgpt-nightly, and it will be in a release in the next 1-2 days.
