ChromaDB DuplicateIDError on memgpt load #986

Closed
noah53866 opened this issue Feb 10, 2024 · 16 comments
noah53866 commented Feb 10, 2024

Describe the bug
When inputting my large .jsonl dataset, Chroma throws an error and the data source is unusable.

Please describe your setup

  • How did you install memgpt?
    • 'pip -U install pymemgpt' and 'pip -U install pymemgpt[local]'
  • Describe your setup
    • Running MemGPT on a WSL Ubuntu distro.
    • Running it via Powershell.

Screenshots

(screenshots of the error attached)

MemGPT Config
config.txt


Local LLM details

If you are trying to run MemGPT with local LLMs, please provide the following information:

  • Mistral-7B-Instruct-v0.2 with a homemade LORA.
  • WebUI.
  • 32 GB RAM, RTX 3060, AMD Ryzen 7 5800X.
@sarahwooders
Collaborator

Could you please check your version of Chroma, and also provide your Python version? I tried loading a file multiple times and didn't get an error, but I do see these prints:

Insert of existing embedding ID: c739a952-e133-f4c3-0f78-c05927587fcd
Insert of existing embedding ID: ba060ef7-eec5-ddf9-8213-04c6001c288f
Insert of existing embedding ID: ed543fd0-3075-7b3e-e5bf-f50daa8f2c74
Insert of existing embedding ID: 63b47f06-b9c1-6c2e-573b-610850b57cdf
Insert of existing embedding ID: 5c0f40c9-d755-b157-0e11-1efb00d6a8ae
Insert of existing embedding ID: 652710fe-d51a-3858-44d9-9acac7d54438
Insert of existing embedding ID: d25693ed-ab08-3f62-10df-1ee97970a4e0
Insert of existing embedding ID: 2e4ce943-d706-2587-7800-629130d37c6e
Insert of existing embedding ID: ee3fa4ef-817e-ea09-dbee-68e44313d329
Insert of existing embedding ID: 6773ac43-e443-90e8-27ca-dfdbce37312e
Insert of existing embedding ID: 275c9be7-6aa7-902c-c436-cfe9697592f1
Insert of existing embedding ID: f578fe6a-53e2-8b42-f098-da574d18638e
Insert of existing embedding ID: dd08b25f-7d65-88d8-6ec1-a1345f8205d1
...

This is my version of chroma:

python
>>> import chromadb
>>> chromadb.__version__
'0.4.22'

@vinayak-revelation
Contributor

I am getting the same error. I checked my ChromaDB version and it is '0.4.22'.

@vinayak-revelation
Contributor

@sarahwooders Happy to test it with my documents once you give me the green light, if that's helpful. Thank you.

@sarahwooders
Collaborator

@vinayak-revelation thank you! Could you please try the latest 0.3.2 release?

@vinayak-revelation
Contributor

vinayak-revelation commented Feb 13, 2024

I tried it and am still getting the same error:

File "/home/ubuntu/projects/MemGPT-Prod/memgpt-prod-ve/lib/python3.11/site-packages/chromadb/api/types.py", line 240, in validate_ids
raise errors.DuplicateIDError(message)
chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: c53f9798-d3ac-6e53-84a2-1ff4f6fb2a4d, e5129649-5ba1-9216-695e-8206a1f8d366, 5096539b-1dfa-8b52-2012-fd0b7e707b97, 88eb4638-2328-d70b-b2e2-879e608ef581, 3697637a-ebb6-70b6-fdf5-e935285deb2e, a5a131e2-a480-8666-ae0c-7b303b8f112a, f576b3d4-b7c8-5a65-3150-bd8198029d61, 6381db94-edaf-75c0-fa82-68ac109f0bf4

Do I need to clear out ChromaDB or delete some remnants from previous work before I try this?

Also, memgpt version now reports: 0.3.2

@sarahwooders
Collaborator

Hmm, ok, thanks for letting me know @vinayak-revelation - I don't have a large file to test with on hand, but it seems I need to actually repro this to fix it, so I will try to do so ASAP.

@sarahwooders sarahwooders changed the title DuplicateIDError Chroma DuplicateIDError on memgpt load Feb 13, 2024
@sarahwooders sarahwooders changed the title Chroma DuplicateIDError on memgpt load ChromaDB DuplicateIDError on memgpt load Feb 13, 2024
@noah53866
Author

Hmm, ok, thanks for letting me know @vinayak-revelation - I don't have a large file to test with on hand, but it seems I need to actually repro this to fix it, so I will try to do so ASAP.

If you’d like, I could upload the file I was using so you can reproduce the error.

@sarahwooders
Collaborator

Yes, that would be great! If you can upload it here or DM me on Discord, that'd be very helpful.

@vinayak-revelation
Contributor

I will try to get you one. The file I am using right now is protected, but let me find something and attach it here.
Deeply appreciate your assistance!

@noah53866
Author

Yes, that would be great! If you can upload it here or DM me on Discord, that'd be very helpful.

bigdata.csv

@vinayak-revelation
Contributor

I am trying to find free documents online, and the biggest ones I can find are txt files (out-of-copyright books). Those digest with no issues despite being very large. I also tried converting those txt files into PDFs, and they work too.

The one interesting data point I do have: those PDFs (my personal health records) were parsing and embedding fine with the memgpt load command a couple of releases back, but the same files do not work now.

I will keep looking for a good example that can help us debug this issue in the meantime.

@vinayak-revelation
Contributor

vinayak-revelation commented Feb 13, 2024

I think I found the issue. When you import a PDF, during its chunking, if the parser comes across a page that is just a pasted image with little text, that chunk can end up identical to another page's chunk (if the headers are the same), since the image cannot be parsed out of the PDF and only the header or whatever little text is there gets extracted.
For the error: chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: 7f3c6004-8008-9c9a-5655-edc5ee68164f
How are those IDs calculated? Is Chroma generating them, and if so, is there a way to skip the duplicates when they are found but still import the rest?

My suspicion started with this one: mem0ai/mem0#64, plus knowing that anything other than text in these PDFs is not parsed by the default parser.

@sarahwooders
Collaborator

The IDs are created by us (not Chroma) as a hash of the text and agent ID (https://github.com/cpacker/MemGPT/blob/main/memgpt/data_types.py#L308) - we implemented this to avoid duplication in the DB, but I think you're right that it's what's causing the issue. I still need to repro, but if we filter out duplicates before running the ChromaDB insert, the issue will probably be resolved.
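A minimal sketch of that deterministic-ID idea (the function name is hypothetical; the real implementation lives in memgpt/data_types.py at the link above): hash the chunk text together with the agent ID and fold the digest into a UUID, so identical parsed text for the same agent always maps to the same ID - which is exactly why two pages that parse to the same text collide in ChromaDB.

```python
import hashlib
import uuid

def deterministic_passage_id(text: str, agent_id: str) -> uuid.UUID:
    # Hypothetical sketch: derive a stable UUID from the chunk text
    # and agent ID. Identical (text, agent_id) pairs always produce
    # the same ID, so duplicate chunks hash to duplicate IDs.
    digest = hashlib.md5(f"{agent_id}:{text}".encode("utf-8")).digest()
    return uuid.UUID(bytes=digest)

# Two pages whose parsed text is identical (e.g. an image-only page
# that yields only the same header as a text page) collide:
a = deterministic_passage_id("Quarterly Report - Header", "agent-1")
b = deterministic_passage_id("Quarterly Report - Header", "agent-1")
assert a == b  # same text, same agent -> same ID
```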

@vinayak-revelation
Contributor

Example.pdf

Hi @sarahwooders: I was able to generate an example. Notice the first page has an image and the second page is just text, and both pages have the same text. So I think the parser creates two embeddings with essentially the same ID, and then it breaks.

This one broke for me when I did:

memgpt load directory --name vd1 --input-files ~/Downloads/Example.pdf

with the error:

chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: e87f024f-2b86-d25d-a381-8d02f038b61d

Hope this helps, and thank you for helping fix this. I agree that filtering out duplicates before inserting will work :)
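The fix the thread converges on - drop duplicate IDs before handing the batch to ChromaDB - can be sketched like this (the helper is hypothetical, not MemGPT's actual code; keeping the first occurrence is an assumption):

```python
def dedup_by_id(ids, documents):
    # Keep only the first occurrence of each ID, dropping duplicate
    # chunks instead of letting chromadb raise DuplicateIDError.
    seen = set()
    unique_ids, unique_docs = [], []
    for id_, doc in zip(ids, documents):
        if id_ not in seen:
            seen.add(id_)
            unique_ids.append(id_)
            unique_docs.append(doc)
    return unique_ids, unique_docs

# An image-only page that parses to the same text as another page
# produces a repeated ID; deduplicate before inserting:
ids = ["e87f024f", "abc12345", "e87f024f"]
docs = ["page 1 text", "page 2 text", "page 1 text"]
ids, docs = dedup_by_id(ids, docs)
# The deduplicated lists can then be passed to ChromaDB, e.g.
# collection.add(ids=ids, documents=docs, ...), without tripping
# the validate_ids check that raises DuplicateIDError.
```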

@sarahwooders
Collaborator

@vinayak-revelation thanks, I was able to get the same error with that example! The fix should be in #1001.

@sarahwooders
Collaborator

The fix should be in the nightly package, which you can install with pip install pymemgpt-nightly, and it will be in a release in the next 1-2 days.
