fix: ID Mismatch Error in VectorDB During Evaluation #1033 #1056

Open
wants to merge 15 commits into
base: main
from

Conversation

e7217
Contributor

@e7217 e7217 commented Dec 16, 2024

description
Hello

I am suggesting some code changes to address issue #1033. The error occurs when a search returns an item from the vectordb whose ID does not match any ID in the raw_doc corpus. As I understand it, the retriever aims to retrieve the item with the highest score. To address this, I have added a key for the content to each vectordb entry. This change may require additional storage capacity for the vectordb, but it is similar to how Langchain uses a page_content key.
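To illustrate the idea, here is a minimal sketch, not the actual AutoRAG code. It assumes search hits are plain dicts with an `id`, a `score`, and a stored `contents` key. The `corpus` dict, the `hits` list, and both function names are hypothetical and only stand in for the raw_doc corpus and the vectordb results.

```python
# Hypothetical data; none of these names are taken from the AutoRAG codebase.
corpus = {
    "doc-1": "AutoRAG evaluates RAG pipelines.",
    "doc-2": "Vector DBs store embeddings.",
}

# Search hits as a vector DB might return them. "doc-9" is not in the corpus.
hits = [
    {"id": "doc-2", "score": 0.91, "contents": "Vector DBs store embeddings."},
    {"id": "doc-9", "score": 0.87, "contents": "An entry the corpus does not have."},
]


def contents_from_corpus(hits, corpus):
    """Look up contents by ID only; raises KeyError if an ID is missing from the corpus."""
    return [corpus[hit["id"]] for hit in hits]


def contents_from_hits(hits):
    """Read the contents stored next to each vector, so no corpus lookup is needed."""
    return [hit["contents"] for hit in hits]


try:
    contents_from_corpus(hits, corpus)
except KeyError as err:
    # This is the kind of ID mismatch described above.
    print(f"ID mismatch: {err}")

print(contents_from_hits(hits))
```

With the contents stored alongside each vector, the second function still returns text for every hit even when an ID is unknown to the corpus, at the cost of keeping the text in the vectordb as well.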

I have modified some code, but I have only referred to the documentation and have not run the code in practice, so there may be errors.

I appreciate your review. Thank you.

references

@hongsw hongsw requested review from hongsw and bwook00 and removed request for hongsw December 16, 2024 03:47
@e7217
Contributor Author

e7217 commented Dec 16, 2024

This PR may not fully align with your intentions in autorag. I tried to consider as many cases as possible, but there may be aspects you have been concerned about that I am unaware of. I understand that it might not be approved, but I would appreciate any feedback you can provide. Thank you.

@vkehfdl1
Contributor

@e7217 Thank you for the PR! I apologize for the late review.
I will look through it and make some changes if needed. Thank you.

@vkehfdl1
Contributor

@e7217 Actually, we have discussed a structure for AutoRAG that does not use corpus_df at all.
@bwook00 I need your opinion on this. I think we need to discuss it.

Pros

  1. There will be no more errors like "no doc_id in vectordb".
  2. Managing the corpus will be easier.
  3. Someone who only has a vector DB can start using AutoRAG easily.

Cons

  1. It will be difficult to synchronize different vector DBs. => Just use one DB? What about the embedding model?
  2. Results within one project become inconsistent when the vector DB is changed. => Give up on precise experiments?
  3. Managing the correct retrieval_gt might be harder.
