Allow to specify ID when adding to the FAISS vectorstore. #5190

Merged
merged 6 commits into from
May 25, 2023

Conversation

atisharma
Contributor

@atisharma atisharma commented May 24, 2023

Allow to specify ID when adding to the FAISS vectorstore

This change allows unique IDs to be specified when adding documents / embeddings to a faiss vectorstore.

  • This reflects the current approach with the chroma vectorstore.
  • It allows rejection of inserts on duplicate IDs.
  • It will allow deletion / update by searching on a deterministic ID (such as a hash).
  • If no ID is specified, a random UUID is generated (as per previous behaviour, so non-breaking).
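
As a rough illustration of the behaviour listed above, here is a minimal standard-library sketch. `TinyStore` and `content_id` are hypothetical names invented for this example; they stand in for a vector store and are not the FAISS wrapper or its actual implementation. The sketch assumes that a batch with any duplicate ID is rejected as a whole.

```python
import hashlib
import uuid
from typing import Dict, List, Optional


class TinyStore:
    """Hypothetical stand-in for a vector store; not the FAISS wrapper itself."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def add_texts(self, texts: List[str], ids: Optional[List[str]] = None) -> List[str]:
        # Default to random UUIDs when the caller supplies no IDs.
        if ids is None:
            ids = [str(uuid.uuid4()) for _ in texts]
        if len(ids) != len(texts):
            raise ValueError("ids and texts must have the same length")
        # Assumption: reject the whole batch if any ID repeats or is already stored.
        if len(set(ids)) != len(ids) or any(i in self.docs for i in ids):
            raise ValueError("duplicate ID")
        self.docs.update(zip(ids, texts))
        return ids


def content_id(text: str) -> str:
    """Deterministic ID: the SHA-256 hex digest of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


store = TinyStore()
texts = ["alpha", "beta"]
store.add_texts(texts, ids=[content_id(t) for t in texts])

# Re-inserting the same content yields the same ID, so the insert is rejected.
try:
    store.add_texts(["alpha"], ids=[content_id("alpha")])
    rejected = False
except ValueError:
    rejected = True
```

Because the ID is derived from the content, the same document can later be located for deletion or update without storing a separate lookup table.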

This commit fixes #5065 and #3896 and should fix #2699 indirectly. I've tested adding and merging.

Kindly tagging @Xmaster6y @dev2049 for review.

@Xmaster6y
Contributor

Xmaster6y commented May 24, 2023

I think that all the add methods should have the extra id parameter.

Specifying the ids in the from class methods should also be possible.

@atisharma
Contributor Author

Why?

They all pass through the ids argument to the underlying __add or __from.

Is it for the documentation?

This gives visibility in the documentation.
@atisharma
Contributor Author

I added the change you suggested.

@Xmaster6y
Contributor

Xmaster6y commented May 24, 2023

__add is not meant to be used outside the class as a "private method".

I mean how do you specify the ids if you use add_texts?

@@ -432,6 +443,7 @@ def from_embeddings(
text_embeddings: List[Tuple[str, List[float]]],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
Contributor

this should be passed to __from right?

Contributor Author

done

@@ -161,6 +167,7 @@ def add_embeddings(
text_embeddings: Iterable pairs of string and embedding to
add to the vectorstore.
metadatas: Optional list of metadatas associated with the texts.
ids: Optional list of unique IDs.
Contributor

this should be passed to __add right?

Contributor Author

done

**kwargs: Any,
) -> List[str]:
"""Run more texts through the embeddings and add to the vectorstore.

Args:
texts: Iterable of strings to add to the vectorstore.
metadatas: Optional list of metadatas associated with the texts.
ids: Optional list of unique IDs.
Contributor

this should be passed to __add right?

Contributor Author

done

This is now necessary because they are listed in the input arguments
explicitly.
@atisharma
Contributor Author

atisharma commented May 24, 2023

__add is not meant to be used outside the class as a "private method".

I know.

I mean how do you specify the ids if you use add_texts?

This wasn't necessary while ids was not listed explicitly as an argument of the public access methods, because the ids argument passed through **kwargs to the private __add and __from methods.

After the change you requested, yes, I should have passed them explicitly.
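
The pass-through point can be sketched in a few lines. The function names below are hypothetical and module-level for brevity (the real methods are class members); the sketch only shows why naming `ids` in the public signature means it must then be forwarded explicitly.

```python
from typing import Any, List, Optional


def _add(texts: List[str], ids: Optional[List[str]] = None, **kwargs: Any) -> List[str]:
    # Hypothetical private helper: fall back to placeholder IDs when none are given.
    return ids if ids is not None else [f"auto-{i}" for i in range(len(texts))]


def add_texts_passthrough(texts: List[str], **kwargs: Any) -> List[str]:
    # ids is not named here, so it travels inside **kwargs and reaches _add.
    return _add(texts, **kwargs)


def add_texts_named_but_dropped(
    texts: List[str], ids: Optional[List[str]] = None, **kwargs: Any
) -> List[str]:
    # ids is now captured by the named parameter, so **kwargs no longer carries it.
    return _add(texts, **kwargs)


def add_texts_named_and_forwarded(
    texts: List[str], ids: Optional[List[str]] = None, **kwargs: Any
) -> List[str]:
    # Forwarding ids explicitly restores the intended behaviour.
    return _add(texts, ids=ids, **kwargs)
```

With `ids=["x"]`, the first and third variants hand `"x"` to `_add`, while the second silently falls back to generated IDs.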

Ati Sharma and others added 3 commits May 24, 2023 23:22
Fix typo that would silently cause ids to be a sequential index instead
of the intended id, when using __from.
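
The thread does not show the typo itself. Purely as an illustration of how a bug of this kind can stay silent, the hypothetical sketch below builds a position-to-ID mapping two ways; both function names and the mapping shape are assumptions made for this example, not the code from the commit.

```python
from typing import Dict, List


def build_mapping_with_slip(ids: List[str]) -> Dict[int, str]:
    # Hypothetical slip: the loop counter is stored instead of the caller's ID.
    # Nothing raises, so the mistake only shows up when IDs are looked up later.
    return {i: str(i) for i, _id in enumerate(ids)}


def build_mapping_intended(ids: List[str]) -> Dict[int, str]:
    # Intended behaviour: each position maps to the ID the caller supplied.
    return {i: _id for i, _id in enumerate(ids)}
```

Both versions return a mapping of the right size and type, which is why a check on the stored values, rather than on the shape of the result, is what catches the difference.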
Contributor

@hwchase17 hwchase17 left a comment

thanks!

@hwchase17 hwchase17 added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label May 25, 2023
@hwchase17 hwchase17 merged commit 40b086d into langchain-ai:master May 25, 2023
@atisharma atisharma deleted the faiss_id branch May 25, 2023 08:50
@danielchalef danielchalef mentioned this pull request Jun 5, 2023
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
Co-authored-by: Ati Sharma <ati@agalmic.ltd>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
This was referenced Jun 25, 2023
3 participants