Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add delete and ensure add_texts performs upsert (w/ ID optional) #6126

Merged
merged 9 commits into from
Jun 23, 2023

Conversation

rlancemartin
Copy link
Collaborator

@rlancemartin rlancemartin commented Jun 13, 2023

Goal

We want to ensure consistency across vectordbs:
1/ add delete by ID method to the base vectorstore class
2/ ensure add_texts performs upsert with ID optionally passed

Testing

  • Pinecone: notebook test w/ langchain_test vectorstore.
  • Chroma: Review by @jeffchuber, notebook test w/ in memory vectorstore.
  • Supabase: Review by @copple, notebook test w/ langchain_test table.
  • Weaviate: Notebook test w/ langchain_test index.
  • Elastic: Revied by @vestal. Notebook test w/ langchain_test table.
  • Redis: Asked for review from owner of recent delete method added redis method to delete entries by keys #6222

@vercel
Copy link

vercel bot commented Jun 15, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Jun 23, 2023 6:35pm

@vercel vercel bot temporarily deployed to Preview June 15, 2023 23:47 Inactive
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch from 4f1ebc3 to f428280 Compare June 15, 2023 23:49
@vercel vercel bot temporarily deployed to Preview June 15, 2023 23:49 Inactive
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch from f428280 to e3ac198 Compare June 15, 2023 23:50
@vercel vercel bot temporarily deployed to Preview June 15, 2023 23:50 Inactive
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch from e3ac198 to 8e363ed Compare June 20, 2023 20:59
@vercel
Copy link

vercel bot commented Jun 20, 2023

@rlancemartin is attempting to deploy a commit to the LangChain Team on Vercel.

A member of the Team first needs to authorize it.

@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch from 8e363ed to 9a27c76 Compare June 20, 2023 21:01
@rlancemartin rlancemartin changed the title [Draft] Notebook to test / audit index read / write for various vectorDBs [Draft] Standard add, update, delete by ID interface for vectorDBs Jun 20, 2023
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch 3 times, most recently from 85e941d to 8572885 Compare June 21, 2023 21:27
]

# Handle each insert individually to avoid conflicting IDs
for row in rows:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate. May slow adds considerably. Need to see if there is a better way.

@rlancemartin rlancemartin changed the title [Draft] Standard add, update, delete by ID interface for vectorDBs Add delete_by_id and update add_texts to use IDs / upsert for several vectorstores Jun 22, 2023
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch 3 times, most recently from c261dd2 to 4c53e18 Compare June 22, 2023 21:26
@rlancemartin rlancemartin changed the title Add delete_by_id and update add_texts to use IDs / upsert for several vectorstores Add delete method and ensure add_texts perform upsert by ID for sub-set of vectordbs Jun 22, 2023
@rlancemartin rlancemartin changed the title Add delete method and ensure add_texts perform upsert by ID for sub-set of vectordbs Add delete and ensure add_texts perform upsert by optional ID for sub-set of vectordbs Jun 22, 2023
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch from 4c53e18 to adf5d3c Compare June 22, 2023 21:29
@rlancemartin rlancemartin changed the title Add delete and ensure add_texts perform upsert by optional ID for sub-set of vectordbs Add delete and ensure add_texts performs upsert w/ optional Jun 22, 2023
@rlancemartin rlancemartin changed the title Add delete and ensure add_texts performs upsert w/ optional Add delete and ensure add_texts performs upsert (w/ ID optional) Jun 22, 2023
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch 4 times, most recently from 7ae6cc1 to c7e85cb Compare June 22, 2023 21:57
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch 2 times, most recently from c852a82 to 2c9d66b Compare June 22, 2023 22:44
@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch from 2c9d66b to b03e03d Compare June 22, 2023 23:04
@jeffchuber
Copy link
Contributor

lgtm! 👍 🎉

@@ -461,19 +464,24 @@ def from_texts(

@staticmethod
def delete(
keys: List[str],
ids: Optional[List[str]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets just change this, since its an arg and idk how well used i dont think we need to worry about backwards compat

@rlancemartin rlancemartin force-pushed the rlm/vectordb_testing branch from 8efe420 to 40a32d8 Compare June 23, 2023 18:35
@rlancemartin rlancemartin merged commit be02572 into langchain-ai:master Jun 23, 2023
This was referenced Jun 25, 2023
@michaeltansg
Copy link

@rlancemartin Is it also the intention for this change to standardize the vector DB ids to be UUIDs? The documentation for the supabase integration ( python / javascript ) suggests BIGSERIAL as the type. The underlying database is PostgreSQL with a PGVector extension.

At this moment, if we call the add_documents method, it does not send a list of ids to add_texts method.

base.py:
    def add_documents(self, documents: List[Document], **kwargs: Any) -> List[str]:
        """Run more documents through the embeddings and add to the vectorstore.

        Args:
            documents (List[Document]: Documents to add to the vectorstore.

        Returns:
            List[str]: List of IDs of the added texts.
        """
        # TODO: Handle the case where the user doesn't provide ids on the Collection
        texts = [doc.page_content for doc in documents]
        metadatas = [doc.metadata for doc in documents]
        return self.add_texts(texts, metadatas, **kwargs)

supabase.py:
    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict[Any, Any]]] = None,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> List[str]:
        ids = ids or [str(uuid.uuid4()) for _ in texts]
        docs = self._texts_to_documents(texts, metadatas)

        vectors = self._embedding.embed_documents(list(texts))
        return self.add_vectors(vectors, docs, ids)

The add_texts method then creates UUIDs and the upsert will fail if the table was declared like so:

-- Create a table to store your documents
create table documents (
  id bigserial primary key,
  content text,
  metadata jsonb,
  embedding vector(1536)
);

Any recommendations here?

@rlancemartin
Copy link
Collaborator Author

Ah, I see. Will BIGSERIAL always be used as Supabase primary key? If so, we can modify ID. Feel free to put up a PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants