feat: Write/retrieve chunks using postgres #17
Conversation
This removes the dependency on Redis, and makes the chunks/embeddings in the postgres database work. There are some issues to be addressed, specifically deduplicating cases where multiple embeddings of the same chunk are retrieved. I plan to work on those in a follow-up PR, so that we can get the bulk of this in first.
dewy/common/collection_embeddings.py
Outdated
FROM relevant_embeddings
JOIN chunk
ON chunk.id = relevant_embeddings.chunk_id
LIMIT $2
It seems like we could invert this and use SELECT DISTINCT ... from chunk
to get the deduplicated chunks.
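A rough sketch of that inversion (only the JOIN tail is visible in this diff, so the shape of the relevant_embeddings CTE and the chunk columns are guesses):

```python
# Hypothetical sketch of the inverted, deduplicated query. The
# relevant_embeddings CTE and the chunk columns are assumptions -- only
# the JOIN tail above is shown in this diff.
DEDUPED_QUERY = """
WITH relevant_embeddings AS (
    SELECT chunk_id
    FROM embedding
    ORDER BY embedding <=> $1
    LIMIT $2
)
SELECT DISTINCT chunk.id, chunk.text
FROM chunk
JOIN relevant_embeddings
  ON chunk.id = relevant_embeddings.chunk_id
LIMIT $2
"""
```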
Maybe? I want to get something in first and then play with it. I'd like to be able to point a psql repl at the database with chunks loaded in, and then see what works (and also use EXPLAIN to see what the query does, etc.). Deferring.
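For reference, checking what the planner does is just a matter of prefixing the query with EXPLAIN ANALYZE, whether in a psql repl or from a quick script like this sketch (which assumes the hypothetical DEDUPED_QUERY above and pgvector's asyncpg codec):

```python
import asyncpg
from pgvector.asyncpg import register_vector

async def show_plan(dsn: str, query_embedding, limit: int) -> None:
    # Equivalent to running EXPLAIN ANALYZE in a psql repl. DEDUPED_QUERY
    # is the hypothetical sketch above, not the actual query in this PR.
    conn = await asyncpg.connect(dsn)
    try:
        await register_vector(conn)  # teach asyncpg the pgvector type
        rows = await conn.fetch(
            f"EXPLAIN ANALYZE {DEDUPED_QUERY}", query_embedding, limit
        )
        for row in rows:
            print(row["QUERY PLAN"])
    finally:
        await conn.close()
```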
url, extract_tables=self.extract_tables, extract_images=self.extract_images
)
if extracted.is_empty():
    logger.error(
If this is an error, shouldn't it throw an exception?
It could -- but with background tasks, there isn't really anything to do with that error. What I think we actually need to do is mark the document (or ingestion associated with the document) as failed and/or do some kind of dead letter. That said -- perhaps we shouldn't treat this as an error?
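For the record, the "mark as failed" idea could be a small UPDATE, something like this sketch (the ingest_state and ingest_error columns are hypothetical, not the current schema):

```python
# Hypothetical sketch: record the failure on the document row instead of
# raising from a background task. The ingest_state and ingest_error
# columns are assumptions, not the actual schema.
async def mark_ingest_failed(conn, document_id: int, error: str) -> None:
    await conn.execute(
        """
        UPDATE document
        SET ingest_state = 'failed',
            ingest_error = $2
        WHERE id = $1
        """,
        document_id,
        error,
    )
```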
# Then, embed each of those chunks.
# We assume no chunks for the document existed before, so we can iterate
# over the chunks.
Can we not return the chunk IDs or something? This seems like an assumption that's going to cause bugs as soon as we support updating a document.
We could. My thinking was that we could write them into the DB rather than trying to keep them in memory and then read them back out. But, I think that both llamaindex and various other embeddings will lead to the whole text having to fit in memory anyway during an ingest, so maybe it doesn't matter.
Actually, I take that back. There isn't a great way to do that. Specifically:
- This uses executemany, which doesn't return anything.
- If we use fetch, we can't provide a list of rows to insert -- it needs to be a single query.
I think I'll leave it as is for this PR. We could handle updates in a variety of ways:
- Introduce a new document ID and delete the old one.
- Add a "version" to each chunk, and query for only the chunks related to the current version.
- etc.
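For what it's worth, a single INSERT ... SELECT over unnest can both take the whole list and return the generated IDs, sidestepping the executemany limitation. A sketch, with assumed column names:

```python
# Sketch: insert all chunks in one statement and get the generated IDs
# back via RETURNING, avoiding executemany (which returns nothing). The
# chunk table's (document_id, text) columns are assumptions.
async def insert_chunks(conn, document_id: int, texts: list[str]) -> list[int]:
    rows = await conn.fetch(
        """
        INSERT INTO chunk (document_id, text)
        SELECT $1::int, t.text
        FROM unnest($2::text[]) AS t(text)
        RETURNING id
        """,
        document_id,
        texts,
    )
    return [row["id"] for row in rows]
```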
yeah, including an "ingest version" or something that we could filter on the other side would work.
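A sketch of what that filter could look like at query time (the ingest_version and current_ingest_version columns are hypothetical):

```python
# Hypothetical sketch of the "ingest version" idea: each chunk records
# the version it was written under, and retrieval only sees chunks that
# match the document's current version. The columns are assumptions.
CURRENT_CHUNKS_QUERY = """
SELECT chunk.id, chunk.text
FROM chunk
JOIN document
  ON document.id = chunk.document_id
WHERE document.id = $1
  AND chunk.ingest_version = document.current_ingest_version
"""
```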