Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embed RAG data closer to source data? #675

Open
FroMage opened this issue Jun 13, 2024 · 4 comments
Open

Embed RAG data closer to source data? #675

FroMage opened this issue Jun 13, 2024 · 4 comments
Labels
area/embedding-store question Further information is requested

Comments

@FroMage
Copy link

FroMage commented Jun 13, 2024

The current RAG model for pgvector is to store the documents in their own table.

In my application my source documents already have a table:

@Entity
public class Talk extends PanacheEntity {
 public String description;
 public String title;
}

So, when I iterate those to index them, they all go in their separate table:

// DO LLM
Log.infof("Loading data from talks for LLM");
List<Talk> talks = Talk.listAll();
List<Document> documents = new ArrayList<>();
store.removeAll();
Log.infof("Documents: %d", talks.size());
for (Talk talk : talks) {
  Map<String, String> metadata = new HashMap<>();
  metadata.put("title", talk.title);
  metadata.put("id", talk.id.toString());
  if(!talk.description.isBlank()) {
    documents.add(new Document("Title: "+talk.title+"\nID: "+talk.id+"\nDescription: "+talk.description, Metadata.from(metadata)));
  } else {
    Log.infof("Skipping talk %s", talk.getTitle());
  }
}
Log.infof("Injesting LLM");
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(DocumentSplitters.recursive(2000, 0))
                .build();
// Warning - this can take a long time...
ingestor.ingest(documents);
Log.infof("Injesting LLM done");

This leads me to wonder how I can keep my model and the index in sync. What do I do when I update a single Talk entity? Do I need to re-index the entire store?

Intuitively, I was expecting to be able to do something like:

@Entity
public class Talk extends PanacheEntity {
 public String description;
 public String title;
@JdbcTypeCode(SqlTypes.VECTOR)
@Array(length = 3)
public float[] myvector;

 @IndexProducer
 public Document getDocumentForIndex(){
   if(!description.isBlank()) {
            Map<String, String> metadata = new HashMap<>();
			metadata.put("title", talk.title);
			metadata.put("id", talk.id.toString());
     return new Document("Title: "+title+"\nID: "+id+"\nDescription: "+description, Metadata.from(metadata)));
    } else {
      return null;
    }
 }

	@PreUpdate
	@PrePersist
	public void prePersist() {
		// tell langchain4j to reindex me, somehow
	}
}

But I'm not too sure how to wire this up.

I suppose that a PanacheEmbeddingStore could have this sort of API for batch reindex, given this:

// DO LLM
Log.infof("Loading data from talks for LLM");
List<Document> documents = store.getDocumentsForModel(Talk.class);
store.removeAll();
Log.infof("Injesting LLM");
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(DocumentSplitters.recursive(2000, 0))
                .build();
// Warning - this can take a long time...
ingestor.ingest(documents);
Log.infof("Injesting LLM done");
@geoand
Copy link
Collaborator

geoand commented Jun 13, 2024

cc @jmartisk

@jmartisk
Copy link
Collaborator

Quickly updating the corresponding embeddings if one of the source documents changes is unfortunately not trivial, because there's no 'equality' operator in this world.
So, to keep track of which embeddings are related to which document, you probably need to use some sort of ID. That ID can be part of the metadata.

Could you perhaps, in the PrePersist method, take the ID of the document that is being updated, do something like

// remove the original embeddings
embeddingStore.removeAll(new IsEqualTo("id" , talk.id));

// create the new embedding
Map<String, String> metadata = new HashMap<>();
metadata.put("title", talk.title);
metadata.put("id", talk.id.toString());
Document document = new Document("Title: "+talk.title+"\nID: "+talk.id+"\nDescription: "+talk.description, Metadata.from(metadata));
List<TextSegment> split = DocumentSplitters.recursive(2000, 0).split(document);
List<Embedding> newEmbeddings = embeddingModel.embedAll(split).content();
store.addAll(newEmbeddings);

This also depends on the particualar embedding store implementing the removeAll method though, and I think PgVector doesn't support it right now :/

@jmartisk
Copy link
Collaborator

I've just noticed there's also the related langchain4j/langchain4j#1299 (except it's about using the embedding ID instead of one stored as part of the metadata)

@langchain4j
Copy link
Collaborator

PgVectorEmbeddingStore supports removeAll(Filter) where Filter can be metadataKey("id").isEqualTo(talk.id)

@geoand geoand added area/embedding-store question Further information is requested labels Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/embedding-store question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants