vectorization on demand #1258

Merged
merged 37 commits into from
Jul 29, 2024


Yuqi-Du
Contributor

@Yuqi-Du Yuqi-Du commented Jul 10, 2024

The update commands findOneAndUpdate, updateOne, updateMany, and findOneAndReplace should vectorize the update clause or replacement document only when needed, that is, when:

  • the find operation returns a document,
    OR
  • the command is an upsert, in which case we always vectorize.

Vectorization on demand requires the find operation to have executed first, so these four update commands can no longer be vectorized at the commandResolver level. Instead, vectorization is postponed to ReadAndUpdateOperation, which uses the documentUpdater and vectorizes only as needed.

As a result, DataVectorizer no longer has vectorizeUpdateClause, since vectorization for update commands now happens at the operation level. The refactored documentUpdater returns an updateResponse that may contain an UpdateEmbeddingOperation, and $vector is updated after vectorization has been applied to $vectorize.
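
The on-demand rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `VectorizeOnDemandSketch`, the `embed` stub, and `vectorizeOnDemand` are hypothetical names, and the real logic lives in ReadAndUpdateOperation and the documentUpdater.

```java
import java.util.Optional;

class VectorizeOnDemandSketch {

  // Counts calls to the (hypothetical) embedding provider stub.
  static int embeddingCalls = 0;

  // Hypothetical stand-in for an embedding provider call.
  static float[] embed(String text) {
    embeddingCalls++;
    return new float[] {text.length()};
  }

  // Vectorize only when the find returned a document or the command is an upsert.
  static Optional<float[]> vectorizeOnDemand(
      boolean documentFound, boolean upsert, String vectorizeText) {
    if (!documentFound && !upsert) {
      return Optional.empty(); // nothing to update, so no embedding call is made
    }
    return Optional.of(embed(vectorizeText));
  }

  public static void main(String[] args) {
    System.out.println(vectorizeOnDemand(false, false, "hello").isPresent()); // false
    System.out.println(vectorizeOnDemand(true, false, "hello").isPresent()); // true
    System.out.println(vectorizeOnDemand(false, true, "hello").isPresent()); // true
    System.out.println(embeddingCalls); // 2
  }
}
```

The point of the sketch is that the embedding call is skipped entirely when no document matched and the command is not an upsert.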

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CLA Signed: DataStax CLA

@Yuqi-Du Yuqi-Du requested a review from a team as a code owner July 10, 2024 21:04
@Yuqi-Du Yuqi-Du changed the title from "Fix update vectorize" to "vectorization on demand" Jul 10, 2024
Yuqi-Du added 5 commits July 10, 2024 14:33
# Conflicts:
#	src/main/java/io/stargate/sgv2/jsonapi/service/embedding/DataVectorizerService.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/embedding/operation/DataVectorizerTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/model/impl/CommandResolverWithVectorizerTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/model/impl/UpdateManyCommandResolverTest.java
# Conflicts:
#	src/main/java/io/stargate/sgv2/jsonapi/service/embedding/DataVectorizerService.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/embedding/operation/DataVectorizerTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/model/impl/CommandResolverWithVectorizerTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/model/impl/UpdateManyCommandResolverTest.java
Collaborator

@vkarpov15 vkarpov15 left a comment

LGTM 👍

@Yuqi-Du Yuqi-Du marked this pull request as draft July 12, 2024 17:51
Yuqi-Du added 4 commits July 15, 2024 17:12
# Conflicts:
#	src/main/java/io/stargate/sgv2/jsonapi/service/embedding/DataVectorizer.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/embedding/DataVectorizerService.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/embedding/operation/DataVectorizerTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/model/impl/CommandResolverWithVectorizerTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/model/impl/UpdateManyCommandResolverTest.java
Contributor

@amorton amorton left a comment

went through, couple of improvements but also 2 bugs

Yuqi-Du added 7 commits July 22, 2024 10:33
# Conflicts:
#	src/main/java/io/stargate/sgv2/jsonapi/service/embedding/DataVectorizer.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/embedding/DataVectorizerService.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/operation/collections/ReadAndUpdateCollectionOperation.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/resolver/FindOneAndReplaceCommandResolver.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/resolver/FindOneAndUpdateCommandResolver.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/resolver/UpdateManyCommandResolver.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/resolver/UpdateOneCommandResolver.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/operation/collections/ReadAndUpdateCollectionOperationRetryTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/operation/collections/ReadAndUpdateCollectionOperationTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/operation/collections/SerialConsistencyOverrideOperationTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/CommandResolverWithVectorizerTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/FindOneAndReplaceCommandResolverTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/resolver/UpdateManyCommandResolverTest.java
@Yuqi-Du Yuqi-Du marked this pull request as ready for review July 24, 2024 00:34
dataVectorizerService.constructDataVectorizer(dataApiRequestInfo, commandContext);
// TODO: only SetOperation and Replacement may create one embeddingUpdateOperation, Refactor
// when there are multiple
final EmbeddingUpdateOperation embeddingUpdateOperation = embeddingUpdateOperations.get(0);
Contributor

Why use get(0) when we have a list? I understand we currently support only one vector, but it would be better to use Multi and merge here, so that this code doesn't need to change when multiple vectorize fields are supported.

Contributor Author

Solved
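
As an illustration of the suggestion in this thread, the sketch below processes every entry in the list instead of taking `get(0)`. It is not the change that was made in this PR. Plain `CompletableFuture` is swapped in for the project's reactive types so the example runs with only the JDK, and `EmbeddingUpdate`, `VectorizedUpdate`, `embedAsync`, and `vectorizeAll` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

class VectorizeAllSketch {

  // Hypothetical simplification: the field to write and the text to embed.
  record EmbeddingUpdate(String vectorPath, String vectorizeContent) {}

  // Hypothetical result: the field to write and the computed vector.
  record VectorizedUpdate(String vectorPath, float[] vector) {}

  // Hypothetical stand-in for an asynchronous embedding provider call.
  static CompletableFuture<float[]> embedAsync(String text) {
    return CompletableFuture.supplyAsync(() -> new float[] {text.length()});
  }

  // Start one embedding call per update, then merge the results in input order.
  static CompletableFuture<List<VectorizedUpdate>> vectorizeAll(List<EmbeddingUpdate> updates) {
    List<CompletableFuture<VectorizedUpdate>> pending =
        updates.stream()
            .map(u -> embedAsync(u.vectorizeContent())
                .thenApply(v -> new VectorizedUpdate(u.vectorPath(), v)))
            .toList();
    return CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]))
        .thenApply(done -> pending.stream().map(CompletableFuture::join).toList());
  }

  public static void main(String[] args) {
    List<VectorizedUpdate> results =
        vectorizeAll(List.of(new EmbeddingUpdate("$vector", "hello"))).join();
    System.out.println(results.size()); // 1
    System.out.println(results.get(0).vectorPath()); // $vector
  }
}
```

With a single-element list this behaves like the `get(0)` version, and it needs no change if more than one embedding update is ever produced.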

// when there are multiple
final EmbeddingUpdateOperation embeddingUpdateOperation = embeddingUpdateOperations.get(0);
return dataVectorizer
.vectorize(embeddingUpdateOperation.vectorizeContent())
Contributor

It looks like this method is called multiple times when an UpdateMany has to update multiple documents. Let's lazily cache the vector inside the EmbeddingUpdateOperation.

Contributor Author

Aaron and I discussed this part and decided not to add the cache here; it could be a separate improvement.
I understand that what you did previously in DataVectorizer avoids this problem.

Contributor Author

Will create a ticket.
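
For reference, the lazy caching proposed in this thread, which was deferred and is not part of this PR, could look something like the sketch below. `CachedEmbedding` is a hypothetical wrapper and not the project's EmbeddingUpdateOperation, and the embedder function stands in for the real embedding provider call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class CachedEmbeddingSketch {

  // Hypothetical holder that memoizes the embedding call for one piece of text.
  static final class CachedEmbedding {
    private final String vectorizeContent;
    private CompletableFuture<float[]> cached; // null until the first request

    CachedEmbedding(String vectorizeContent) {
      this.vectorizeContent = vectorizeContent;
    }

    // The first caller triggers the embedding call; later callers reuse the same future.
    synchronized CompletableFuture<float[]> vector(
        Function<String, CompletableFuture<float[]>> embedder) {
      if (cached == null) {
        cached = embedder.apply(vectorizeContent);
      }
      return cached;
    }
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    Function<String, CompletableFuture<float[]>> embedder =
        text -> {
          calls.incrementAndGet();
          return CompletableFuture.completedFuture(new float[] {text.length()});
        };

    CachedEmbedding operation = new CachedEmbedding("hello");
    // Simulate an updateMany touching three documents with the same update clause.
    for (int i = 0; i < 3; i++) {
      operation.vector(embedder).join();
    }
    System.out.println(calls.get()); // 1
  }
}
```

One trade-off in this sketch is that a failed future is cached as well, so a real implementation would need to decide whether to retry after a provider error.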

@@ -110,4 +110,7 @@ public int compare(ActionWithLocator o1, ActionWithLocator o2) {
return o1.path().compareTo(o2.path());
}
}

public record UpdateOperationResult(
Contributor

Java doc?

Contributor Author

Solved

@@ -30,7 +30,7 @@ public List<A> actions() {
* @param doc Document to apply operation to
* @return True if document was modified by operation; false if not.
Contributor

Update the java doc here.

Contributor Author

Solved

@@ -227,77 +223,6 @@ public void dynamicFilterCondition() throws Exception {
});
}

@Test
public void dynamicFilterConditionSetVectorize() throws Exception {
Contributor

Instead of removing the test, can we modify it to show what the data would look like? That will be helpful when changes are made around it.

Contributor Author

The tests in the DocumentUpdaterTest class cover this and should make it clear.
This one was deleted because we no longer vectorize the update clause.
Instead, we vectorize only if a document is found, and UpdateManyCommandResolverTest.java doesn't actually mimic a document returned by the DB.

@@ -420,183 +413,6 @@ public void findOneAndDelete() throws Exception {
});
}

@Test
public void findOneAndReplace() throws Exception {
Contributor

Instead of removing the test, can we change it to show what the object looks like when $vectorize is used?

Contributor Author

Same as above.

Yuqi-Du added 2 commits July 26, 2024 12:56
# Conflicts:
#	src/main/java/io/stargate/sgv2/jsonapi/service/embedding/DataVectorizer.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/updater/DocumentUpdater.java
@Yuqi-Du Yuqi-Du merged commit 6b62bc6 into main Jul 29, 2024
3 checks passed
@Yuqi-Du Yuqi-Du deleted the fix-update-vectorize branch July 29, 2024 15:59
4 participants