Expose source_model parameter for vector-enabled collections #1606

Merged
merged 24 commits into from
Oct 31, 2024

Conversation

Hazel-Datastax
Contributor

What this PR does:
Expose source_model parameter for vector-enabled collections

Which issue(s) this PR fixes:
Fixes JiraC2-3495

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CLA Signed: DataStax CLA

@Hazel-Datastax Hazel-Datastax marked this pull request as ready for review October 28, 2024 20:46
@Hazel-Datastax Hazel-Datastax requested a review from a team as a code owner October 28, 2024 20:46
Comment on lines 200 to 229
String sourceModel = vector.sourceModel();
String metric = vector.metric();

// decide sourceModel and metric value
if (sourceModel != null) {
  if (metric == null) {
    // (1) sourceModel is provided but metric is not - set metric to cosine or dot_product based
    // on the map
    metric = SUPPORTED_SOURCES.get(sourceModel).getMetric();
  }
  // (2) both sourceModel and metric are provided - do nothing
} else {
  if (metric != null) {
    // (3) sourceModel is not provided but metric is - set sourceModel to 'other'
    sourceModel = SourceModel.OTHER.getSourceModel();
  } else {
    // (4) both sourceModel and metric are not provided - set sourceModel to 'other' and metric
    // to 'cosine'
    sourceModel = SourceModel.OTHER.getSourceModel();
    metric = SimilarityFunction.COSINE.getMetric();
  }
}

if (service != null) {
  // Validate service configuration and auto populate vector dimension.
  vectorDimension = validateVectorize.validateService(service, vectorDimension);
  vector =
      new CreateCollectionCommand.Options.VectorSearchConfig(
-         vectorDimension, vector.metric(), vector.vectorizeConfig());
+         vectorDimension, metric, sourceModel, vector.vectorizeConfig());
} else {
Contributor Author

Do we want to infer the sourceModel from the service? For example, if the user only specifies openai and text-embedding-3-small in the service, do we want to set the sourceModel to openai-v3-small? Currently, we use other as the default value.
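
For reference, the four defaulting cases in the snippet above can be sketched as a standalone method. This is only an illustration under stated assumptions: `VectorDefaultsSketch`, `Resolved`, and `DEFAULT_METRIC_BY_MODEL` are hypothetical names, and the map entries are placeholders, since the actual contents of `SUPPORTED_SOURCES` are not shown in this thread. The sketch also rejects an unknown model name instead of dereferencing a null map entry; whether that guard is needed depends on whether the name has been validated earlier.

```java
import java.util.Map;

/** Sketch only: the names below are illustrative, not the project's actual classes. */
final class VectorDefaultsSketch {

  /** Hypothetical holder for the resolved pair of values. */
  record Resolved(String sourceModel, String metric) {}

  // ASSUMPTION: placeholder mapping. The real per-model metrics live in the
  // project's SUPPORTED_SOURCES map, which is not shown in this thread.
  static final Map<String, String> DEFAULT_METRIC_BY_MODEL =
      Map.of("example-model-a", "dot_product", "other", "cosine");

  static Resolved resolve(String sourceModel, String metric) {
    if (sourceModel != null) {
      if (metric == null) {
        // (1) only sourceModel is provided: take the metric from the map
        String mapped = DEFAULT_METRIC_BY_MODEL.get(sourceModel);
        if (mapped == null) {
          throw new IllegalArgumentException("Unknown source model: " + sourceModel);
        }
        metric = mapped;
      }
      // (2) both are provided: keep them as given
    } else {
      // (3) and (4) sourceModel is missing: fall back to 'other'
      sourceModel = "other";
      if (metric == null) {
        // (4) metric is missing as well: fall back to 'cosine'
        metric = "cosine";
      }
    }
    return new Resolved(sourceModel, metric);
  }
}
```

Calling `VectorDefaultsSketch.resolve(null, null)` yields `other` and `cosine`, which matches case (4) in the original code.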

Collaborator

@vkarpov15 vkarpov15 left a comment

LGTM

@@ -141,6 +141,8 @@ public enum ErrorCodeV1 {

VECTOR_SEARCH_INVALID_FUNCTION_NAME("Invalid vector search function name"),

VECTOR_SEARCH_INVALID_SOURCE_MODEL_NAME("Invalid vector search source model name"),
Contributor

Not sure but should we use "Unrecognized" instead of "Invalid" for this (I know we use "invalid" above so maybe it's more consistent)?

Contributor Author

Yes, I used "Invalid" because the entry above uses it. I changed this one to "Unrecognized". Should we change the one above to "Unrecognized" as well?

if (sourceModel.isEmpty()) return OTHER;
SourceModel model = SOURCE_MODELS_MAP.get(sourceModel);
if (model == null) {
throw ErrorCodeV1.VECTOR_SEARCH_INVALID_SOURCE_MODEL_NAME.toApiException("'%s'", sourceModel);
Contributor

I think this should then also list the valid/known source model names, not just the invalid/unrecognized value.

Contributor Author

Agree. I followed the pattern in SimilarityFunction. Do we want to change it as well?
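
One way the suggestion could look is sketched below. It is a hypothetical stand-in rather than the project's code: `SourceModelLookupSketch` and `requireKnown` are invented names, `IllegalArgumentException` replaces the project's API exception, and the message wording is an assumption. The list of names is copied from the validation message quoted in this PR's test.

```java
import java.util.List;

/** Sketch only: a hypothetical stand-in for the lookup in SourceModel. */
final class SourceModelLookupSketch {

  // Names copied from the validation message quoted in this PR's test.
  static final List<String> KNOWN_NAMES =
      List.of(
          "openai-v3-large", "openai-v3-small", "ada002", "gecko",
          "nv-qa-4", "cohere-v3", "bert", "other");

  /** Returns the name if it is known; an empty string falls back to "other". */
  static String requireKnown(String name) {
    if (name.isEmpty()) return "other";
    if (!KNOWN_NAMES.contains(name)) {
      // ASSUMPTION: IllegalArgumentException stands in for the project's API exception,
      // and the message format is illustrative.
      throw new IllegalArgumentException(
          String.format(
              "Unrecognized vector search source model name: '%s'; known names: '%s'",
              name, String.join("', '", KNOWN_NAMES)));
    }
    return name;
  }
}
```

With this shape, the error for an unknown name reports both the rejected value and every accepted one, so a caller can correct the request without consulting the documentation.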

Contributor

@tatu-at-datastax tatu-at-datastax left a comment

Looks pretty good, but I would like some changes as per my comments. Will re-review once the conflicts are merged.

startsWith(
"Request invalid: field 'command.options.vector.sourceModel' value \"invalidName\" not valid. Problem: sourceModel options are 'openai-v3-large', 'openai-v3-small', 'ada002', 'gecko', 'nv-qa-4', 'cohere-v3', 'bert', and 'other'."));
}
}
Contributor

One other thing to test: pass a non-String value for sourceModel (such as a JSON Object) and verify the error handling.

Contributor Author

If a JSON Object is passed, deserializing CreateCollectionCommand fails with the error below. Do we want to use this error?

{
    "errors": [
        {
            "message": "Request invalid, cannot parse as JSON: Cannot deserialize value of type `java.lang.String` from Object value (token `JsonToken.START_OBJECT`)\n at [Source: (ByteArrayInputStream); line: 6, column: 32] (through reference chain: io.stargate.sgv2.jsonapi.api.model.command.impl.CreateCollectionCommand[\"options\"]->io.stargate.sgv2.jsonapi.api.model.command.impl.CreateCollectionCommand$Options[\"vector\"]->io.stargate.sgv2.jsonapi.api.model.command.impl.CreateCollectionCommand$Options$VectorSearchConfig[\"sourceModel\"])",
            "errorCode": "INVALID_REQUEST_NOT_JSON"
        }
    ]
}

…llection_source_model

# Conflicts:
#	src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/TableSchemaObject.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/cqldriver/executor/VectorConfig.java
#	src/main/java/io/stargate/sgv2/jsonapi/service/schema/collections/CollectionSchemaObject.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/embedding/operation/DataVectorizerTest.java
#	src/test/java/io/stargate/sgv2/jsonapi/service/embedding/operation/TestEmbeddingProvider.java
Comment on lines -30 to -32
/** Key for vector function name definition in cql index. */
String VECTOR_INDEX_FUNCTION_NAME = "similarity_function";

Contributor Author

This duplicates the constant in VectorConstants. I think it belongs there, so I removed the one here.

Contributor

@tatu-at-datastax tatu-at-datastax left a comment

Looks good. Two smaller suggestions; approving, and you can decide whether to follow them.

@Hazel-Datastax Hazel-Datastax merged commit 69cb961 into main Oct 31, 2024
3 checks passed
@Hazel-Datastax Hazel-Datastax deleted the hazel/collection_source_model branch October 31, 2024 22:50
3 participants