-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse index support #424
Sparse index support #424
Conversation
2d00db9
to
c58cba0
Compare
ab1f9e8
to
9d8e5b8
Compare
@@ -115,7 +114,6 @@ jobs: | |||
lz4_version: '${{ matrix.lz4_version }}' | |||
protobuf_version: '${{ matrix.protobuf_version }}' | |||
googleapis_common_protos_version: '${{ matrix.googleapis_common_protos_version }}' | |||
DD_TAGS: 'test.configuration.python:${{ matrix.python_version}},test.configuration.grpcio:${{ matrix.grpcio_version }},test.configuration.lz4:${{ matrix.lz4_version }},test.configuration.protobuf:${{ matrix.protobuf_version }},test.configuration.googleapis_common_protos:${{ matrix.googleapis_common_protos_version }}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warnings were showing in the log about these settings for the datadog ddtrace plugin, so I removed them while I was in this file making adjustments.
from pinecone.models import ServerlessSpec, PodSpec, IndexList, CollectionList | ||
|
||
|
||
class PineconeDBControlInterface(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this interface to be a home for docstrings because I was tired of looking at them while trying to edit the Pinecone class. When docs are thorough, you can't see the method signature and the method body at the same time and it annoys me every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this LGTM. Nice work getting all of this in place while working towards improving the codebase in general. All the changes here make sense to me, and feel like the proper direction to go in compared to where we were at a year ago.
Using buf seems like a good idea.
from pinecone.models import ServerlessSpec, PodSpec, IndexList, CollectionList | ||
|
||
|
||
class PineconeDBControlInterface(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, thank you!
Problem
Need to incorporate change for upcoming sparse index support.
Solution
Incorporating these changes represents a few different types of challenges
Control plane changes
vector_type
.Data plane changes
Codegen for proto changes
protoc
CLI from the now-deprecated private repopinecone-protos
. But since we've moved away from that pattern, needed to implement something here for the first time.Dependency changes
Vector object switcheroo for UX
Vector
object generated by OpenAPI for the REST pathway was unfortunately insisting that an empty arrayvalues=[]
be sent when upserting for sparse indexes. This seems annoying and stupid if you are focused on the sparse use case, so I ended up moving away from having users use that generated object in favor of my own custom Vector dataclass. The blast radius of this change was fairly large, requiring updates to the various vector factory classes and hundreds of tests.Vector
inside the SDK:TypedDict
@dataclass
I created in this PRfrom pinecone import Vector
. This was already bad due to the interoperability problems between the GRPC and REST code paths this brought into the vector factory classes (e.g. if you were attempting to pass an OpenAPI Vector object into the GRPC-based upsert implementation extra logic was needed to convert that into the grpc Message object). But with the introduction of sparse, there was no workaround that would give the user the ability to instantiate the OpenAPI vector without also passingvalues=[]
which just seemed like a bridge too far from a UX standpoint. So the decision was made to use this nameVector
to refer to the custom dataclass instead and have that be the main vehicle for user input.I started writing new integration tests for sparse indexes, but they're disabled in this PR because they currently can't run on prod.
Other changes
pinecone/data/types
for dictionary objects the user can pass in various places for vector, sparse values, metadata, filter, etc. This is to increase the consistency and readability of types in the code. When they were defined inline, sometimes small inconsistencies creep in. This wasn't really planned in advance, but I had enough challenges wrangling mypy that it seemed simpler to actually centralize some of these definitions into one place than try to iron out minor inconsistencies any other way.Usage
Type of Change