Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update README: new vecton index syntax #62
base: main
Are you sure you want to change the base?
update README: new vecton index syntax #62
Changes from 5 commits
47aee5b
282fdd8
2beacd2
5fa716d
8745e4c
3a3c981
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be encapsulated? User don't know what is TiFlash.
In the syntax design, an explicit SET TIFLASH REPLICA is intentional, because otherwise, adding a vector index to an existing table with existing data may cause replicating a huge amount of data (because of implicit set replica 1) which is out of user expectation. However in this driver, the index can be added right after table creation, thus there is no such risk.
On the other hand, when index is added while table is created is allowed without explicitly setting a TiFlash replica factor, because there is no such risk:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other vector databases looks like the index is generally specified when table is created. I think this could be a better UX. It may be no need to add index after data is inserted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sqlalchemy will create the table in two steps: 1. Creating a table without indexes. 2. Creating indexes. Since we need to explicitly set the tiflash replica when creating vector index separately, we cannot define the index in the table class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wd0517 Thanks for the explain. So it is not possible to execute multiple statements there right? Let's raise some discussions with the PM to check how this can be workarounded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is by design and permanent. If TiFlash component is not deployed corresponding operation should fail, and asking user to deploy a TiFlash in order to use Vector Index (just like FULL TEXT INDEX will fail in MySQL if ENGINE is MEMORY).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid explicit SQL? We can follow how pgvector does: https://github.com/pgvector/pgvector-python?tab=readme-ov-file#peewee
IMO now we provide same capability as pgvector (the index can be created at any time, both when table is created or after table is created), we should provide the same interface and UX as pgvectors', in order to minimize user learning cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Peewee does not support the
vector
prefix for defining an index, so that we still need to use raw sql. https://github.com/coleifer/peewee/blob/master/peewee.py#L2964There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice digging! Looks like it is indeed very hard. When you are free, could you try with something like our own derived Index? Like: