Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update README: new vecton index syntax #62

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

wd0517
Copy link
Collaborator

@wd0517 wd0517 commented Oct 16, 2024

No description provided.

README.md Outdated Show resolved Hide resolved
Co-authored-by: Mini256 <minianter@foxmail.com>
Comment on lines +82 to +83
# vector index currently depends on tiflash
session.execute(text('ALTER TABLE sqlalchemy_documents SET TIFLASH REPLICA 1'))
Copy link
Member

@breezewish breezewish Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be encapsulated? User don't know what is TiFlash.

In the syntax design, an explicit SET TIFLASH REPLICA is intentional, because otherwise, adding a vector index to an existing table with existing data may cause replicating a huge amount of data (because of implicit set replica 1) which is out of user expectation. However in this driver, the index can be added right after table creation, thus there is no such risk.

On the other hand, when index is added while table is created is allowed without explicitly setting a TiFlash replica factor, because there is no such risk:

CREATE TABLE t(
   vec VECTOR(3),
   VECTOR INDEX ((VEC_COSINE_DISTANCE(vec))
)

README.md Outdated
Comment on lines 84 to 90
index = Index(
'idx_embedding',
func.vec_cosine_distance(Document.embedding),
mysql_prefix="vector",
mysql_using="hnsw"
)
index.create(engine)
Copy link
Member

@breezewish breezewish Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other vector databases looks like the index is generally specified when table is created. I think this could be a better UX. It may be no need to add index after data is inserted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlalchemy will create the table in two steps: 1. Creating a table without indexes. 2. Creating indexes. Since we need to explicitly set the tiflash replica when creating vector index separately, we cannot define the index in the table class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wd0517 Thanks for the explain. So it is not possible to execute multiple statements there right? Let's raise some discussions with the PM to check how this can be workarounded.

README.md Outdated
Comment on lines 164 to 173
Add hnsw index

```python
# vector index currently depends on tiflash
db.execute_sql(SQL(
"ALTER TABLE peewee_documents SET TIFLASH REPLICA 1;"
))
DocumentModel.add_index(SQL(
"CREATE VECTOR INDEX idx_embedding ON peewee_documents ((vec_cosine_distance(embedding))) USING HNSW"
))
Copy link
Member

@breezewish breezewish Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid explicit SQL? We can follow how pgvector does: https://github.com/pgvector/pgvector-python?tab=readme-ov-file#peewee

IMO now we provide same capability as pgvector (the index can be created at any time, both when table is created or after table is created), we should provide the same interface and UX as pgvectors', in order to minimize user learning cost.

Copy link
Collaborator Author

@wd0517 wd0517 Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Peewee does not support the vector prefix for defining an index, so that we still need to use raw sql. https://github.com/coleifer/peewee/blob/master/peewee.py#L2964

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice digging! Looks like it is indeed very hard. When you are free, could you try with something like our own derived Index? Like:

class VectorIndex(peewee.Index):
    ...

Add hnsw index

```python
# vector index currently depends on tiflash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by design and permanent. If TiFlash component is not deployed corresponding operation should fail, and asking user to deploy a TiFlash in order to use Vector Index (just like FULL TEXT INDEX will fail in MySQL if ENGINE is MEMORY).

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants