update README: new vector index syntax #62

wd0517 · 2024-10-16T13:39:46Z

No description provided.

README.md

Co-authored-by: Mini256 <minianter@foxmail.com>

breezewish · 2024-10-17T01:27:32Z

README.md

+# vector index currently depends on tiflash
+session.execute(text('ALTER TABLE sqlalchemy_documents SET TIFLASH REPLICA 1'))


Can this be encapsulated? Users don't know what TiFlash is.

In the syntax design, requiring an explicit SET TIFLASH REPLICA is intentional. Otherwise, adding a vector index to an existing table that already holds data could implicitly set the replica count to 1 and replicate a huge amount of data, which the user would not expect. In this driver, however, the index can be added right after table creation, so there is no such risk.

On the other hand, adding the index when the table is created is allowed without explicitly setting a TiFlash replica factor, because there is no such risk:

CREATE TABLE t (vec VECTOR(3), VECTOR INDEX ((VEC_COSINE_DISTANCE(vec))));
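
To make the two paths concrete, here is a minimal sketch that only builds the SQL strings for each case. The helper names `create_table_with_index_sql` and `add_index_to_existing_table_sql` are hypothetical and not part of this driver; the statements themselves mirror the ones quoted in this thread.

```python
from typing import List


def create_table_with_index_sql(table: str, column: str, dims: int) -> str:
    """Single statement: the vector index is declared inside CREATE TABLE,
    so no explicit SET TIFLASH REPLICA is needed (per the comment above)."""
    return (
        f"CREATE TABLE {table} ("
        f"{column} VECTOR({dims}), "
        f"VECTOR INDEX ((VEC_COSINE_DISTANCE({column})))"
        f")"
    )


def add_index_to_existing_table_sql(
    table: str, column: str, index_name: str
) -> List[str]:
    """Two statements: the replica is requested explicitly, then the index
    is created on the already existing table."""
    return [
        f"ALTER TABLE {table} SET TIFLASH REPLICA 1",
        f"CREATE VECTOR INDEX {index_name} ON {table} "
        f"((VEC_COSINE_DISTANCE({column}))) USING HNSW",
    ]
```

The first helper returns one statement, the second returns two in the order they would have to run. Identifiers are interpolated without quoting, so this is only suitable for illustration with trusted names.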

breezewish · 2024-10-17T01:30:46Z

README.md

+index = Index(
+    'idx_embedding',
+    func.vec_cosine_distance(Document.embedding),
+    mysql_prefix="vector",
+    mysql_using="hnsw"
+)
+index.create(engine)


For other vector databases, it looks like the index is generally specified when the table is created. I think this would be a better UX. Maybe there is no need to add the index after data is inserted?

SQLAlchemy creates a table in two steps: 1. Create the table without indexes. 2. Create the indexes. Since we need to explicitly set the TiFlash replica when creating a vector index separately, we cannot define the index in the table class.

@wd0517 Thanks for the explanation. So it is not possible to execute multiple statements there, right? Let's raise this with the PM to see how it can be worked around.
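
As a rough illustration of the ordering problem, the sketch below takes the two groups of statements an ORM would emit (table first, indexes second) and places the replica statement between them when a vector index is present. `order_ddl` is a hypothetical helper for discussion only; it does not hook into SQLAlchemy and is not a proposed workaround.

```python
from typing import List


def order_ddl(table: str, create_table: str, create_indexes: List[str]) -> List[str]:
    """Return the statements in the order they would need to run.

    Assumption: a separately created vector index requires
    SET TIFLASH REPLICA to be issued after CREATE TABLE and before
    CREATE VECTOR INDEX, as described in this thread.
    """
    plan = [create_table]
    has_vector_index = any(
        stmt.lstrip().upper().startswith("CREATE VECTOR INDEX")
        for stmt in create_indexes
    )
    if has_vector_index:
        # The extra statement that a two-step table creation has no slot for.
        plan.append(f"ALTER TABLE {table} SET TIFLASH REPLICA 1")
    plan.extend(create_indexes)
    return plan
```

The point of the sketch is that the replica statement has to land between step 1 and step 2, which is exactly where a table class definition gives no place to put it.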

breezewish · 2024-10-17T01:32:06Z

README.md

+Add hnsw index
+
+```python
+# vector index currently depends on tiflash
+db.execute_sql(SQL(
+    "ALTER TABLE peewee_documents SET TIFLASH REPLICA 1;"
+))
+DocumentModel.add_index(SQL(
+    "CREATE VECTOR INDEX idx_embedding ON peewee_documents ((vec_cosine_distance(embedding))) USING HNSW"
+))


Can we avoid explicit SQL? We can follow what pgvector does: https://github.com/pgvector/pgvector-python?tab=readme-ov-file#peewee

IMO, now that we provide the same capability as pgvector (the index can be created at any time, either when the table is created or afterwards), we should provide the same interface and UX as pgvector's, in order to minimize the user's learning cost.

Peewee does not support the vector prefix for defining an index, so we still need to use raw SQL. https://github.com/coleifer/peewee/blob/master/peewee.py#L2964

Nice digging! Looks like it is indeed very hard. When you are free, could you try something like our own derived Index? For example:

class VectorIndex(peewee.Index): ...
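
One possible starting point, sketched without touching peewee internals: a small standalone class that only renders the statement. `VectorIndexSpec` and its fields are hypothetical names. It is not a `peewee.Index` subclass, and whether peewee's own SQL generation can be overridden cleanly is left open here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VectorIndexSpec:
    """Describes a vector index and renders the CREATE VECTOR INDEX statement."""

    name: str
    table: str
    column: str
    distance_func: str = "vec_cosine_distance"
    using: Optional[str] = None  # e.g. "HNSW"; omitted from the SQL when None

    def to_sql(self) -> str:
        sql = (
            f"CREATE VECTOR INDEX {self.name} ON {self.table} "
            f"(({self.distance_func}({self.column})))"
        )
        if self.using:
            sql += f" USING {self.using}"
        return sql
```

The rendered string could then be passed to `DocumentModel.add_index(SQL(...))` as in the README snippet above, which would keep the raw SQL out of user code even if a true derived Index turns out to be impractical.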

breezewish · 2024-10-17T01:37:51Z

README.md

+Add hnsw index
+
+```python
+# vector index currently depends on tiflash


This is by design and permanent. If the TiFlash component is not deployed, the corresponding operation should fail and ask the user to deploy TiFlash in order to use a vector index (just as a FULLTEXT index fails in MySQL if the ENGINE is MEMORY).

README.md

update README

47aee5b

wd0517 requested review from sykp241095, Mini256, IANTHEREAL and Icemap October 16, 2024 13:40

wd0517 added 3 commits October 16, 2024 14:03

fix lint

282fdd8

fix lint

2beacd2

upgrade flake8

5fa716d

Mini256 reviewed Oct 17, 2024

Update README.md

8745e4c

Co-authored-by: Mini256 <minianter@foxmail.com>

breezewish reviewed Oct 17, 2024

breezewish reviewed Oct 17, 2024

aviod explicitly using HNSW

3a3c981
