update README: new vector index syntax #62

Open
wants to merge 6 commits into
base: main
Changes from 5 commits
5 changes: 5 additions & 0 deletions .github/workflows/ci.yml
@@ -16,6 +16,11 @@ jobs:
- name: Checkout
uses: actions/checkout@v3

- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: 3.12

- name: Install dependencies
run: |
python -m pip install --upgrade pip
72 changes: 41 additions & 31 deletions README.md
@@ -2,7 +2,7 @@

This is a Python client for TiDB Vector.

> Now only TiDB Cloud Serverless cluster support vector data type, see this [docs](https://docs.pingcap.com/tidbcloud/vector-search-overview?utm_source=github&utm_medium=tidb-vector-python) for more information.
Both TiDB Cloud Serverless ([doc](https://docs.pingcap.com/tidbcloud/vector-search-overview?utm_source=github&utm_medium=tidb-vector-python)) and TiDB Open Source Version (>= 8.4 DMR) support vector data type.

## Installation

@@ -42,44 +42,52 @@ from tidb_vector.sqlalchemy import VectorType
engine = create_engine('mysql://****.root:******@gateway01.xxxxxx.shared.aws.tidbcloud.com:4000/test')
Base = declarative_base()

class Test(Base):
__tablename__ = 'test'
class Document(Base):
__tablename__ = 'sqlalchemy_documents'
id = Column(Integer, primary_key=True)
embedding = Column(VectorType(3))

# or add hnsw index when creating table
class TestWithIndex(Base):
__tablename__ = 'test_with_index'
id = Column(Integer, primary_key=True)
embedding = Column(VectorType(3), comment="hnsw(distance=l2)")

Base.metadata.create_all(engine)
```

Insert vector data

```python
test = Test(embedding=[1, 2, 3])
session.add(test)
doc = Document(embedding=[1, 2, 3])
session.add(doc)
session.commit()
```

Get the nearest neighbors

```python
session.scalars(select(Test).order_by(Test.embedding.l2_distance([1, 2, 3.1])).limit(5))
session.scalars(select(Document).order_by(Document.embedding.l2_distance([1, 2, 3.1])).limit(5))
```

Get the distance

```python
session.scalars(select(Test.embedding.l2_distance([1, 2, 3.1])))
session.scalars(select(Document.embedding.l2_distance([1, 2, 3.1])))
```

Get within a certain distance

```python
session.scalars(select(Test).filter(Test.embedding.l2_distance([1, 2, 3.1]) < 0.2))
session.scalars(select(Document).filter(Document.embedding.l2_distance([1, 2, 3.1]) < 0.2))
```
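
For intuition about the `< 0.2` threshold used above, the two distance measures can be sketched in plain Python. This is a minimal illustration of the math only, not the code that TiDB or `tidb_vector` executes; `l2_distance` and `cosine_distance` here are local helper functions that merely mirror the method names used in the README.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / (norm_a * norm_b)


stored = [1, 2, 3]    # the embedding inserted in the README example
query = [1, 2, 3.1]   # the query vector used in the README example

print(round(l2_distance(stored, query), 6))  # 0.1
print(l2_distance(stored, query) < 0.2)      # True
```

With these values the stored row lies about 0.1 away from the query in L2 terms, so it passes the `< 0.2` filter.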

Add hnsw index
wd0517 marked this conversation as resolved.

```python
# vector index currently depends on tiflash
session.execute(text('ALTER TABLE sqlalchemy_documents SET TIFLASH REPLICA 1'))
Comment on lines +82 to +83
Member

@breezewish breezewish Oct 17, 2024

Can this be encapsulated? Users don't know what TiFlash is.

The explicit SET TIFLASH REPLICA in the syntax design is intentional. Without it, adding a vector index to a table that already holds data would implicitly set the replica count to 1 and could replicate a huge amount of data, which users would not expect. In this driver, however, the index can be added right after the table is created, so there is no such risk.

On the other hand, adding the index as part of table creation is allowed without explicitly setting a TiFlash replica factor, because there is no such risk:

CREATE TABLE t(
   vec VECTOR(3),
   VECTOR INDEX ((VEC_COSINE_DISTANCE(vec)))
)

index = Index(
'idx_embedding',
func.vec_cosine_distance(Document.embedding),
mysql_prefix="vector",
mysql_using="hnsw"
wd0517 marked this conversation as resolved.
)
index.create(engine)
Member

@breezewish breezewish Oct 17, 2024

In other vector databases, the index is generally specified when the table is created. I think that would be a better UX. Is there really a need to add the index after data is inserted?

Collaborator Author

SQLAlchemy creates the table in two steps: (1) it creates the table without indexes, then (2) it creates the indexes. Since the TiFlash replica must be set explicitly when a vector index is created separately, we cannot define the index in the table class.

Member

@wd0517 Thanks for the explanation. So it is not possible to execute multiple statements there, right? Let's raise this with the PM and discuss how it can be worked around.

```
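
One way to hide the TiFlash step from users, as asked in the review thread above, might be a small helper that produces both statements in the required order. The sketch below is hypothetical: `vector_index_statements` and its parameters are not part of `tidb_vector`. The SQL text follows the statements shown in this README diff, and the `vec_l2_distance` function name is an assumption that mirrors `vec_cosine_distance`.

```python
import re

# Only plain identifiers are accepted, because the names are interpolated into DDL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# "cosine" comes from the README diff; "l2" is an assumed counterpart.
_DISTANCE_FUNCS = {
    "cosine": "vec_cosine_distance",
    "l2": "vec_l2_distance",
}


def vector_index_statements(table, column, index_name, distance="cosine"):
    """Return the DDL needed to add an HNSW vector index to an existing table.

    The TiFlash replica statement comes first, since the vector index
    depends on a TiFlash replica being present.
    """
    for ident in (table, column, index_name):
        if not _IDENTIFIER.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    if distance not in _DISTANCE_FUNCS:
        raise ValueError(f"unsupported distance: {distance!r}")
    func = _DISTANCE_FUNCS[distance]
    return [
        f"ALTER TABLE {table} SET TIFLASH REPLICA 1",
        f"CREATE VECTOR INDEX {index_name} ON {table} "
        f"(({func}({column}))) USING HNSW",
    ]


for stmt in vector_index_statements(
    "sqlalchemy_documents", "embedding", "idx_embedding"
):
    print(stmt)
```

Each returned string could then be passed to `session.execute(text(stmt))`, the same call the README already uses for the `ALTER TABLE` statement.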

### Django
@@ -119,48 +127,50 @@ db = MySQLDatabase(
**connect_kwargs,
)

class TestModel(Model):
class Meta:
database = db
table_name = 'test'

class DocumentModel(Model):
embedding = VectorField(3)

# or add hnsw index when creating table
class TestModelWithIndex(Model):
class Meta:
database = db
table_name = 'test_with_index'

embedding = VectorField(3, constraints=[SQL("COMMENT 'hnsw(distance=l2)'")])

table_name = 'peewee_documents'

db.connect()
db.create_tables([TestModel, TestModelWithIndex])
db.create_tables([DocumentModel])
```

Insert vector data

```python
TestModel.create(embedding=[1, 2, 3])
DocumentModel.create(embedding=[1, 2, 3])
```

Get the nearest neighbors

```python
TestModel.select().order_by(TestModel.embedding.l2_distance([1, 2, 3.1])).limit(5)
DocumentModel.select().order_by(DocumentModel.embedding.l2_distance([1, 2, 3.1])).limit(5)
```

Get the distance

```python
TestModel.select(TestModel.embedding.cosine_distance([1, 2, 3.1]).alias('distance'))
DocumentModel.select(DocumentModel.embedding.cosine_distance([1, 2, 3.1]).alias('distance'))
```

Get within a certain distance

```python
TestModel.select().where(TestModel.embedding.l2_distance([1, 2, 3.1]) < 0.5)
DocumentModel.select().where(DocumentModel.embedding.l2_distance([1, 2, 3.1]) < 0.5)
```

Add hnsw index

```python
# vector index currently depends on tiflash
Member

This is by design and permanent. If the TiFlash component is not deployed, the corresponding operation should fail and ask the user to deploy TiFlash in order to use a vector index (just as a FULL TEXT INDEX fails in MySQL if the ENGINE is MEMORY).

db.execute_sql(SQL(
"ALTER TABLE peewee_documents SET TIFLASH REPLICA 1;"
))
DocumentModel.add_index(SQL(
"CREATE VECTOR INDEX idx_embedding ON peewee_documents ((vec_cosine_distance(embedding))) USING HNSW"
))
Member

@breezewish breezewish Oct 17, 2024

Can we avoid explicit SQL? We can follow what pgvector does: https://github.com/pgvector/pgvector-python?tab=readme-ov-file#peewee

IMO, now that we provide the same capability as pgvector (the index can be created at any time, either when the table is created or afterwards), we should provide the same interface and UX as pgvector's to minimize the user's learning cost.

Collaborator Author

@wd0517 wd0517 Oct 17, 2024

Peewee does not support the vector prefix when defining an index, so we still need to use raw SQL. https://github.com/coleifer/peewee/blob/master/peewee.py#L2964

Member

Nice digging! It does look very hard. When you are free, could you try something like our own derived Index? For example:

class VectorIndex(peewee.Index):
    ...

```

### TiDB Vector Client
2 changes: 1 addition & 1 deletion tox.ini
@@ -25,7 +25,7 @@ setenv =
skip_install = True
allowlist_externals = bash
deps =
flake8==6.0.0
flake8==7.1.1
black==23.7.0
commands =
bash -c "flake8 --max-line-length 130 tidb_vector tests"