update README: new vector index syntax #62

Open
wants to merge 6 commits into
base: main
5 changes: 5 additions & 0 deletions .github/workflows/ci.yml
@@ -16,6 +16,11 @@ jobs:
 - name: Checkout
   uses: actions/checkout@v3

+- name: Setup Python
+  uses: actions/setup-python@v4
+  with:
+    python-version: 3.12
+
 - name: Install dependencies
   run: |
     python -m pip install --upgrade pip
71 changes: 40 additions & 31 deletions README.md
@@ -2,7 +2,7 @@

 This is a Python client for TiDB Vector.

-> Now only TiDB Cloud Serverless cluster support vector data type, see this [docs](https://docs.pingcap.com/tidbcloud/vector-search-overview?utm_source=github&utm_medium=tidb-vector-python) for more information.
+Both TiDB Cloud Serverless ([doc](https://docs.pingcap.com/tidbcloud/vector-search-overview?utm_source=github&utm_medium=tidb-vector-python)) and TiDB Open Source Version (>= 8.4 DMR) support vector data type.

## Installation

@@ -42,44 +42,51 @@ from tidb_vector.sqlalchemy import VectorType
engine = create_engine('mysql://****.root:******@gateway01.xxxxxx.shared.aws.tidbcloud.com:4000/test')
Base = declarative_base()

-class Test(Base):
-    __tablename__ = 'test'
+class Document(Base):
+    __tablename__ = 'sqlalchemy_documents'
     id = Column(Integer, primary_key=True)
     embedding = Column(VectorType(3))

-# or add hnsw index when creating table
-class TestWithIndex(Base):
-    __tablename__ = 'test_with_index'
-    id = Column(Integer, primary_key=True)
-    embedding = Column(VectorType(3), comment="hnsw(distance=l2)")
-
Base.metadata.create_all(engine)
```

Insert vector data

```python
-test = Test(embedding=[1, 2, 3])
-session.add(test)
+doc = Document(embedding=[1, 2, 3])
+session.add(doc)
session.commit()
```

Get the nearest neighbors

```python
-session.scalars(select(Test).order_by(Test.embedding.l2_distance([1, 2, 3.1])).limit(5))
+session.scalars(select(Document).order_by(Document.embedding.l2_distance([1, 2, 3.1])).limit(5))
```

Get the distance

```python
-session.scalars(select(Test.embedding.l2_distance([1, 2, 3.1])))
+session.scalars(select(Document.embedding.l2_distance([1, 2, 3.1])))
```

Get within a certain distance

```python
-session.scalars(select(Test).filter(Test.embedding.l2_distance([1, 2, 3.1]) < 0.2))
+session.scalars(select(Document).filter(Document.embedding.l2_distance([1, 2, 3.1]) < 0.2))
```

+Add vector index to speed up query
+
+```python
+# vector index currently depends on tiflash
+session.execute(text('ALTER TABLE sqlalchemy_documents SET TIFLASH REPLICA 1'))
Comment on lines +82 to +83

@breezewish (Member), Oct 17, 2024

Can this be encapsulated? Users may not know what TiFlash is.

In the syntax design, an explicit SET TIFLASH REPLICA is intentional: otherwise, adding a vector index to an existing table that already holds data could implicitly set the replica count to 1 and replicate a huge amount of data, which the user would not expect. In this driver, however, the index can be added right after table creation, so there is no such risk.

Similarly, adding the index as part of CREATE TABLE is allowed without explicitly setting a TiFlash replica, because there is no such risk:

CREATE TABLE t(
   vec VECTOR(3),
   VECTOR INDEX ((VEC_COSINE_DISTANCE(vec)))
)

+index = Index(
+    'idx_embedding',
+    func.vec_cosine_distance(Document.embedding),
+    mysql_prefix="vector",
+)
+index.create(engine)
+```
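
The review comment above notes that declaring the index inside CREATE TABLE avoids the explicit TiFlash replica step. A minimal sketch of that idea follows. `create_table_ddl` and its parameters are hypothetical names used only for illustration (they are not part of tidb-vector-python), and the generated statement simply mirrors the SQL quoted in the comment; it has not been checked against a live cluster.

```python
import re

# Conservative pattern for table/column names. This is an assumption of the
# sketch, used to avoid interpolating arbitrary text into SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_table_ddl(table: str, column: str, dimensions: int) -> str:
    """Hypothetical helper: build a CREATE TABLE statement that declares
    the vector index inline, as in the review comment."""
    for name in (table, column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if dimensions <= 0:
        raise ValueError("dimensions must be positive")
    return (
        f"CREATE TABLE {table} (\n"
        f"    {column} VECTOR({dimensions}),\n"
        f"    VECTOR INDEX ((VEC_COSINE_DISTANCE({column})))\n"
        f")"
    )


# Reproduces the shape of the statement from the review comment.
print(create_table_ddl("t", "vec", 3))
```

Because the index is part of the table definition, no separate `ALTER TABLE ... SET TIFLASH REPLICA` call appears in this path, which is the point the reviewer makes.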

### Django
@@ -119,48 +126,50 @@ db = MySQLDatabase(
**connect_kwargs,
)

-class TestModel(Model):
-    class Meta:
-        database = db
-        table_name = 'test'
-
+class DocumentModel(Model):
     embedding = VectorField(3)
-
-# or add hnsw index when creating table
-class TestModelWithIndex(Model):
     class Meta:
         database = db
-        table_name = 'test_with_index'
-
-    embedding = VectorField(3, constraints=[SQL("COMMENT 'hnsw(distance=l2)'")])
-
+        table_name = 'peewee_documents'

 db.connect()
-db.create_tables([TestModel, TestModelWithIndex])
+db.create_tables([DocumentModel])
```

Insert vector data

```python
-TestModel.create(embedding=[1, 2, 3])
+DocumentModel.create(embedding=[1, 2, 3])
```

Get the nearest neighbors

```python
-TestModel.select().order_by(TestModel.embedding.l2_distance([1, 2, 3.1])).limit(5)
+DocumentModel.select().order_by(DocumentModel.embedding.l2_distance([1, 2, 3.1])).limit(5)
```

Get the distance

```python
-TestModel.select(TestModel.embedding.cosine_distance([1, 2, 3.1]).alias('distance'))
+DocumentModel.select(DocumentModel.embedding.cosine_distance([1, 2, 3.1]).alias('distance'))
```

Get within a certain distance

```python
-TestModel.select().where(TestModel.embedding.l2_distance([1, 2, 3.1]) < 0.5)
+DocumentModel.select().where(DocumentModel.embedding.l2_distance([1, 2, 3.1]) < 0.5)
```
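
To see why the sample thresholds match the inserted row, the distances can be worked out by hand. The sketch below assumes the usual definitions (Euclidean distance for `l2_distance`, one minus cosine similarity for `cosine_distance`); the two functions are local illustrations in plain Python, not the library's implementation, since the database computes these values server-side.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


stored = [1, 2, 3]     # the vector inserted in the examples
query = [1, 2, 3.1]    # the query vector used in the examples

print(f"l2 distance:     {l2_distance(stored, query):.4f}")
print(f"cosine distance: {cosine_distance(stored, query):.6f}")
```

The two vectors differ only in the last component, by 0.1, so the L2 distance is about 0.1 and the row passes both the `< 0.2` and `< 0.5` filters. The cosine distance is far smaller because the vectors point in almost the same direction.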

+Add vector index to speed up query
+
+```python
+# vector index currently depends on tiflash
Review comment (Member):

This is by design and permanent. If the TiFlash component is not deployed, the corresponding operation should fail and ask the user to deploy TiFlash in order to use a vector index (just as a FULLTEXT index fails in MySQL if the ENGINE is MEMORY).

+db.execute_sql(SQL(
+    "ALTER TABLE peewee_documents SET TIFLASH REPLICA 1;"
+))
+DocumentModel.add_index(SQL(
+    "CREATE VECTOR INDEX idx_embedding ON peewee_documents ((vec_cosine_distance(embedding)))"
+))
+```
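
Per the peewee documentation, `Database.execute_sql` takes a SQL string, and `Model.add_index` only records the index for a later `create_tables` call. For a table that already exists it may therefore be simpler to run both statements directly. A hedged sketch of that approach: `add_vector_index` is a hypothetical helper (not part of peewee or tidb-vector-python) that accepts any callable taking one SQL string, and it is demonstrated here with a recording stub rather than a live database.

```python
from typing import Callable, List


def add_vector_index(
    execute: Callable[[str], object],
    table: str,
    column: str,
    index_name: str = "idx_embedding",
) -> None:
    """Hypothetical helper: run the two statements from the README diff.

    `execute` is any callable that accepts one SQL string, for example
    peewee's `db.execute_sql` (an assumption; not exercised here).
    Identifiers are interpolated as-is, so pass trusted names only.
    """
    # The vector index depends on a TiFlash replica of the table.
    execute(f"ALTER TABLE {table} SET TIFLASH REPLICA 1")
    execute(
        f"CREATE VECTOR INDEX {index_name} ON {table} "
        f"((vec_cosine_distance({column})))"
    )


# Demonstration with a recording stub instead of a live database.
executed: List[str] = []
add_vector_index(executed.append, "peewee_documents", "embedding")
for statement in executed:
    print(statement)
```

With a live connection the call would presumably be `add_vector_index(db.execute_sql, "peewee_documents", "embedding")`. If TiFlash is not deployed, the first statement is expected to fail, as the review comment describes.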

### TiDB Vector Client
2 changes: 1 addition & 1 deletion tox.ini
@@ -25,7 +25,7 @@ setenv =
 skip_install = True
 allowlist_externals = bash
 deps =
-    flake8==6.0.0
+    flake8==7.1.1
     black==23.7.0
 commands =
     bash -c "flake8 --max-line-length 130 tidb_vector tests"