Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue: ANN (Approximate Nearest Neighbors) Index #25

Open
asg017 opened this issue Jun 21, 2024 · 14 comments
Open

Tracking issue: ANN (Approximate Nearest Neighbors) Index #25

asg017 opened this issue Jun 21, 2024 · 14 comments

Comments

@asg017
Copy link
Owner

asg017 commented Jun 21, 2024

sqlite-vec as of v0.1.0 will be brute-force search only, which slows down on large datasets (>1M w/ large dimensions). I want to include some form of approximate nearest neighbors search before v1, which trades accuracy/resource usage for speed.

This issue is a general "tracking issue" for how ANN will be implemented in sqlite-vec. The open questions I have:

Which ANN index should we use?

We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.

The main options I see:

  • IVF: Like Faiss IndexIVF, pre-compute centroids of a training dataset, search nearest N centroids, etc.
  • HNSW: Should "scale" more, way more params to tune. Would be pretty complicated to implement
  • DiskANN: Kinda dig the simplicitly of this, especially LM-DiskANN

Unsure which one will turn out best, will need to reseach more. It's possible we add support for all these options.

How should one "declare" an index?

SQLite doesn't have custom indexes, so I think the best way would be to include index info in the CREATE VIRTUAL TABLE constructor. Like:

create virtual table vec_movies(
  synopsis_embeddings float[768] INDEXED BY diskann(...)
);

or:

create virtual table vec_movies(
  synopsis_embeddings float[768] index=hnsw(...)
);

syntax heavily depends what ANN index we pick. Also how would training work?

How would they work with metadata filtering?

How do we allow bruteforce + ANN on the same table?

How do we pick between KNN/ANN in a SQL query?

@asg017 asg017 pinned this issue Jun 23, 2024
@neilkumar
Copy link

Would ustream work?

https://github.com/unum-cloud/usearch

They even have some sqlite stuff already

https://github.com/unum-cloud/usearch/blob/main/sqlite/README.md

@asg017
Copy link
Owner Author

asg017 commented Jun 25, 2024

usearch is great! But they don't offer many "hooks" to their storage engine, which would be required for sqlite-vec. We'd want to store the index inside SQLite tables, and balance query time + random lookup times. Also the usearch SQLite functions are just scalar functions, nothing that accesses the HNSW index

Also I want to keep sqlite-vec as lightweight as possible, there's no outside dependencies and is a single .c/.h file. So I don't wanna complicate things with a C++ dependency

@irowberryFS
Copy link

+1 for LM-DiskANN

@di-sukharev
Copy link

di-sukharev commented Sep 2, 2024

yoo bros, lets do this index okay :))

@Boscop
Copy link

Boscop commented Sep 23, 2024

@irowberryFS Interestingly, libSQL uses LM-DiskANN for the vector index:
https://turso.tech/blog/approximate-nearest-neighbor-search-with-diskann-in-libsql

@asg017
Copy link
Owner Author

asg017 commented Oct 8, 2024

For those of you who want this, I'd love to hear your thoughts on these questions:

  • When does the vec0 brute-force approach start to break down for you? How many vectors are you storing and how many dimensions, and how long does a KNN search take?
  • Have you tried scalar/binary quantization to reduce search times? Is the quality reduction good-enough for your use-case?
  • Would "fast inserts" be required in your applications? Some ANN indexes (ex IVF) require a "training process" before inserting real vectors that can take several seconds or minutes, and other ANN indexes can insert in "real-time" (ex DiskANN) but are individually slow as they require a graph traversal. Would your setup prefer one approach over the other?

@imotai
Copy link

imotai commented Oct 9, 2024

+1 for LM-DiskANN

@lsorber
Copy link

lsorber commented Oct 9, 2024

When does the vec0 brute-force approach start to break down for you? How many vectors are you storing and how many dimensions, and how long does a KNN search take?

Short answer

Brute force scales up to 8000 500-word pages or 800 10-page articles on commodity hardware.

Longer answer

Assuming retrieval should take no longer than 100ms, the brute force approach can handle about 250k embeddings on commodity hardware:

import numpy as np

N = 250_000  # Number of embeddings
d = 1024  # The 'median' embedding dimension [1]

embeddings = np.random.randn(N, d).astype(np.float32)
query = np.random.randn(d).astype(np.float32)

%timeit embeddings @ query  # Core of the brute force approach

# Output on a Google Colab instance:
# 85 ms ± 9.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And assuming you store one embedding per sentence (which would be standard in multi-vector retrieval), that's the equivalent of about 8000 500-word pages, or 800 10-page articles, or 40 200-page books.

Have you tried scalar/binary quantization to reduce search times?

I see binary embeddings as an optimisation, not a substitute for floating point embeddings. This optimisation could help to scale brute force with another factor of 2 without sacrificing retrieval quality [2], but definitely not an order of magnitude. For that, we'd need an ANN index.

Would "fast inserts" be required in your applications?

I'd be happy to sacrifice some insert speed in favour of sub-100ms near-100% recall retrieval for 1–10M+ embedding datasets.

[1] https://huggingface.co/spaces/mteb/leaderboard
[2] https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors

@conradev
Copy link

conradev commented Oct 31, 2024

I'd like to throw my hat into the ring for LM-DiskANN.

Brute force search works great in LanceDB because it uses a contiguous columnar format. In SQLite there is an upper bound to performance because the data is inherently fragmented. By using DiskANN, you can actually leverage this fragmentation and use the B-tree to do the lookups for nearest neighbors. It seems uniquely suited to SQLite's storage engine.

Further, If you store vectors and edges in separate columns, you could use SQLite to find nodes without edges computed and then integrate them into the graph incrementally. In my understanding, this was one of the "flaws" of sqlite-vss: the FAISS index is one big blob and SQLite is not designed to store one big blob, hence the option for faiss_ondisk.

You could even use libSQL's implementation as a starting point, you'd just need to un-libSQL-ify it.

@conradev
Copy link

In addition to Turso using DiskANN, it looks like Microsoft is already using this architecture in production in their Recall product: https://x.com/simonw/status/1798383577421525276

They even store vectors and edges in separate columns, like I suggested!

@asg017
Copy link
Owner Author

asg017 commented Oct 31, 2024

@conradev agreed, I think DiskANN will eventually be the way to go!

I'm working on metadata filtering right now (#124 ), aiming to get it out in the next 2-3 weeks. After that's done, I'll be focusing on bring ANN algorithms to sqlite-vec.

Here's my rough plan for this:

  1. Start with an IVF+kmeans index, because (after training) inserts and KNN queries are pretty quick, and should help scale sqlite-vec to at least ~1 million vectors
  2. Add a proper DiskANN index, which won't require kmeans training

I'm wary of DiskANN because, in my experience, writes can be pretty slow due to the internal pruning process. I've tried libsql's implementations and building the initial index was pretty slow, even compared to vec0 virtual tables (which are slow compared to regular tables). Also DiskANN implementations can be pretty complex, while IVF will likely be straightforward.

I can imagine a future world where sqlite-vec has other ANN index attempts (HNSW, SPANN, etc), both those rely pretty heavily on in-memory data structures and like won't map well to SQLite shadow tables. And they're super complex!

In terms of timeline, metadata filters will come out mid-November, so I expect to work on ANN indexes November=December. Hopefully I can get some beta builds out during then, but I won't expect fully-supported ANN inside sqlite-vec until December/January.

In the meantime - if anyone else wants to share benchmark numbers of any papers/implementations of DiskANN or other ANN indexes, feel free to share in this thread!

@jlarmstrongiv
Copy link

Recent discussion on SPANN https://news.ycombinator.com/item?id=42028873

@SeanPedersen
Copy link

This is the go to repo for ANN benchmarks: https://github.com/erikbern/ann-benchmarks

@andreasmhahn
Copy link

+1 for LM-DiskANN

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests