-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracking issue: ANN (Approximate Nearest Neighbors) Index #25
Comments
Would ustream work? https://github.com/unum-cloud/usearch They even have some sqlite stuff already https://github.com/unum-cloud/usearch/blob/main/sqlite/README.md |
usearch is great! But they don't offer many "hooks" to their storage engine, which would be required for sqlite-vec. We'd want to store the index inside SQLite tables, and balance query time + random lookup times. Also the usearch SQLite functions are just scalar functions, nothing that accesses the HNSW index Also I want to keep sqlite-vec as lightweight as possible, there's no outside dependencies and is a single |
+1 for LM-DiskANN |
yoo bros, lets do this index okay :)) |
@irowberryFS Interestingly, libSQL uses LM-DiskANN for the vector index: |
For those of you who want this, I'd love to hear your thoughts on these questions:
|
+1 for LM-DiskANN |
Short answerBrute force scales up to 8000 500-word pages or 800 10-page articles on commodity hardware. Longer answerAssuming retrieval should take no longer than 100ms, the brute force approach can handle about 250k embeddings on commodity hardware: import numpy as np
N = 250_000 # Number of embeddings
d = 1024 # The 'median' embedding dimension [1]
embeddings = np.random.randn(N, d).astype(np.float32)
query = np.random.randn(d).astype(np.float32)
%timeit embeddings @ query # Core of the brute force approach
# Output on a Google Colab instance:
# 85 ms ± 9.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) And assuming you store one embedding per sentence (which would be standard in multi-vector retrieval), that's the equivalent of about 8000 500-word pages, or 800 10-page articles, or 40 200-page books.
I see binary embeddings as an optimisation, not a substitute for floating point embeddings. This optimisation could help to scale brute force with another factor of 2 without sacrificing retrieval quality [2], but definitely not an order of magnitude. For that, we'd need an ANN index.
I'd be happy to sacrifice some insert speed in favour of sub-100ms near-100% recall retrieval for 1–10M+ embedding datasets. [1] https://huggingface.co/spaces/mteb/leaderboard |
I'd like to throw my hat into the ring for LM-DiskANN. Brute force search works great in LanceDB because it uses a contiguous columnar format. In SQLite there is an upper bound to performance because the data is inherently fragmented. By using DiskANN, you can actually leverage this fragmentation and use the B-tree to do the lookups for nearest neighbors. It seems uniquely suited to SQLite's storage engine. Further, If you store vectors and edges in separate columns, you could use SQLite to find nodes without edges computed and then integrate them into the graph incrementally. In my understanding, this was one of the "flaws" of You could even use libSQL's implementation as a starting point, you'd just need to un-libSQL-ify it. |
In addition to Turso using DiskANN, it looks like Microsoft is already using this architecture in production in their Recall product: https://x.com/simonw/status/1798383577421525276 They even store vectors and edges in separate columns, like I suggested! |
@conradev agreed, I think DiskANN will eventually be the way to go! I'm working on metadata filtering right now (#124 ), aiming to get it out in the next 2-3 weeks. After that's done, I'll be focusing on bring ANN algorithms to Here's my rough plan for this:
I'm wary of DiskANN because, in my experience, writes can be pretty slow due to the internal pruning process. I've tried libsql's implementations and building the initial index was pretty slow, even compared to vec0 virtual tables (which are slow compared to regular tables). Also DiskANN implementations can be pretty complex, while IVF will likely be straightforward. I can imagine a future world where In terms of timeline, metadata filters will come out mid-November, so I expect to work on ANN indexes November=December. Hopefully I can get some beta builds out during then, but I won't expect fully-supported ANN inside In the meantime - if anyone else wants to share benchmark numbers of any papers/implementations of DiskANN or other ANN indexes, feel free to share in this thread! |
Recent discussion on SPANN https://news.ycombinator.com/item?id=42028873 |
This is the go to repo for ANN benchmarks: https://github.com/erikbern/ann-benchmarks |
+1 for LM-DiskANN |
sqlite-vec
as ofv0.1.0
will be brute-force search only, which slows down on large datasets (>1M w/ large dimensions). I want to include some form of approximate nearest neighbors search beforev1
, which trades accuracy/resource usage for speed.This issue is a general "tracking issue" for how ANN will be implemented in
sqlite-vec
. The open questions I have:Which ANN index should we use?
We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.
The main options I see:
Unsure which one will turn out best, will need to reseach more. It's possible we add support for all these options.
How should one "declare" an index?
SQLite doesn't have custom indexes, so I think the best way would be to include index info in the
CREATE VIRTUAL TABLE
constructor. Like:create virtual table vec_movies( synopsis_embeddings float[768] INDEXED BY diskann(...) );
or:
syntax heavily depends what ANN index we pick. Also how would training work?
How would they work with metadata filtering?
How do we allow bruteforce + ANN on the same table?
How do we pick between KNN/ANN in a SQL query?
The text was updated successfully, but these errors were encountered: