Investigate various implementations of ann search for vector fields #42326
Pinging @elastic/es-search
You should also consider Covering-LSH: with some recent optimizations, it appears to solve the false-negative problem common to LSH approaches. https://arxiv.org/abs/1507.03225 There are some early C++ implementations already available; however, these do not appear to include the Hadamard code construction from the later revisions of the paper, which offers performance improvements.
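For context, here is a minimal sketch of plain random-hyperplane LSH for cosine similarity in Python (not the covering construction from the paper, which adds the false-negative guarantees); the dimensions, bit count, and data are illustrative assumptions:

```python
# Minimal random-hyperplane LSH sketch (plain LSH, not Covering-LSH).
# Vectors that hash to the same bucket are likely close in cosine distance.
import numpy as np

rng = np.random.default_rng(0)
DIMS, N_BITS = 5, 8
planes = rng.normal(size=(N_BITS, DIMS))  # random hyperplanes

def lsh_hash(vec):
    # One bit per hyperplane: which side of the plane the vector lies on.
    return tuple((planes @ vec) > 0)

# Index: group document vectors by their hash bucket.
buckets = {}
for doc_id, vec in enumerate(rng.normal(size=(1000, DIMS))):
    buckets.setdefault(lsh_hash(vec), []).append(doc_id)

# Query: candidates are whatever shares the query's bucket; a true
# neighbour can be missed, which is the false-negative problem that
# Covering-LSH addresses.
query = rng.normal(size=DIMS)
candidates = buckets.get(lsh_hash(query), [])
```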
```
PUT my_index
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "my_vector": {
        "type": "dense_vector_lsh",
        "dims": 5,
        "l": 8,
        "k": 5,
        "w": 3
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_vector": [-0.7862, -0.3771, 0.6328, -0.4836, -0.2992]
}

GET my_index/_search?_source=false
{
  "query": {
    "ann": {
      "field": "my_vector",
      "number_of_probes": 3,
      "query_vector": [-0.6, 0, 0.6, -0.4, 0]
    }
  }
}
```

Relates to elastic#42326
@mayya-sharipova are you interested in using existing commodity approximate-nearest-neighbor implementations, or in implementing an algorithm from scratch? Annoy is a simple-to-use, simple-to-tune library created by Spotify to perform ANN search from an index stored on disk. It indexes vectors by integer IDs, which I think you also do behind the scenes. It looks like there are Java bindings too, so it could fit well with your stack: https://github.com/spotify/annoy-java
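For anyone evaluating it, a minimal sketch of the Annoy workflow in Python (the Java bindings follow the same pattern; the file name and data here are made up):

```python
# Build an on-disk Annoy index, then query it for approximate neighbours.
import random
from annoy import AnnoyIndex

DIMS = 5
index = AnnoyIndex(DIMS, "angular")  # "angular" approximates cosine distance

for doc_id in range(1000):
    index.add_item(doc_id, [random.gauss(0, 1) for _ in range(DIMS)])

index.build(10)            # 10 trees: more trees, better recall, bigger index
index.save("vectors.ann")  # memory-mapped from disk at query time

query = [random.gauss(0, 1) for _ in range(DIMS)]
neighbours = index.get_nns_by_vector(query, 10)  # top-10 approximate matches
```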
@markhuds Thanks for your suggestion. We are currently evaluating different ANN approaches based on performance, but also on how well they fit into our Elasticsearch/Lucene infrastructure. We have considered Spotify Annoy's approach as well; we still need to make a decision. We are interested in your use case for ANN search: for what use case have you found Annoy compelling?
@mayya-sharipova Isn't there an issue with using k-means as a possible solution, in that you must also continually solve the 'optimal cluster count' problem as the index grows? K-means could make sense if the domain were well known and considerable historical data existed to create meaningful centroids and to inform the cluster-count decision. K-means also assumes that you don't expect vector attributes to vary much from the current state, doesn't it?
Indeed, you are right: the sample on which k-means centroids are computed should be representative enough even as the index grows. It seems to me that for all ANN algorithms the sample on which you build a model should be representative, and if your new data drift from the current state, then the model has to be rebuilt. We still need to figure out how to organize this in Elasticsearch.
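To make the centroid discussion concrete, here is a hypothetical sketch of the probing step (illustrative Python, not Elasticsearch internals): the query is compared against all k-means centroids, and only the closest number_of_probes clusters are scanned exhaustively.

```python
# Hypothetical centroid-probing (IVF-style) ANN search.
import numpy as np

def ann_search(query, centroids, clusters, n_probes=3, k=10):
    """clusters maps a centroid index to a list of (doc_id, vector) pairs."""
    # Rank centroids by distance to the query; keep the n_probes closest.
    probe_ids = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probes]
    # Exhaustively score only the points inside the probed clusters.
    scored = [(np.linalg.norm(vec - query), doc_id)
              for c in probe_ids
              for doc_id, vec in clusters[c]]
    return sorted(scored)[:k]  # the k nearest candidates found
```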
@mayya-sharipova We intend to use output from a deep ranking model that embeds products and queries into a retrieval space such that a nearest-neighbor search with a query vector retrieves optimal orderings of product vectors. This will allow us to integrate machine-learning techniques to improve overall relevance while maintaining the rich filtering, aggregating, and sorting capabilities we currently enjoy from Elasticsearch.
Procedure to run:

1. Use an external scipy/scikit-learn k-means algorithm on your data.
2. Compute centroids and save them into a file "centroids.txt" as a numpy array in binary form (use the numpy array "tofile" function).
3. Also compute labels for all points (which centroid each belongs to) and save them into a file.
4. Put the centroids.txt file inside elasticsearch/x-pack/plugin/vectors/src/main/resources, build Elasticsearch from it, and use this build for testing.
5. From any client, create the index:

```
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 128
      }
    }
  }
}
```

6. Index your vectors from any client, using the labels file computed in step 3:

```
{
  "vector": {
    "centroid": 39,
    "value": [0.12, 22.01, ...]
  }
}
```

7. Find the closest points with an ann query:

```
{
  "query": {
    "script_score": {
      "query": {
        "ann": {
          "field": "vector",
          "number_of_probes": 3,
          "query_vector": [3.4, 10.12, ...]
        }
      },
      "script": {
        "source": "1 / (1 + l2norm(params.query_vector, doc['vector']))",
        "params": {
          "query_vector": [3.4, 10.12, ...]
        }
      }
    }
  }
}
```

Relates to elastic#42326
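A minimal sketch of steps 1-3 in Python, assuming scikit-learn and numpy (the cluster count, dtype, and stand-in data are illustrative assumptions):

```python
# Steps 1-3: cluster the vectors, dump centroids in binary form, dump labels.
import numpy as np
from sklearn.cluster import KMeans

vectors = np.random.rand(10_000, 128).astype(np.float32)  # stand-in data

# Step 1: run k-means externally.
kmeans = KMeans(n_clusters=100, random_state=0).fit(vectors)

# Step 2: save the centroids as a numpy array in binary form via tofile.
kmeans.cluster_centers_.astype(np.float32).tofile("centroids.txt")

# Step 3: save the centroid label of every point.
np.savetxt("labels.txt", kmeans.labels_, fmt="%d")
```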
@mayya-sharipova I'd like to use dense vectors of floats (dimensions of 1x2048, the last layer of ResNet50) for visual search.
@RyanZotti Thanks for your use case. From 7.6 it will be possible to create vectors of 2048 dims (currently they are restricted to 1024 dims).
When will this be available in the enterprise edition? This is a critical feature for modern search.
@mayya-sharipova Check this out: https://github.com/facebookresearch/faiss
It would be nice to use some of these suggested distance metrics in queries that reference an already-indexed document's vector. This would eliminate a round-trip network request (fetching the source document's vector just to send it right back as part of a new query).
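To illustrate the round trip in question, a sketch with the official elasticsearch Python client, assuming a 7.6+ dense_vector field named my_vector and a hypothetical index and document ID (this is the workaround the comment wants to avoid):

```python
# The round trip: fetch a stored vector only to send it straight back.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Network hop 1: retrieve the source document's vector.
query_vector = es.get(index="my_index", id="42")["_source"]["my_vector"]

# Network hop 2: send the same vector back inside a script_score query.
resp = es.search(index="my_index", body={
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    }
})
```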
@dpmccabe there's also the general case of creating a document embedding based on your entire corpus of documents, encoding each document in that embedding, and then encoding a search query into that vector space and searching by similarity. It's a general mechanism for search that can work better than, or in combination with, Elasticsearch's existing methods. It is surprising this doesn't already exist!
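A minimal sketch of that pattern with scikit-learn, where LSA embeddings stand in for whatever learned document embedding you would actually use; the corpus and parameters are made up:

```python
# Embed a corpus, embed the query into the same space, search by similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

docs = ["how to cook rice", "ranking with neural nets", "vector search in lucene"]

# Fit a corpus-level embedding (TF-IDF followed by SVD, i.e. LSA).
tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(tfidf.fit_transform(docs))

# Index the document vectors for cosine-similarity lookup.
index = NearestNeighbors(metric="cosine").fit(doc_vecs)

# Encode a query into the same vector space and retrieve by similarity.
query_vec = svd.transform(tfidf.transform(["cooking rice"]))
distances, indices = index.kneighbors(query_vec, n_neighbors=2)
print(docs[indices[0][0]])  # expected: "how to cook rice"
```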
@mayya-sharipova it's great to see ANN in Elasticsearch being worked on. Embedding-driven search is something we use here at Cookpad. A scalable solution to nearest-vector search within Elasticsearch would be very useful. Much of the rest of our search stack is Elasticsearch, so moving ANN into Elasticsearch is more attractive than monolithic ANN systems (e.g., Faiss, Annoy, etc.). In terms of scale, we're looking to score over a corpus on the order of 10^6 to 10^7 vectors, and the corpus will continue to grow quickly. Metrics-wise, cosine distance is suitable for our current applications of ANN. The primary content so far is text and images – no surprise that one of our key data types here is recipes! Text and image are jointly combined into a single embedding per data item. The upper limit of 1024 to 2048 dimensions will be fine for now. Other features that are interesting:
Very excited about ANN search coming to Elastic. As @rjurney mentioned, it's a cornerstone of modern search. And outside of the commercial space, I've met research scientists and academics who'd love to drop their embeddings into Elasticsearch – rather than Faiss or Annoy – to get the combined benefit of scalable ANN + simple and flexible querying + a route to multi-node parallelism if needed. They'd love it if this were OSS rather than licensed.
@mayya-sharipova great to see this reopened. There are a heap of application domains where our clients would like to make use of ANN for doc retrieval:
A bit hard to tell, but are things stalled around https://issues.apache.org/jira/browse/LUCENE-9322?
@BountyCountry Thanks for sharing your applications.
This indeed may progress slower. Since vectors would be a new format, it would be nice to get approval from the community to make sure we cover broad enough cases.
@mayya-sharipova - thanks Mayya! Yes, we are relatively agnostic to formats, although it would be nice to have both dense and sparse vector queries available. This repo looks pretty on the money! Have you played with it at all? http://elastiknn.klibisz.com/api/#nearest-neighbor-queries
LMK if you all have questions or would like to contribute to elastiknn. There's also a rough roadmap in the wiki.
This has been open for more than a year. Is this feature dead? Are any resources being put into building this or is it still in the investigative stage? Is this at all important to Elastic? This is a basic feature of modern search and Elastic lacks it. @mayya-sharipova Is that going to change in the next six months? Year? Two years? Five years? What is the timeline?
+1 @rjurney
@rjurney @joseph-macraty, and others: I don't mean to hijack the thread, but I have made a lot of progress on an implementation of KNN/ANN for Elasticsearch here: https://github.com/alexklibisz/elastiknn
Right now my main focus is benchmarking, specifically integrating it with ann-benchmarks. This is all a side/hobby project for me, so I appreciate any help. A huge help right now would be having people just try it out and provide issues/feedback.
At this point it is crystal clear that this feature will not be solved by Elastic HQ; it will be solved by the community, who will advertise that fact here. Then the winning package will be merged into Elastic.
@alexklibisz we are successfully using your plugin on 7.6.2 with nearest-neighbour queries (angular) against dense vectors. Everything worked out of the box and was snappy, so a huge thank you for your work. @mayya-sharipova - this one needs backing :)
@mayya-sharipova +1 to @BountyCountry's comment.
Adding my few cents here. The common problem with most vector-store libs is that they don't support updating or deleting an element in an existing index. Integrating standard vector-search libs into Lucene or ES seems like a good idea, since we can make use of functionality like merge policies, the live-docs bitset, and post-processing of results. Though Milvus is built on top of various vector libs, it handles all these fundamental problems on its own. People who are looking for KNN support on ES can use either elastiknn or opendistro-KNN.
I think both of these are the right places to start, unless you're waiting for ES to have a native ANN feature.
@raman-rajarathinam I have some initial benchmarking results here: http://elastiknn.klibisz.com/performance/. I started working on Elastiknn before opendistro k-nn was released. I haven't measured it exactly, but I wouldn't be surprised if opendistro k-nn is faster than Elastiknn, since it's using a C library under the hood. However, introducing a subprocess with non-trivial memory usage comes with some deployment tradeoffs, like you mentioned.
Hi @mayya-sharipova. I see that both https://issues.apache.org/jira/browse/LUCENE-9004 and https://issues.apache.org/jira/browse/LUCENE-9322 have been resolved. Does this mean that ANN functionality will be exposed any time soon?
@mayya-sharipova can you please update this ticket? It seems about a million people need this feature and Lucene has the capability. Where is it? :)
HNSW has been implemented in preparation for Lucene 9.0, as mentioned above. A prerequisite to Elasticsearch using that implementation is upgrading to Lucene 9.0 once it's released. The first steps of the upgrade are a work in progress, and you can follow #73324 for more details on that process. To be clear, we don't plan on including Lucene 9.0 in Elasticsearch 7.x, but rather starting in 8.0. Once Lucene 9.0 is released and integrated into Elasticsearch 8.0, we will be in a position to create a good integration and experience for building and searching ANN indices. One of the reasons this takes us some time is to ensure that the performance, feature set, and usability are on par with other Elasticsearch features. When there are more concrete details to share about ANN, we will certainly do so. We fully acknowledge the demand for this feature!
A progress update -- we have been collaborating on some Lucene changes to fill in missing features and improve performance:
We're now thinking through how ANN will fit into the Elasticsearch API. I plan to open a "meta issue" soon with a list of the tasks needed to complete the initial integration.
Is there any update on this? Could you also provide a rough timeline for the integration?
Posting a friendly annual reminder that the KNN/ANN feature is already implemented in https://github.com/alexklibisz/elastiknn. :) I'd encourage folks to try out the Elastiknn plugin while waiting on Elastic to integrate Lucene's ANN implementation. Speaking from experience, there are a number of non-trivial problems to solve to ensure a reliable, scalable implementation of ANN. AFAIK, Elastic will be using the HNSW-based ANN implementation from Lucene; Elastiknn uses another approach called LSH (locality-sensitive hashing). They have some interesting tradeoffs, which are described at a high level in Pinecone's blog post: https://www.pinecone.io/learn/vector-indexes/. For a deeper dive on LSH, I wrote this post: https://elastiknn.com/posts/tour-de-elastiknn-august-2021/ I'm sure Elastic can eventually get this integrated, but if you're excited to try out ANN for a proof of concept or just to learn about KNN/ANN, Elastiknn is ready to try, and I'd be happy to hear any feedback. Cheers
Here is another Elasticsearch kNN plugin. It allows pre-filtering for multimodal search and scales to billions of documents with ANN. https://www.gsitechnology.com/sites/default/files/AppNotes/GSIT-Elasticsearch-Plugin-AppBrief.pdf
I'm sorry for the delay; it took some time to find consensus on a plan. I opened an issue here to track the implementation: #78473. Since this feature will be based on Lucene's new ANN support (which is shipping in its upcoming 9.0 release), it only targets Elasticsearch 8.x. We don't give exact release dates, but for context, we are hard at work on both ANN and the 8.0 release.
I'm going to close this now that we have an implementation plan here: #78473. Thank you so much for the insights and feedback on this issue (and your incredible patience)!
ANN (approximate nearest neighbours) will be a licensed feature of Elasticsearch (not OSS).
We plan to implement prototypes of various ANN algorithms for different distance metrics:
We are interested in users' feedback about:
We have decided to adopt the Lucene implementation of ANN search, so development of ANN search has moved here. Relevant Lucene issues: https://issues.apache.org/jira/browse/LUCENE-9004, https://issues.apache.org/jira/browse/LUCENE-9322, https://issues.apache.org/jira/browse/LUCENE-9136