Investigate various implementations of ann search for vector fields #42326
Pinging @elastic/es-search
You should also consider Covering-LSH: with some recent optimizations, it appears to solve the false-negative problem common to LSH approaches. https://arxiv.org/abs/1507.03225 There are some early C++ implementations already available; however, these do not appear to include the Hadamard code construction from the later revisions of the paper, which offers performance improvements.
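For context, here is a minimal sketch of plain random-hyperplane LSH for cosine similarity in Python (not the covering construction from the paper, which adds the false-negative guarantees); the dimensions, bit count, and data are illustrative assumptions:

```python
# Minimal random-hyperplane LSH sketch (plain LSH, not Covering-LSH).
# Vectors that hash to the same bucket are likely close in cosine distance.
import numpy as np

rng = np.random.default_rng(0)
DIMS, N_BITS = 5, 8
planes = rng.normal(size=(N_BITS, DIMS))  # random hyperplanes

def lsh_hash(vec):
    # One bit per hyperplane: which side of the plane the vector lies on.
    return tuple((planes @ vec) > 0)

# Index: group document vectors by their hash bucket.
buckets = {}
for doc_id, vec in enumerate(rng.normal(size=(1000, DIMS))):
    buckets.setdefault(lsh_hash(vec), []).append(doc_id)

# Query: candidates are whatever shares the query's bucket; a true
# neighbour can be missed, which is the false-negative problem that
# Covering-LSH addresses.
query = rng.normal(size=DIMS)
candidates = buckets.get(lsh_hash(query), [])
```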
```
PUT my_index
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "my_vector": {
        "type": "dense_vector_lsh",
        "dims": 5,
        "l": 8,
        "k": 5,
        "w": 3
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_vector": [-0.7862, -0.3771, 0.6328, -0.4836, -0.2992]
}

GET my_index/_search?_source=false
{
  "query": {
    "ann": {
      "field": "my_vector",
      "number_of_probes": 3,
      "query_vector": [-0.6, 0, 0.6, -0.4, 0]
    }
  }
}
```

Relates to elastic#42326
@mayya-sharipova are you interested in using existing commodity approximate-nearest-neighbor implementations, or in implementing an algorithm from scratch? Annoy is a simple-to-use, simple-to-tune library created by Spotify to perform ANN search from an index stored on disk. It indexes vectors by integer IDs, which I think you also do behind the scenes. It looks like there are Java bindings too, so it could fit well with your stack: https://github.com/spotify/annoy-java
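For anyone evaluating it, a minimal sketch of the Annoy workflow in Python (the Java bindings follow the same pattern; the file name and data here are made up):

```python
# Build an on-disk Annoy index, then query it for approximate neighbours.
import random
from annoy import AnnoyIndex

DIMS = 5
index = AnnoyIndex(DIMS, "angular")  # "angular" approximates cosine distance

for doc_id in range(1000):
    index.add_item(doc_id, [random.gauss(0, 1) for _ in range(DIMS)])

index.build(10)            # 10 trees: more trees, better recall, bigger index
index.save("vectors.ann")  # memory-mapped from disk at query time

query = [random.gauss(0, 1) for _ in range(DIMS)]
neighbours = index.get_nns_by_vector(query, 10)  # top-10 approximate matches
```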
@markhuds Thanks for your suggestion. We are currently evaluating different ANN approaches based on performance, but also on how well they fit into our Elasticsearch/Lucene infrastructure. We have considered Spotify Annoy's approach as well; we still need to make a decision. We are interested in your use case for ANN search: for what use case have you found Annoy compelling?
@mayya-sharipova Isn't there an issue with using k-means as a possible solution, in that you must also continually solve the 'optimal cluster count' problem as the index grows? K-means could make sense if the domain were well known and considerable historical data existed to create meaningful centroids and to inform the cluster-count decision. K-means also assumes that you don't expect vector attributes to vary much from the current state, doesn't it?
Indeed, you are right: the sample on which k-means centroids are computed should be representative enough even as the index grows. It seems to me that for all ANN algorithms the sample on which you build a model should be representative, and if your new data drift from the current state, then the model has to be rebuilt. We still need to figure out how to organize this in Elasticsearch.
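To make the centroid discussion concrete, here is a hypothetical sketch of the probing step (illustrative Python, not Elasticsearch internals): the query is compared against all k-means centroids, and only the closest number_of_probes clusters are scanned exhaustively.

```python
# Hypothetical centroid-probing (IVF-style) ANN search.
import numpy as np

def ann_search(query, centroids, clusters, n_probes=3, k=10):
    """clusters maps a centroid index to a list of (doc_id, vector) pairs."""
    # Rank centroids by distance to the query; keep the n_probes closest.
    probe_ids = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probes]
    # Exhaustively score only the points inside the probed clusters.
    scored = [(np.linalg.norm(vec - query), doc_id)
              for c in probe_ids
              for doc_id, vec in clusters[c]]
    return sorted(scored)[:k]  # the k nearest candidates found
```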
@mayya-sharipova We intend to use output from a deep ranking model that embeds products and queries into a retrieval space such that a nearest-neighbor search with a query vector retrieves optimal orderings of product vectors. This will allow us to integrate machine-learning techniques to improve overall relevance while maintaining the rich filtering, aggregating, and sorting capabilities we currently enjoy from Elasticsearch.
Procedure to run:

1. Use an external scipy/scikit-learn k-means algorithm on your data.
2. Compute centroids and save them into a file "centroids.txt" as a numpy array in binary form (use the numpy array "tofile" function).
3. Also compute labels for all points (which centroid each belongs to) and save them into a file.
4. Put the centroids.txt file inside elasticsearch/x-pack/plugin/vectors/src/main/resources, build Elasticsearch from it, and use this build for testing.
5. From any client, create the index:

```
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 128
      }
    }
  }
}
```

6. Index your vectors from any client, using the labels file computed in step 3:

```
{
  "vector": {
    "centroid": 39,
    "value": [0.12, 22.01, ...]
  }
}
```

7. Find the closest points with an ann query:

```
{
  "query": {
    "script_score": {
      "query": {
        "ann": {
          "field": "vector",
          "number_of_probes": 3,
          "query_vector": [3.4, 10.12, ...]
        }
      },
      "script": {
        "source": "1 / (1 + l2norm(params.query_vector, doc['vector']))",
        "params": {
          "query_vector": [3.4, 10.12, ...]
        }
      }
    }
  }
}
```

Relates to elastic#42326
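A minimal sketch of steps 1-3 in Python, assuming scikit-learn and numpy (the cluster count, dtype, and stand-in data are illustrative assumptions):

```python
# Steps 1-3: cluster the vectors, dump centroids in binary form, dump labels.
import numpy as np
from sklearn.cluster import KMeans

vectors = np.random.rand(10_000, 128).astype(np.float32)  # stand-in data

# Step 1: run k-means externally.
kmeans = KMeans(n_clusters=100, random_state=0).fit(vectors)

# Step 2: save the centroids as a numpy array in binary form via tofile.
kmeans.cluster_centers_.astype(np.float32).tofile("centroids.txt")

# Step 3: save the centroid label of every point.
np.savetxt("labels.txt", kmeans.labels_, fmt="%d")
```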
@mayya-sharipova I'd like to use dense vectors of floats (dimensions of 1x2048, the last layer of ResNet50) for visual search.
@RyanZotti Thanks for your use case. From 7.6 it will be possible to create vectors of 2048 dims (currently they are restricted to 1024 dims).
When will this be available in the enterprise edition? This is a critical feature for modern search.
@mayya-sharipova Check this out: https://github.com/facebookresearch/faiss
It would be nice to use some of these suggested distance metrics in queries that reference an already-indexed document's vector. This would eliminate a round-trip network request (fetching the source document's vector just to send it right back as part of a new query).
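To illustrate the round trip in question, a sketch with the official elasticsearch Python client, assuming a 7.6+ dense_vector field named my_vector and a hypothetical index and document ID (this is the workaround the comment wants to avoid):

```python
# The round trip: fetch a stored vector only to send it straight back.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Network hop 1: retrieve the source document's vector.
query_vector = es.get(index="my_index", id="42")["_source"]["my_vector"]

# Network hop 2: send the same vector back inside a script_score query.
resp = es.search(index="my_index", body={
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    }
})
```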
@dpmccabe there's also the general case of creating a document embedding based on your entire corpus of documents, encoding each document in that embedding, and then encoding a search query into that vector space and searching by similarity. It's a general mechanism for search that can work better than, or in combination with, Elasticsearch's existing methods. It is surprising this doesn't already exist!
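A minimal sketch of that pattern with scikit-learn, where LSA embeddings stand in for whatever learned document embedding you would actually use; the corpus and parameters are made up:

```python
# Embed a corpus, embed the query into the same space, search by similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

docs = ["how to cook rice", "ranking with neural nets", "vector search in lucene"]

# Fit a corpus-level embedding (TF-IDF followed by SVD, i.e. LSA).
tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(tfidf.fit_transform(docs))

# Index the document vectors for cosine-similarity lookup.
index = NearestNeighbors(metric="cosine").fit(doc_vecs)

# Encode a query into the same vector space and retrieve by similarity.
query_vec = svd.transform(tfidf.transform(["cooking rice"]))
distances, indices = index.kneighbors(query_vec, n_neighbors=2)
print(docs[indices[0][0]])  # expected: "how to cook rice"
```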
@mayya-sharipova it's great to see ANN in Elasticsearch being worked on. Embedding-driven search is something we use here at Cookpad. A scalable solution to nearest-vector search within Elasticsearch would be very useful. Much of the rest of our search stack is Elasticsearch, so moving ANN into Elasticsearch is more attractive than monolithic ANN systems (e.g., Faiss, Annoy, etc.). In terms of scale, we're looking to score over a corpus on the order of 10^6 to 10^7 vectors, and the corpus will continue to grow quickly. Metrics-wise, cosine distance is suitable for our current applications of ANN. The primary content so far is text and images – no surprise that one of our key data types here is recipes! Text and image are jointly combined into a single embedding per data item. The upper limit of 1024 to 2048 dimensions will be fine for now. Other features that are interesting:
Very excited about ANN search coming to Elastic. As @rjurney mentioned, it's a cornerstone of modern search. And outside of the commercial space, I've met research scientists and academics who'd love to drop their embeddings into Elasticsearch – rather than Faiss or Annoy – to get the combined benefit of scalable ANN + simple and flexible querying + a route to multi-node parallelism if needed. They'd love it if this were OSS rather than licensed.
@mayya-sharipova great to see this reopened. There are a heap of application domains where our clients would like to make use of ANN for doc retrieval:
A bit hard to tell, but are things stalled around https://issues.apache.org/jira/browse/LUCENE-9322?
@BountyCountry Thanks for sharing your applications.
This indeed may progress slower. Since vectors would be a new format, it would be nice to get approval from the community to make sure we cover broad enough cases.
@mayya-sharipova - thanks Mayya! Yes, we are relatively agnostic to formats, although it would be nice to have both dense and sparse vector queries available. This repo looks pretty on the money! Have you played with it at all? http://elastiknn.klibisz.com/api/#nearest-neighbor-queries
LMK if you all have questions or would like to contribute to elastiknn. There's also a rough roadmap in the wiki.
This has been open for more than a year. Is this feature dead? Are any resources being put into building this or is it still in the investigative stage? Is this at all important to Elastic? This is a basic feature of modern search and Elastic lacks it. @mayya-sharipova Is that going to change in the next six months? Year? Two years? Five years? What is the timeline?
+1 @rjurney
@rjurney @joseph-macraty, and others: I don't mean to hijack the thread, but I have made a lot of progress on an implementation of KNN/ANN for Elasticsearch here: https://github.com/alexklibisz/elastiknn
Right now my main focus is benchmarking, specifically integrating it with ann-benchmarks. This is all a side/hobby project for me, so I appreciate any help. A huge help right now would be having people just try it out and provide issues/feedback.
At this point it is crystal clear that this feature will not be solved by Elastic HQ; it will be solved by the community, who will advertise that fact here. Then the winning package will be merged into Elastic.
@alexklibisz we are successfully using your plugin on 7.6.2 with nearest-neighbour queries (angular) against dense vectors. Everything worked out of the box and was snappy, so a huge thank you for your work. @mayya-sharipova - this one needs backing :)
@mayya-sharipova +1 to @BountyCountry's comment.
Adding my few cents here. The common problem with most vector-store libs is that they don't support updating or deleting an element in an existing index. Integrating standard vector-search libs into Lucene or ES seems like a good idea, since we can make use of functionality like merge policies, the live-docs bitset, and post-processing of results. Though Milvus is built on top of various vector libs, it handles all these fundamental problems on its own. People who are looking for KNN support on ES can use either elastiknn or opendistro-KNN.
I think both of these are the right places to start, unless you're waiting for ES to have a native ANN feature.
@raman-rajarathinam I have some initial benchmarking results here: http://elastiknn.klibisz.com/performance/. I started working on Elastiknn before opendistro k-nn was released. I haven't measured it exactly, but I wouldn't be surprised if opendistro k-nn is faster than Elastiknn, since it's using a C library under the hood. However, introducing a subprocess with non-trivial memory usage comes with some deployment tradeoffs, like you mentioned.
Hi @mayya-sharipova. I see that both https://issues.apache.org/jira/browse/LUCENE-9004 and https://issues.apache.org/jira/browse/LUCENE-9322 have been resolved. Does this mean that ANN functionality will be exposed any time soon?
@mayya-sharipova can you please update this ticket? It seems about a million people need this feature and Lucene has the capability. Where is it? :)
HNSW has been implemented in preparation for Lucene 9.0, as mentioned above. A prerequisite to Elasticsearch using that implementation is upgrading to Lucene 9.0 once it's released. The first steps of the upgrade are a work in progress, and you can follow #73324 for more details on that process. To be clear, we don't plan on including Lucene 9.0 in Elasticsearch 7.x, but rather starting in 8.0. Once Lucene 9.0 is released and integrated into Elasticsearch 8.0, we will be in a position to create a good integration and experience for building and searching ANN indices. One of the reasons this takes us some time is to ensure that the performance, feature set, and usability are on par with other Elasticsearch features. When there are more concrete details to share about ANN, we will certainly do so. We fully acknowledge the demand for this feature!
A progress update -- we have been collaborating on some Lucene changes to fill in missing features and improve performance:
We're now thinking through how ANN will fit into the Elasticsearch API. I plan to open a "meta issue" soon with a list of the tasks needed to complete the initial integration.
Is there any update on this? Could you also provide a rough timeline for the integration?
Posting a friendly annual reminder that the KNN/ANN feature is already implemented in https://github.com/alexklibisz/elastiknn. :) I'd encourage folks to try out the Elastiknn plugin while waiting on Elastic to integrate Lucene's ANN implementation. Speaking from experience, there are a number of non-trivial problems to solve to ensure a reliable, scalable implementation of ANN. AFAIK, Elastic will be using the HNSW-based ANN implementation from Lucene; Elastiknn uses another approach called LSH (locality-sensitive hashing). They have some interesting tradeoffs, which are described at a high level in Pinecone's blog post: https://www.pinecone.io/learn/vector-indexes/. For a deeper dive on LSH, I wrote this post: https://elastiknn.com/posts/tour-de-elastiknn-august-2021/ I'm sure Elastic can eventually get this integrated, but if you're excited to try out ANN for a proof of concept or just to learn about KNN/ANN, Elastiknn is ready to try, and I'd be happy to hear any feedback. Cheers
Here is another Elasticsearch kNN plugin. It allows pre-filtering for multimodal search and scales to billions of documents with ANN. https://www.gsitechnology.com/sites/default/files/AppNotes/GSIT-Elasticsearch-Plugin-AppBrief.pdf
I'm sorry for the delay; it took some time to find consensus on a plan. I opened an issue here to track the implementation: #78473. Since this feature will be based on Lucene's new ANN support (which is shipping in its upcoming 9.0 release), it only targets Elasticsearch 8.x. We don't give exact release dates, but for context, we are hard at work on both ANN and the 8.0 release.
I'm going to close this now that we have an implementation plan here: #78473. Thank you so much for the insights and feedback on this issue (and your incredible patience)!
ANN (approximate nearest neighbours) will be a licensed feature of Elasticsearch (not OSS).
We plan to implement prototypes of various ANN algorithms for different distance metrics:
We are interested in users' feedback about:
We have decided to adopt the Lucene implementation of ANN search, so development of ANN search has moved here. Relevant Lucene issues: https://issues.apache.org/jira/browse/LUCENE-9004, https://issues.apache.org/jira/browse/LUCENE-9322, https://issues.apache.org/jira/browse/LUCENE-9136