
Investigate various implementations of ann search for vector fields #42326

Closed
mayya-sharipova opened this issue May 21, 2019 · 54 comments
Labels: >feature, feedback_needed, :Search Relevance/Vectors, stalled, Team:Search Relevance


@mayya-sharipova (Contributor) commented May 21, 2019

ANN (approximate nearest neighbours) will be a licensed feature of Elasticsearch (not OSS).

We plan to implement prototypes of several ANN algorithms for different distance metrics:

  • LSH and multi-probe LSH for Euclidean distance (a minimal sketch follows this list)
  • partition trees for Euclidean/cosine distance
  • clustering-based approaches, including product quantization
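
As a back-of-the-envelope illustration of the first bullet, the standard LSH family for Euclidean distance hashes a vector v as h(v) = floor((a · v + b) / w) for a random direction a and offset b. A minimal numpy sketch (the function name and the l/k/w parameters mirror the prototype mapping that appears later in this thread, but everything here is illustrative, not a proposed API):

import numpy as np

def make_euclidean_lsh(dims, k, l, w, seed=0):
    """Build l hash tables, each concatenating k random-projection hashes.

    h(v) = floor((a . v + b) / w) is the standard LSH family for L2 distance.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(l, k, dims))   # random projection directions
    b = rng.uniform(0, w, size=(l, k))  # random offsets in [0, w)

    def hash_vector(v):
        # One k-dimensional integer key per table; vectors whose keys collide
        # in any of the l tables become candidate nearest neighbours.
        return [tuple(np.floor((a[i] @ v + b[i]) / w).astype(int)) for i in range(l)]

    return hash_vector

hasher = make_euclidean_lsh(dims=5, k=5, l=8, w=3)
print(hasher(np.array([-0.7862, -0.3771, 0.6328, -0.4836, -0.2992])))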

We are interested in users' feedback about:

  • application domains
  • which distance metrics are used (Euclidean, cosine, Hamming, etc.)
  • which data types the vectors hold (integers, bits, longs, floats, etc.)

We have decided to adopt the Lucene implementation of ANN search, so development of ANN search has moved there. Relevant Lucene issues: https://issues.apache.org/jira/browse/LUCENE-9004, https://issues.apache.org/jira/browse/LUCENE-9322, https://issues.apache.org/jira/browse/LUCENE-9136

@mayya-sharipova added the :Search Relevance/Ranking label on May 21, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-search

@mayya-sharipova changed the title from "Investigate various implementations of approximate KNN for vector fields" to "Investigate various implementations of ann search for vector fields" on Jun 4, 2019
@wmelton commented Jun 4, 2019

You should also consider Covering-LSH; with some recent optimizations, it appears to solve the false-negative problem common to LSH approaches.

https://arxiv.org/abs/1507.03225

There appear to be some early implementations in C++ already available; however, they do not appear to include the Hadamard-code construction from the later revisions of the paper, which offers performance improvements.

https://github.com/kasperisager/hemingway

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jul 15, 2019
PUT my_index
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "my_vector": {
        "type": "dense_vector_lsh",
        "dims": 5,
        "l" : 8,
        "k" : 5,
        "w" : 3
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_vector" : [-0.7862,  -0.3771, 0.6328, -0.4836, -0.2992]
}

GET my_index/_search?_source=false
{
  "query": {
    "ann": {
      "field": "my_vector",
      "number_of_probes": 3,
      "query_vector": [-0.6, 0, 0.6, -0.4, 0]
    }
  }
}
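
(Presumably l, k, and w in the mapping above are the usual LSH parameters, i.e. the number of hash tables, the number of hashes per table, and the bucket width, and number_of_probes controls the multi-probe search; note this mapping is a prototype, not a released API.)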

Relates to elastic#42326
@markhuds

@mayya-sharipova are you interested in using an existing commodity approximate nearest neighbours implementation or implementing an algorithm from scratch? Annoy is a simple-to-use, simple-to-tune library created by Spotify to perform ANN search from an index stored on disk. It indexes vectors with integers, which I think you also do behind the scenes. It looks like they also have Java bindings, so it could fit well with your stack: https://github.com/spotify/annoy-java

@mayya-sharipova (Contributor, Author)

@markhuds Thanks for your suggestion. We are currently evaluating different ANN approaches based on performance, but also on how they fit into our Elasticsearch/Lucene infrastructure. We have considered Spotify Annoy's approach as well; we still need to make a decision.

We are interested in your use case for ANN search. For what use case have you found Annoy compelling?

@wmelton commented Sep 12, 2019

@mayya-sharipova Isn't there an issue with using k-means as a possible solution in that you must also continually solve the 'optimal cluster count' problem as the index grows?

K-means could make sense if the domain were well known and considerable historical data existed from which to create meaningful centroids and choose the cluster count. K-means also assumes that you don't expect vector attributes to vary much from the current state, doesn't it?

@mayya-sharipova (Contributor, Author)

@wmelton

> Isn't there an issue with using k-means as a possible solution in that you must also continually solve the 'optimal cluster count' problem as the index grows?

Indeed, you are right: the sample on which k-means centroids are computed should remain representative even as the index grows. It seems to me that for all ANN algorithms the sample on which you build a model should be representative enough, and if your new data diverge from the current state, then the model has to be rebuilt. We still need to figure out how to organize this in Elasticsearch.
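
To make the tuning burden in this exchange concrete, here is one common way to pick the cluster count on a representative sample, using scikit-learn's silhouette score. This is purely illustrative of the problem discussed above, not anything proposed for Elasticsearch; as the index grows and drifts, the whole selection step has to be re-run on a fresh sample:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_cluster_count(sample, candidates=(8, 16, 32, 64)):
    # Fit k-means for several candidate k and keep the best silhouette score.
    # This is the 'optimal cluster count' problem: the winning k depends on
    # the sample, so it drifts as the index grows and must be re-evaluated.
    best_k, best_score = None, -1.0
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sample)
        score = silhouette_score(sample, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

sample = np.random.rand(2000, 32)  # stand-in for a sample of indexed vectors
print(pick_cluster_count(sample))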

@markhuds

@mayya-sharipova We intend to use output from a deep ranking model that embeds products and queries into a retrieval space such that doing a nearest neighbor search with a query vector will retrieve optimal orderings of product vectors. This will allow us to integrate machine learning techniques to improve overall relevance while maintaining the rich filtering, aggregating, and sorting capabilities we currently enjoy from elasticsearch.

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Sep 20, 2019
Procedure to run:
1. use an external scipy/scikit-learn k-means implementation on your data
2. compute centroids and save them into a file "centroids.txt" as
    a numpy array in binary form (use the numpy array "tofile" function)
3. also compute labels for all points (which centroid each point belongs to)
    and save them into a file; a sketch of steps 1-3 follows the procedure
4. put the centroids.txt file inside elasticsearch/x-pack/plugin/vectors/src/main/resources,
    build elasticsearch from it, and use this build for testing
5. from any client create index:
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 128
      }
    }
  }
}
6. index your vectors from any client using the labels file computed in step 3
{
  "vector": {
    "centroid": 39,
    "value": [0.12, 22.01, ...]
  }
}
7. find the closest points using an ann query
{
  "query": {
    "script_score": {
      "query": {
        "ann": {
          "field": "vector",
          "number_of_probes": 3,
          "query_vector": [3.4, 10.12, ...]
        }
      },
      "script": {
        "source": "1 / (1 + l2norm(params.query_vector, doc['vector']))",
        "params": {
          "query_vector": [3.4, 10.12, ...]
        }
      }
    }
  }
}
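
A minimal sketch of steps 1-3 above, under the assumptions spelled out in the procedure (scikit-learn for k-means, numpy tofile for the binary dumps; the input file name and cluster count are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Step 1: run k-means externally on your vectors (one row per vector).
vectors = np.load("vectors.npy").astype(np.float32)  # "vectors.npy" is a placeholder
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(vectors)

# Step 2: dump the centroids in binary form with numpy's tofile.
kmeans.cluster_centers_.astype(np.float32).tofile("centroids.txt")

# Step 3: record which centroid each point belongs to, for use at indexing time.
np.asarray(kmeans.labels_, dtype=np.int32).tofile("labels.txt")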

Relates to  elastic#42326
@RyanZotti

@mayya-sharipova I'd like to use dense vectors of floats (dimensions of 1x2048, the last layer of ResNet-50) for visual search.

@mayya-sharipova (Contributor, Author)

@RyanZotti Thanks for your use case. From 7.6 it will be possible to create vectors of 2048 dims (currently they are restricted to 1024 dims).

@rjurney commented Nov 20, 2019

When will this be available in the enterprise edition? This is a critical feature for modern search.

@rjurney commented Nov 22, 2019

@mayya-sharipova Check this out: https://github.com/facebookresearch/faiss

@dpmccabe

It would be nice to use some of these suggested distance metrics in more_like_this queries. Suppose I'm storing a dense vector for every document in an index, and I feel that this vector represents the document's place in the corpus better than an analyzer plus tf-idf could. Then I could query for similar documents using cosine similarity or some other metric.

This would eliminate a round-trip network request (getting the source document's vector just to send it right back as part of a new query).
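
For reference, the round trip being described looks roughly like this with the script_score vector functions that shipped in 7.3 (index and field names here are illustrative):

GET my_index/_doc/1?_source=my_vector

GET my_index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['my_vector']) + 1.0",
        "params": {
          "query_vector": [-0.6, 0, 0.6, -0.4, 0]
        }
      }
    }
  }
}

A more_like_this-style variant could accept a document ID and look up the stored vector server-side, which is exactly the round trip this suggestion would eliminate.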

@rjurney commented Nov 22, 2019

@dpmccabe there's also the general case of creating a document embedding based on your entire corpus of documents, encoding each document in that embedding, then encoding a search query into that vector space and searching by similarity. It's a general mechanism for search that can work better than Elasticsearch's existing methods, or better in combination with them.

It is surprising this doesn’t already exist!

@mattjw commented Nov 26, 2019

@mayya-sharipova it's great to see ANN in Elasticsearch is being worked on. Embedding-driven search is something we use here at Cookpad. A scalable solution to nearest vector search within Elasticsearch would be very useful. Much of the rest of our search stack is Elasticsearch, so moving ANN into Elasticsearch is more attractive than monolithic ANN systems (e.g., Faiss, Annoy, etc.).

In terms of scale, we're looking to score over a corpus on the order of 10^6 to 10^7 vectors, and the corpus will continue to grow quickly. dense_vector is great for a two-stage query-then-rescore approach, but of course it will be expensive to scale. (For the curious: we ran some exact NN (ENN) benchmarking experiments here to see how far this would take us.)

Metrics-wise, cosine distance is suitable for our current applications of ANN. The primary content types so far are text and images; no surprise that one of our key data types here is recipes! Text and image are jointly combined into a single embedding per data item. The upper limit of 1024 to 2048 dimensions will be fine for now.

Other features that are interesting:

  • Incrementally write to an ANN index and have it available for search as soon as possible, rather than waiting for a whole re-index.
  • More-like-this for embeddings similarity.

Very excited about ANN search coming to Elastic. As @rjurney mentioned, it's a cornerstone of modern search. And outside of the commercial space, I've met research scientists and academics who'd love to drop their embeddings into Elasticsearch, rather than Faiss or Annoy, to get the combined benefit of scalable ANN, simple and flexible querying, and a route to multi-node parallelism if needed. They'd love it if this were OSS rather than licensed.

@dogberto

@mayya-sharipova great to see this reopened.

There are a heap of application domains where our clients would like to make use of ANN for document retrieval.

A bit hard to tell, but are things stalled around https://issues.apache.org/jira/browse/LUCENE-9322 ?

@mayya-sharipova (Contributor, Author)

@BountyCountry Thanks for sharing your applications.

> A bit hard to tell, but are things stalled around https://issues.apache.org/jira/browse/LUCENE-9322 ?

This indeed may progress more slowly: since vectors would be a new format, it would be nice to get approval from the community to make sure we cover broad enough cases.

@dogberto

@mayya-sharipova - thanks Mayya! Yes, we are relatively agnostic to formats, although it would be nice to have both dense and sparse vector queries available.

This repo looks pretty on the money! Have you played with it at all? http://elastiknn.klibisz.com/api/#nearest-neighbor-queries
Perhaps it could be added as a supported plugin while the Lucene route is going slowly?

@alexklibisz

LMK if you all have questions or would like to contribute to elastiknn. There's also a rough roadmap in the wiki.

@rjurney commented Aug 7, 2020

This has been open for more than a year. Is this feature dead? Are any resources being put into building this or is it still in the investigative stage? Is this at all important to Elastic? This is a basic feature of modern search and Elastic lacks it.

@mayya-sharipova Is that going to change in the next six months? Year? Two years? Five years? What is the timeline?

@joseph-macraty

+1 @rjurney
With more searches shifting either completely or partially to semantic-based approaches, it's really important to have at least some implementation of ANN search. Since Elasticsearch is already integrated with our systems, it would be great if this could be done, instead of us having to move to something like FAISS or Annoy.

@alexklibisz

@rjurney @joseph-macraty, and others: I don't mean to hijack the thread, but I have made a lot of progress on an implementation of KNN/ANN for elasticsearch here: https://github.com/alexklibisz/elastiknn
It should be ready for people to start experimenting with; there are installation and API docs here: http://elastiknn.klibisz.com/
Some highlights (copying from the docs):

  • Datatypes to efficiently store floating-point and boolean vectors in Elasticsearch documents.
  • Exact nearest neighbor queries for five similarity functions: L1, L2, Angular, Jaccard, and Hamming.
  • Approximate nearest neighbor queries using Locality Sensitive Hashing and related algorithms for all five similarity functions.
  • Compose nearest neighbor queries with standard Elasticsearch queries.
  • Implemented using standard Elasticsearch and Lucene constructs, so indexing and queries scale horizontally with Elasticsearch.

Right now my main focus is benchmarking, specifically integrating it with ann-benchmarks.
After that I'd like to get it published for several ES versions (right now I'm only on 7.6.2), and add some more user-friendly tutorials.

This is all a side/hobby project for me, so I appreciate any help. A huge help right now would be having people just try it out and provide issues/feedback.

@rjurney commented Sep 9, 2020

At this point it is crystal clear that this feature will not be solved by Elastic HQ; it will be solved by the community, who will advertise that fact here. Then the winning package will be merged into Elastic.

@dogberto

> @alexklibisz: I don't mean to hijack the thread, but I have made a lot of progress on an implementation of KNN/ANN for elasticsearch here: https://github.com/alexklibisz/elastiknn […]

@alexklibisz we are successfully using your plugin on 7.6.2 with nearest neighbours queries (angular) against dense vectors. Everything worked out of the box and was snappy, so a huge thank you for your work. @mayya-sharipova - this one needs backing :)

@rjurney commented Sep 12, 2020

@mayya-sharipova +1 to @BountyCountry's comment.

@raman-r-4978 commented Sep 15, 2020

Adding my two cents here:

A common problem with most vector-store libraries is that they don't support updating or deleting an element in an existing index. Integrating standard vector search libraries into Lucene or ES seems a good idea, since we can make use of functionality like merge policies, the isAlive bitset, and post-processing of results. Though Milvus is built on top of various vector libraries, it handles all these fundamental problems on its own.

People who are looking for KNN support on ES can use either elastiknn or opendistro-KNN.

  1. Opendistro-KNN has integrated nmslib via an ES plugin (Java bindings) to support KNN. This plugin is also available in AWS. I am pretty sure they chose nmslib merely by seeing its performance on ANN benchmarks. I observed a few things while using Opendistro:

    • Graphs are built when you call iw.commit() and merged when segments get merged, so if you're going to index ~1M documents, it is going to take a lot of time.
    • The larger the dimension, the longer it takes to build the graph.
    • nmslib doesn't read its graph (index files) via mmap, which leads to higher memory consumption.
  2. On the other hand, elastiknn implemented everything from scratch. As @alexklibisz said, we really need some benchmark scores to compare elastiknn with native KNN libraries such as faiss, annoy, nmslib, hnswlib, etc. I haven't explored elastiknn much, so I can't give feedback on it.

I think both of these are the right place to start unless you're waiting for ES to have a native ANN feature.

@alexklibisz commented Sep 15, 2020

@raman-rajarathinam I have some initial benchmarking results here: http://elastiknn.klibisz.com/performance/. I started working on elastiknn before Opendistro k-NN was released. I haven't measured it exactly, but I wouldn't be surprised if Opendistro is faster than elastiknn, since it's using a native library under the hood. However, introducing a subprocess with non-trivial memory usage comes with some deployment tradeoffs, as you mentioned.

@ometaxas

> We have decided to contribute and adopt Lucene implementations on ann search, so development of ann search is moved here. Relevant Lucene issues: https://issues.apache.org/jira/browse/LUCENE-9004, https://issues.apache.org/jira/browse/LUCENE-9322, https://issues.apache.org/jira/browse/LUCENE-9136

Hi @mayya-sharipova. I see that both https://issues.apache.org/jira/browse/LUCENE-9004 and https://issues.apache.org/jira/browse/LUCENE-9322 have been resolved. Does this mean that ANN functionality will be exposed any time soon?

@rjurney commented Jul 8, 2021

@mayya-sharipova can you please update this ticket? It seems about a million people need this feature and Lucene has the capability. Where is it? :)

@joshdevins (Member)

HNSW has been implemented in preparation for Lucene 9.0, as mentioned above. A prerequisite to Elasticsearch using that implementation is upgrading to Lucene 9.0 once it's released. The first steps of the upgrade are a work in progress, and you can follow #73324 for more details on that process. To be clear, we don't plan on including Lucene 9.0 in Elasticsearch 7.x, but rather starting in 8.0. Once Lucene 9.0 is released and integrated into Elasticsearch 8.0, we will be in a position to build a good integration and experience for building and searching ANN indices. One of the reasons this takes us some time is to ensure that the performance, feature set, and usability are on par with other Elasticsearch features. When there are more concrete details to share about ANN, we will certainly do so. We fully acknowledge the demand for this feature!

@jtibshirani (Contributor)

A progress update: we have been collaborating on some Lucene changes to fill in missing features and improve performance.

We're now thinking through how ANN will fit into the Elasticsearch API. I plan to open a "meta issue" soon with a list of tasks we need to complete the initial integration.

@ometaxas

> We're now thinking through how ANN will fit into the Elasticsearch API. I plan to open a "meta issue" soon with a list of tasks we need to complete the initial integration.

Is there any update on this? Could you also provide a rough time-plan for such an integration?

@alexklibisz commented Sep 27, 2021

Posting a friendly annual reminder that the KNN/ANN feature is already implemented in https://github.com/alexklibisz/elastiknn. :)

I'd encourage folks to try out the Elastiknn plugin while waiting on Elastic to integrate Lucene's ANN implementation. Speaking from experience, there are a number of non-trivial problems to solve to ensure a reliable, scalable implementation of ANN.

AFAIK, Elastic will be using the HNSW-based ANN implementation from Lucene. Elastiknn uses another approach called LSH (Locality Sensitive Hashing). They have some interesting tradeoffs, which are described at a high level in Pinecone's blog post: https://www.pinecone.io/learn/vector-indexes/. For a deeper dive on LSH, I wrote this post: https://elastiknn.com/posts/tour-de-elastiknn-august-2021/

I'm sure Elastic can eventually get this integrated, but if you're excited to try out ANN for a proof-of-concept or just to learn about KNN/ANN, Elastiknn is ready to try, and I'd be happy to hear any feedback.

Cheers

@plassr commented Sep 27, 2021

Here is another Elasticsearch kNN plugin. It allows pre-filtering for multimodal search and scales to billions of documents with ANN. https://www.gsitechnology.com/sites/default/files/AppNotes/GSIT-Elasticsearch-Plugin-AppBrief.pdf

@jtibshirani (Contributor)

> Is there any update on this? Could you also provide a rough time-plan for such an integration?

I'm sorry for the delay; it took some time to find consensus on a plan. I opened an issue here to track the implementation: #78473. Since this feature will be based on Lucene's new ANN support (which is shipping in its upcoming 9.0 release), it only targets Elasticsearch 8.x. We don't give exact release dates, but for context, we are hard at work on both ANN and the 8.0 release.

@jtibshirani (Contributor)

I'm going to close this now that we have an implementation plan here: #78473. Thank you so much for the insights and feedback on this issue (and your incredible patience)!

@jtibshirani added the :Search Relevance/Vectors label and removed the :Search Relevance/Ranking label on Jul 21, 2022
@javanna added the Team:Search Relevance label and removed the Team:Search label on Jul 12, 2024