feat: Elasticsearch vector database #4188

Merged
merged 18 commits into from
May 13, 2024
19 changes: 19 additions & 0 deletions Makefile
@@ -310,6 +310,25 @@ test-python-universal-cassandra-no-cloud-providers:
			not test_snowflake" \
		sdk/python/tests

test-python-universal-elasticsearch-online:
	PYTHONPATH='.' \
		FULL_REPO_CONFIGS_MODULE=sdk.python.feast.infra.online_stores.contrib.elasticsearch_repo_configuration \
		PYTEST_PLUGINS=sdk.python.tests.integration.feature_repos.universal.online_store.elasticsearch \
		python -m pytest -n 8 --integration \
			-k "not test_universal_cli and \
			not test_go_feature_server and \
			not test_feature_logging and \
			not test_reorder_columns and \
			not test_logged_features_validation and \
			not test_lambda_materialization_consistency and \
			not test_offline_write and \
			not test_push_features_to_offline_store and \
			not gcs_registry and \
			not s3_registry and \
			not test_universal_types and \
			not test_snowflake" \
		sdk/python/tests

test-python-universal:
	python -m pytest -n 8 --integration sdk/python/tests

2 changes: 1 addition & 1 deletion docs/reference/alpha-vector-database.md
@@ -10,7 +10,7 @@ Below are supported vector databases and implemented features:
| Vector Database | Retrieval | Indexing |
|-----------------|-----------|----------|
| Pgvector | [x] | [ ] |
-| Elasticsearch | [ ] | [ ] |
+| Elasticsearch | [x] | [x] |
| Milvus | [ ] | [ ] |
| Faiss | [ ] | [ ] |

125 changes: 125 additions & 0 deletions docs/reference/online-stores/elasticsearch.md
@@ -0,0 +1,125 @@
# Elasticsearch online store (contrib)

## Description

The Elasticsearch online store supports materializing tabular feature values and embedding feature vectors into an Elasticsearch index for serving online features. \
The embedding feature vectors are stored as dense vectors and can be used for similarity search. For more information, see the Elasticsearch [dense vector documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html).

## Getting started
To use this online store, run `pip install 'feast[elasticsearch]'`. You can then get started by running `feast init -t elasticsearch`.

## Example

{% code title="feature_store.yaml" %}
```yaml
project: my_feature_repo
registry: data/registry.db
provider: local
online_store:
    type: elasticsearch
    host: ES_HOST
    port: ES_PORT
    user: ES_USERNAME
    password: ES_PASSWORD
    vector_len: 512
    write_batch_size: 1000
```
{% endcode %}

The full set of configuration options is available in [ElasticsearchOnlineStoreConfig](https://rtd.feast.dev/en/master/#feast.infra.online_stores.contrib.elasticsearch.ElasticsearchOnlineStoreConfig).

## Functionality Matrix


|                                                           | Elasticsearch |
| :-------------------------------------------------------- | :------- |
| write feature values to the online store | yes |
| read feature values from the online store | yes |
| update infrastructure (e.g. tables) in the online store | yes |
| teardown infrastructure (e.g. tables) in the online store | yes |
| generate a plan of infrastructure changes | no |
| support for on-demand transforms | yes |
| readable by Python SDK | yes |
| readable by Java | no |
| readable by Go | no |
| support for entityless feature views | yes |
| support for concurrent writing to the same key | no |
| support for ttl (time to live) at retrieval | no |
| support for deleting expired data | no |
| collocated by feature view | yes |
| collocated by feature service | no |
| collocated by entity key | no |

To compare this set of functionality against other online stores, please see the full [functionality matrix](overview.md#functionality-matrix).

## Retrieving online document vectors

The Elasticsearch online store supports retrieving the `top_k` document vectors closest to a query vector through `retrieve_online_documents`. Each document vector is a dense vector of floats.

{% code title="python" %}
```python
from feast import FeatureStore

# repo_path is the directory that contains feature_store.yaml
feature_store = FeatureStore(repo_path=".")

query_vector = [1.0, 2.0, 3.0, 4.0, 5.0]
top_k = 5

# Retrieve the top k closest features to the query vector

feature_values = feature_store.retrieve_online_documents(
    feature="my_feature",
    query=query_vector,
    top_k=top_k
)
```
{% endcode %}
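
To make "top k closest" concrete, the sketch below ranks a few in-memory vectors by Euclidean (L2) distance to a query. It is a brute-force illustration only: `top_k_closest` and the sample documents are hypothetical names, not part of Feast, and Elasticsearch runs the search inside the index using whichever similarity the index was created with.

{% code title="python" %}
```python
import math
from typing import Dict, List, Tuple


def top_k_closest(
    query: List[float],
    documents: Dict[str, List[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return up to top_k (key, distance) pairs, closest first (hypothetical helper)."""
    # Score every stored vector by its Euclidean distance to the query.
    scored = [(key, math.dist(query, vector)) for key, vector in documents.items()]
    # A smaller distance means a closer match.
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Illustrative data only; real vectors must have `vector_len` dimensions.
documents = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
closest = top_k_closest([1.0, 0.1], documents, top_k=2)
```
{% endcode %}

Asking for more results than there are documents simply returns all of them, ordered by distance.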

## Indexing
Currently, the index mapping in the Elasticsearch online store is configured as:

{% code title="indexing_mapping" %}
```json
"properties": {
    "entity_key": {"type": "binary"},
    "feature_name": {"type": "keyword"},
    "feature_value": {"type": "binary"},
    "timestamp": {"type": "date"},
    "created_ts": {"type": "date"},
    "vector_value": {
        "type": "dense_vector",
        "dims": config.online_store.vector_len,
        "index": "true",
        "similarity": config.online_store.similarity,
    },
}
```
```
{% endcode %}
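
The mapping above depends on two configuration values, `vector_len` and `similarity`. As a minimal sketch of how such a mapping could be assembled in Python, the hypothetical helper `build_index_mapping` below takes both as plain arguments; it is not the function Feast uses internally, and `"cosine"` is only an example similarity value.

{% code title="python" %}
```python
from typing import Any, Dict


def build_index_mapping(vector_len: int, similarity: str) -> Dict[str, Any]:
    """Assemble the index mapping shown above (hypothetical helper, not Feast API)."""
    return {
        "properties": {
            "entity_key": {"type": "binary"},
            "feature_name": {"type": "keyword"},
            "feature_value": {"type": "binary"},
            "timestamp": {"type": "date"},
            "created_ts": {"type": "date"},
            "vector_value": {
                "type": "dense_vector",
                # Must match the length of every embedding written to the index.
                "dims": vector_len,
                "index": "true",
                # The similarity is fixed in the mapping, not chosen per query.
                "similarity": similarity,
            },
        }
    }


mapping = build_index_mapping(vector_len=512, similarity="cosine")
```
{% endcode %}
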
The `online_read` API issues the following query:

{% code title="online_read_mapping" %}
```json
"query": {
    "bool": {
        "must": [
            {"terms": {"entity_key": entity_keys}},
            {"terms": {"feature_name": requested_features}},
        ]
    }
},
```
{% endcode %}
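
A minimal sketch of building this query body as a Python dictionary follows. `build_online_read_query` is a hypothetical helper, and it assumes the entity keys have already been serialized to strings; how Feast encodes them is not shown here.

{% code title="python" %}
```python
from typing import Any, Dict, List


def build_online_read_query(
    entity_keys: List[str], requested_features: List[str]
) -> Dict[str, Any]:
    """Build a bool query matching the given keys and features (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                # Both clauses must match: the document belongs to one of the
                # entity keys AND holds one of the requested features.
                "must": [
                    {"terms": {"entity_key": entity_keys}},
                    {"terms": {"feature_name": requested_features}},
                ]
            }
        }
    }


# Placeholder values for illustration only.
read_query = build_online_read_query(["key_1", "key_2"], ["my_feature"])
```
{% endcode %}
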

The similarity search API uses the following kNN query parameters:

{% code title="similarity_search_mapping" %}
```json
{
    "field": "vector_value",
    "query_vector": embedding_vector,
    "k": top_k,
}
```
{% endcode %}
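
The same parameters can be sketched as a small Python function. `build_knn_query` is a hypothetical helper used only for illustration; how the resulting dictionary is sent to Elasticsearch is left out. No distance metric appears in it, which is consistent with the metric being fixed by the `similarity` setting of the index mapping.

{% code title="python" %}
```python
from typing import Any, Dict, List


def build_knn_query(embedding_vector: List[float], top_k: int) -> Dict[str, Any]:
    """Build the kNN parameters shown above (hypothetical helper, not Feast API)."""
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return {
        # Name of the dense_vector field defined in the index mapping.
        "field": "vector_value",
        "query_vector": embedding_vector,
        # Number of nearest neighbors to return.
        "k": top_k,
    }


knn_query = build_knn_query([1.0, 2.0, 3.0, 4.0, 5.0], top_k=5)
```
{% endcode %}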

These APIs are subject to change in future versions of Feast to improve performance and usability.
6 changes: 3 additions & 3 deletions sdk/python/feast/feature_store.py
@@ -1886,7 +1886,7 @@ def retrieve_online_documents(
feature: str,
query: Union[str, List[float]],
top_k: int,
- distance_metric: str,
+ distance_metric: Optional[str] = None,
Collaborator Author

have to make it optional as elasticsearch doesn't allow specifying the metric in the online API. Instead, it has to update the index.

Makes sense can we add some explicit tests for this to highlight the behavior? Also this is awesome.

) -> OnlineResponse:
"""
Retrieves the top k closest document features. Note, embeddings are a subset of features.
@@ -1911,7 +1911,7 @@ def _retrieve_online_documents(
feature: str,
query: Union[str, List[float]],
top_k: int,
- distance_metric: str = "L2",
+ distance_metric: Optional[str] = None,
):
if isinstance(query, str):
raise ValueError(
@@ -2209,7 +2209,7 @@ def _retrieve_from_online_store(
requested_feature: str,
query: List[float],
top_k: int,
- distance_metric: str,
+ distance_metric: Optional[str],
) -> List[Tuple[Timestamp, "FieldStatus.ValueType", Value, Value, Value]]:
"""
Search and return document features from the online document store.