Adding stream ingestion alpha documentation (#2005)
* GitBook: [#332] Updating roadmap and adding stream push API docs

* GitBook: [#334] Fix typo in stream ingestion docs and update other references to streaming
adchia committed Nov 8, 2021
1 parent 3ee88f4 commit 18615f7
Showing 7 changed files with 102 additions and 57 deletions.
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -72,6 +72,7 @@
* [feature\_store.yaml](reference/feature-repository/feature-store-yaml.md)
* [.feastignore](reference/feature-repository/feast-ignore.md)
* [\[Alpha\] On demand feature view](reference/alpha-on-demand-feature-view.md)
* [\[Alpha\] Stream ingestion](reference/alpha-stream-ingestion.md)
* [\[Alpha\] Local feature server](reference/feature-server.md)
* [\[Alpha\] AWS Lambda feature server](reference/alpha-aws-lambda-feature-server.md)
* [Feast CLI reference](reference/feast-cli-commands.md)
15 changes: 7 additions & 8 deletions docs/getting-started/architecture-and-components/overview.md
@@ -1,32 +1,31 @@
# Overview

![Feast Architecture Diagram](../../.gitbook/assets/image%20%284%29.png)
![Feast Architecture Diagram](<../../.gitbook/assets/image (4).png>)

## Functionality

* **Create Batch Features:** ELT/ETL systems like Spark and SQL are used to transform data in the batch store.
* **Feast Apply:** The user \(or CI\) publishes versioned controlled feature definitions using `feast apply`. This CLI command updates infrastructure and persists definitions in the object store registry.
* **Feast Materialize:** The user \(or scheduler\) executes `feast materialize` which loads features from the offline store into the online store.
* **Feast Apply:** The user (or CI) publishes version-controlled feature definitions using `feast apply`. This CLI command updates infrastructure and persists definitions in the object store registry.
* **Feast Materialize:** The user (or scheduler) executes `feast materialize` which loads features from the offline store into the online store.
* **Model Training:** A model training pipeline is launched. It uses the Feast Python SDK to retrieve a training dataset and trains a model.
* **Get Historical Features:** Feast exports a point-in-time correct training dataset based on the list of features and entity dataframe provided by the model training pipeline.
* **Deploy Model:** The trained model binary \(and list of features\) are deployed into a model serving system. This step is not executed by Feast.
* **Deploy Model:** The trained model binary (and list of features) are deployed into a model serving system. This step is not executed by Feast.
* **Prediction:** A backend system makes a request for a prediction from the model serving service.
* **Get Online Features:** The model serving service makes a request to the Feast Online Serving service for online features using a Feast SDK.
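
The point-in-time correctness in **Get Historical Features** can be illustrated with plain pandas (a sketch of the idea only, not Feast's implementation; the column names and values are invented):

```python
import pandas as pd

# Entity dataframe: the (entity, event timestamp) rows we want features for.
entity_df = pd.DataFrame({
    "driver_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2021-05-13 10:00", "2021-05-14 10:00"]),
})

# Feature values as they were recorded over time in the offline store.
feature_df = pd.DataFrame({
    "driver_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2021-05-13 09:00", "2021-05-14 09:00"]),
    "conv_rate": [0.5, 0.7],
})

# Point-in-time join: for each entity row, take the latest feature value
# at or before that row's timestamp, so no future information leaks in.
training_df = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
)
print(training_df["conv_rate"].tolist())  # [0.5, 0.7]
```

Each entity row picks up the most recent feature value at or before its own timestamp, which is what makes the exported training dataset point-in-time correct.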

## Components

A complete Feast deployment contains the following components:

* **Feast Registry**: An object store \(GCS, S3\) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.
* **Feast Registry**: An object store (GCS, S3) based registry used to persist feature definitions that are registered with the feature store. Systems can discover feature data by interacting with the registry through the Feast SDK.
* **Feast Python SDK/CLI:** The primary user-facing SDK. Used to:
* Manage version controlled feature definitions.
* Materialize \(load\) feature values into the online store.
* Materialize (load) feature values into the online store.
* Build and retrieve training datasets from the offline store.
* Retrieve online features.
* **Online Store:** The online store is a database that stores only the latest feature values for each entity. The online store is populated by materialization jobs.
* **Online Store:** The online store is a database that stores only the latest feature values for each entity. The online store is populated by materialization jobs and from [stream ingestion](../../reference/alpha-stream-ingestion.md).
* **Offline Store:** The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. Feast does not manage the offline store directly, but runs queries against it.

{% hint style="info" %}
Java and Go Clients are also available for online feature retrieval.
{% endhint %}

6 changes: 3 additions & 3 deletions docs/getting-started/concepts/feature-view.md
Original file line number Diff line number Diff line change
@@ -25,7 +25,7 @@ driver_stats_fv = FeatureView(
Feature views are used during

* The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.
* Loading of feature values into an online store. Feature views determine the storage schema in the online store.
* Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from [stream sources](../../reference/alpha-stream-ingestion.md).
* Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.

{% hint style="info" %}
@@ -57,7 +57,7 @@ global_stats_fv = FeatureView(

"Entity aliases" can be specified to join `entity_dataframe` columns that do not match the column names in the source table of a FeatureView.

This could be used if a user has no control over these column names, or if multiple entities are subclasses of a more general entity. For example, "spammer" and "reporter" could be aliases of a "user" entity, and "origin" and "destination" could be aliases of a "location" entity as shown below.

It is suggested that you dynamically specify the new FeatureView name using `.with_name` and the `join_key_map` override using `.with_join_key_map`, instead of registering each new copy.

@@ -78,6 +78,7 @@ location_stats_fv= FeatureView(
)
```
{% endtab %}

{% tab title="temperatures_feature_service.py" %}
```python
from location_stats_feature_view import location_stats_fv
@@ -150,4 +151,3 @@ def transformed_conv_rate(features_df: pd.DataFrame) -> pd.DataFrame:
df['conv_rate_plus_val2'] = (features_df['conv_rate'] + features_df['val_to_add_2'])
return df
```

25 changes: 12 additions & 13 deletions docs/getting-started/faq.md
@@ -20,11 +20,11 @@ No, there are [feature views without entities](concepts/feature-view.md#feature-

### Does Feast provide security or access control?

Feast currently does not support any access control other than the access control required for the Provider's environment \(for example, GCP and AWS permissions\).
Feast currently does not support any access control other than the access control required for the Provider's environment (for example, GCP and AWS permissions).

### Does Feast support streaming sources?

Feast is actively working on this right now. Please reach out to the Feast team if you're interested in giving feedback!
Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support [push-based ingestion](../reference/alpha-stream-ingestion.md).

### Does Feast support composite keys?

@@ -42,12 +42,12 @@ Feast is designed to work at scale and support low latency online serving. Bench

Yes. Specifically:

* Simple lists / dense embeddings:
* BigQuery supports list types natively
* Redshift does not support list types, so you'll need to serialize these features into strings \(e.g. json or protocol buffers\)
* Feast's implementation of online stores serializes features into Feast protocol buffers and supports list types \(see [reference](https://github.com/feast-dev/feast/blob/master/docs/specs/online_store_format.md#appendix-a-value-proto-format)\)
* Sparse embeddings \(e.g. one hot encodings\)
* One way to do this efficiently is to have a protobuf or string representation of [https://www.tensorflow.org/guide/sparse\_tensor](https://www.tensorflow.org/guide/sparse_tensor)
* Simple lists / dense embeddings:
* BigQuery supports list types natively
* Redshift does not support list types, so you'll need to serialize these features into strings (e.g. json or protocol buffers)
* Feast's implementation of online stores serializes features into Feast protocol buffers and supports list types (see [reference](https://github.com/feast-dev/feast/blob/master/docs/specs/online\_store\_format.md#appendix-a-value-proto-format))
* Sparse embeddings (e.g. one hot encodings)
* One way to do this efficiently is to have a protobuf or string representation of [https://www.tensorflow.org/guide/sparse\_tensor](https://www.tensorflow.org/guide/sparse\_tensor)
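
A minimal sketch of serializing list features into strings for a store without native list support (illustrative only; the values and the sparse layout are invented):

```python
import json

# A dense embedding for one entity, stored as a JSON string in, e.g., Redshift.
embedding = [0.12, 0.05, 0.91]
serialized = json.dumps(embedding)

# Deserialize at read time before handing the value to a model.
deserialized = json.loads(serialized)
print(deserialized)  # [0.12, 0.05, 0.91]

# A sparse one-hot encoding can be stored compactly as size + hot indices,
# in the spirit of a tf.sparse.SparseTensor.
sparse = json.dumps({"size": 10000, "indices": [42, 7001]})
```

Protocol buffers work the same way, trading JSON's readability for a smaller encoded size.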

### Does Feast support X storage engine?

@@ -61,7 +61,7 @@ Please follow the instructions [here](../how-to-guides/adding-support-for-a-new-

Yes. There are two ways to use S3 in Feast:

* Using Redshift as a data source via Spectrum \([AWS tutorial](https://docs.aws.amazon.com/redshift/latest/dg/tutorial-nested-data-create-table.html)\), and then continuing with the [Running Feast with GCP/AWS](../how-to-guides/feast-gcp-aws/) guide. See a [presentation](https://youtu.be/pMFbRJ7AnBk?t=9463) we did on this at our apply\(\) meetup.
* Using Redshift as a data source via Spectrum ([AWS tutorial](https://docs.aws.amazon.com/redshift/latest/dg/tutorial-nested-data-create-table.html)), and then continuing with the [Running Feast with GCP/AWS](../how-to-guides/feast-gcp-aws/) guide. See a [presentation](https://youtu.be/pMFbRJ7AnBk?t=9463) we did on this at our apply() meetup.
* Using the `s3_endpoint_override` in a `FileSource` data source. This endpoint is more suitable for quick proofs of concept that won't necessarily scale for production use cases.

### How can I use Spark with Feast?
@@ -76,11 +76,11 @@ Please see the [roadmap](../roadmap.md).

### What is the difference between Feast 0.9 and Feast 0.10+?

Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is designed to be simple to install and use. Please see this [document](https://docs.google.com/document/d/1AOsr_baczuARjCpmZgVd8mCqTF4AZ49OEyU4Cn-uTT0) for more details.
Feast 0.10+ is much lighter weight and more extensible than Feast 0.9. It is designed to be simple to install and use. Please see this [document](https://docs.google.com/document/d/1AOsr\_baczuARjCpmZgVd8mCqTF4AZ49OEyU4Cn-uTT0) for more details.

### How do I migrate from Feast 0.9 to Feast 0.10+?

Please see this [document](https://docs.google.com/document/d/1AOsr_baczuARjCpmZgVd8mCqTF4AZ49OEyU4Cn-uTT0). If you have any questions or suggestions, feel free to leave a comment on the document!
Please see this [document](https://docs.google.com/document/d/1AOsr\_baczuARjCpmZgVd8mCqTF4AZ49OEyU4Cn-uTT0). If you have any questions or suggestions, feel free to leave a comment on the document!

### How do I contribute to Feast?

@@ -93,6 +93,5 @@ Feast Core and Feast Serving were both part of Feast Java. We plan to support Fe
{% hint style="info" %}
**Don't see your question?**

We encourage you to ask questions on [Slack](https://slack.feast.dev/) or [Github](https://github.com/feast-dev/feast). Even better, once you get an answer, add the answer to this FAQ via a [pull request](../project/development-guide.md)!
We encourage you to ask questions on [Slack](https://slack.feast.dev) or [Github](https://github.com/feast-dev/feast). Even better, once you get an answer, add the answer to this FAQ via a [pull request](../project/development-guide.md)!
{% endhint %}

60 changes: 30 additions & 30 deletions docs/how-to-guides/adding-or-reusing-tests.md
@@ -17,35 +17,35 @@ $ tree

.
├── e2e
│   └── test_universal_e2e.py
├── feature_repos
│   ├── repo_configuration.py
│   └── universal
│       ├── data_source_creator.py
│       ├── data_sources
│       │   ├── bigquery.py
│       │   ├── file.py
│       │   └── redshift.py
│       ├── entities.py
│       └── feature_views.py
├── offline_store
│   ├── test_s3_custom_endpoint.py
│   └── test_universal_historical_retrieval.py
├── online_store
│   ├── test_e2e_local.py
│   ├── test_feature_service_read.py
│   ├── test_online_retrieval.py
│   └── test_universal_online.py
├── registration
│   ├── test_cli.py
│   ├── test_cli_apply_duplicated_featureview_names.py
│   ├── test_cli_chdir.py
│   ├── test_feature_service_apply.py
│   ├── test_feature_store.py
│   ├── test_inference.py
│   ├── test_registry.py
│   ├── test_universal_odfv_feature_inference.py
│   └── test_universal_types.py
└── scaffolding
├── test_init.py
├── test_partial_apply.py
@@ -148,30 +148,30 @@ The key fixtures are the `environment` and `universal_data_sources` fixtures, wh

## Writing a new test or reusing existing tests

To add a new test to an existing test file:
### To add a new test to an existing test file

* Use the same function signatures as an existing test (e.g. use `environment` as an argument) to include the relevant test fixtures.
* If possible, expand an individual test instead of writing a new test, due to the cost of standing up offline / online stores.

To test a new offline / online store from a plugin repo:
### To test a new offline / online store from a plugin repo

* Install Feast in editable mode with `pip install -e`.
* The core tests for offline / online store behavior are parametrized by the `FULL_REPO_CONFIGS` variable defined in `feature_repos/repo_configuration.py`. To overwrite this variable without modifying the Feast repo, create your own file that contains a `FULL_REPO_CONFIGS` (which will require adding a new `IntegrationTestRepoConfig` or two) and set the environment variable `FULL_REPO_CONFIGS_MODULE` to point to that file. Then the core offline / online store tests can be run with `make test-python-universal`.
* See the [custom offline store demo](https://github.com/feast-dev/feast-custom-offline-store-demo) and the [custom online store demo](https://github.com/feast-dev/feast-custom-online-store-demo) for examples.
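
As a schematic sketch of what such a module might contain (the stand-in dataclass, module name, and creator path below are made up; the real `IntegrationTestRepoConfig` lives in Feast's test suite):

```python
# my_plugin/feast_repo_configs.py — hypothetical module, pointed to by the
# FULL_REPO_CONFIGS_MODULE environment variable.
from dataclasses import dataclass


# Stand-in for Feast's IntegrationTestRepoConfig (illustrative only; the
# real class is defined in Feast's integration test suite).
@dataclass
class IntegrationTestRepoConfig:
    provider: str
    offline_store_creator: str
    online_store: str


# The module must expose a FULL_REPO_CONFIGS variable for the test suite
# to parametrize the core offline / online store tests against.
FULL_REPO_CONFIGS = [
    IntegrationTestRepoConfig(
        provider="local",
        offline_store_creator="my_plugin.MyOfflineStoreCreator",
        online_store="sqlite",
    ),
]
```

With the environment variable set to this module, `make test-python-universal` would pick up these configs instead of the defaults.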

To include a new offline / online store in the main Feast repo:
### To include a new offline / online store in the main Feast repo

* Extend `data_source_creator.py` for your offline store.
* In `repo_configuration.py` add a new `IntegrationTestRepoConfig` or two (depending on how many online stores you want to test).
* Run the full test suite with `make test-python-integration`.

To include a new online store:
### To include a new online store

* In `repo_configuration.py`, add a new config that maps to a serialized version of the configuration you need in `feature_store.yaml` to set up the online store.
* In `repo_configuration.py`, add a new `IntegrationTestRepoConfig` for each offline store you want to test.
* Run the full test suite with `make test-python-integration`.

To use custom data in a new test:
### To use custom data in a new test

* Check `test_universal_types.py` for an example of how to do this.

45 changes: 45 additions & 0 deletions docs/reference/alpha-stream-ingestion.md
@@ -0,0 +1,45 @@
# \[Alpha] Stream ingestion

**Warning**: This is an _experimental_ feature. It's intended for early testing and feedback, and could change without warning in future releases.

{% hint style="info" %}
To enable this feature, run **`feast alpha enable direct_ingest_to_online_store`**
{% endhint %}

## Overview

Streaming data sources are important sources of feature values. A typical setup with streaming data looks like:

1. Raw events come in (stream 1)
2. Streaming transformations applied (e.g. `last_N_purchased_categories`) (stream 2)
3. Write stream 2 values to an offline store as a historical log for training
4. Write stream 2 values to an online store for low latency feature serving
5. Periodically materialize feature values from the offline store into the online store for improved correctness

Feast now allows users to push features previously registered in a feature view to the online store. This would most commonly be done from a stream processing job (e.g. a Beam or Spark Streaming job). Future versions of Feast will allow writing features directly to the offline store as well.

## Example

See [https://github.com/feast-dev/feast-demo](https://github.com/feast-dev/feast-demo) for an example of how to ingest stream data into Feast.

We register a feature view as normal; then, during stream processing (e.g. in a Kafka consumer), we push a dataframe matching the feature view's schema:

```python
event_df = pd.DataFrame.from_dict(
{
"driver_id": [1001],
"event_timestamp": [
datetime(2021, 5, 13, 10, 59, 42),
],
"created": [
datetime(2021, 5, 13, 10, 59, 42),
],
"conv_rate": [1.0],
"acc_rate": [1.0],
"avg_daily_trips": [1000],
}
)
store.write_to_online_store("driver_hourly_stats", event_df)
```

Feast will coordinate between pushed stream data and regular materialization jobs to ensure only the latest feature values are written to the online store. This ensures correctness in served features for model inference.
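
Before pushing, it can help to sanity-check that the dataframe's columns match the feature view's schema plus the join key and timestamp columns (a sketch; the column set comes from the example above):

```python
from datetime import datetime

import pandas as pd

event_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001],
        "event_timestamp": [datetime(2021, 5, 13, 10, 59, 42)],
        "created": [datetime(2021, 5, 13, 10, 59, 42)],
        "conv_rate": [1.0],
        "acc_rate": [1.0],
        "avg_daily_trips": [1000],
    }
)

# Join key + timestamp columns + the feature view's features.
expected_columns = {
    "driver_id", "event_timestamp", "created",
    "conv_rate", "acc_rate", "avg_daily_trips",
}
assert set(event_df.columns) == expected_columns
```

A mismatched column set is the most common reason a push fails, so failing fast in the stream job keeps bad rows out of the online store.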
7 changes: 4 additions & 3 deletions docs/roadmap.md
@@ -13,7 +13,7 @@ The list below contains the functionality that contributors are planning to deve
* [x] [Synapse source (community plugin)](https://github.com/Azure/feast-azure)
* [x] [Hive (community plugin)](https://github.com/baineng/feast-hive)
* [x] [Postgres (community plugin)](https://github.com/nossrannug/feast-postgres)
* [ ] Kafka source (Planned for Q4 2021)
* [x] Kafka source (with [push support into the online store](reference/alpha-stream-ingestion.md))
* [ ] Snowflake source (Planned for Q4 2021)
* [ ] HTTP source
* **Offline Stores**
@@ -38,7 +38,8 @@ The list below contains the functionality that contributors are planning to deve
* [ ] Cassandra
* **Streaming**
* [x] [Custom streaming ingestion job support](https://docs.feast.dev/how-to-guides/creating-a-custom-provider)
* [ ] Streaming ingestion on AWS (Planned for Q4 2021)
* [x] [Push-based streaming data ingestion](reference/alpha-stream-ingestion.md)
* [ ] Streaming ingestion on AWS
* [ ] Streaming ingestion on GCP
* **Feature Engineering**
* [x] On-demand Transformations (Alpha release. See [RFC](https://docs.google.com/document/d/1lgfIw0Drc65LpaxbUu49RCeJgMew547meSJttnUqz7c/edit#))
@@ -53,9 +54,9 @@ The list below contains the functionality that contributors are planning to deve
* [x] Python Client
* [x] REST Feature Server (Python) (Alpha release. See [RFC](https://docs.google.com/document/d/1iXvFhAsJ5jgAhPOpTdB3j-Wj1S9x3Ev\_Wr6ZpnLzER4/edit))
* [x] gRPC Feature Server (Java) (See [#1497](https://github.com/feast-dev/feast/issues/1497))
* [x] Push API
* [ ] Java Client
* [ ] Go Client
* [ ] Push API
* [ ] Delete API
* [ ] Feature Logging (for training)
* **Data Quality Management**
