merged main branch

feathr-ai · Jun 26, 2022 · b2ee907 · b2ee907
2 parents 8b582d3 + b5033c2
commit b2ee907
Show file tree

Hide file tree

Showing 12 changed files with 221 additions and 87 deletions.
diff --git a/docs/README.md b/docs/README.md
@@ -28,7 +28,6 @@ Feathr automatically computes your feature values and joins them to your trainin
 - **Native cloud integration** with simplified and scalable architecture, which is illustrated in the next section.
 - **Feature sharing and reuse made easy:** Feathr has built-in feature registry so that features can be easily shared across different teams and boost team productivity.
 
-
 ## Running Feathr on Azure with 3 Simple Steps
 
 Feathr has native cloud integration. To use Feathr on Azure, you only need three steps:
@@ -50,7 +49,7 @@ Feathr has native cloud integration. To use Feathr on Azure, you only need three
 If you are not using the above Jupyter Notebook and want to install Feathr client locally, use this:
 
 ```bash
-pip install -U feathr
+pip install feathr
 ```
 
 Or use the latest code from GitHub:
@@ -126,31 +125,30 @@ Read the [Streaming Source Ingestion Guide](https://linkedin.github.io/feathr/ho
 
 Read [Point-in-time Correctness and Point-in-time Join in Feathr](https://linkedin.github.io/feathr/concepts/point-in-time-join.html) for more details.
 
-
 ## Running Feathr Examples
 
-Follow the [quick start Jupyter Notebook](./feathr_project/feathrcli/data/feathr_user_workspace/product_recommendation_demo.ipynb) to try it out. There is also a companion [quick start guide](https://linkedin.github.io/feathr/quickstart.html) containing a bit more explanation on the notebook.
-
+Follow the [quick start Jupyter Notebook](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/product_recommendation_demo.ipynb) to try it out.
+There is also a companion [quick start guide](https://linkedin.github.io/feathr/quickstart_synapse.html) containing a bit more explanation on the notebook.
 
 ## Cloud Architecture
 
 Feathr has native integration with Azure and other cloud services, and here's the high-level architecture to help you get started.
 ![Architecture](images/architecture.png)
 
-# Next Steps
+## Next Steps
 
-## Quickstart
+### Quickstart
 
 - [Quickstart for Azure Synapse](quickstart_synapse.md)
 
-## Concepts
+### Concepts
 
 - [Feature Definition](concepts/feature-definition.md)
 - [Feature Generation](concepts/feature-generation.md)
 - [Feature Join](concepts/feature-join.md)
 - [Point-in-time Correctness](concepts/point-in-time-join.md)
 
-## How-to-guides
+### How-to-guides
 
 - [Azure Deployment](how-to-guides/azure-deployment.md)
 - [Local Feature Testing](how-to-guides/local-feature-testing.md)
@@ -159,4 +157,5 @@ Feathr has native integration with Azure and other cloud services, and here's th
 - [Feathr Job Configuration](how-to-guides/feathr-job-configuration.md)
 
 ## API Documentation
+
 - [Python API Documentation](https://feathr.readthedocs.io/en/latest/)
diff --git a/docs/concepts/feature-generation.md b/docs/concepts/feature-generation.md
@@ -1,16 +1,19 @@
 ---
 layout: default
-title: Feathr Feature Generation
+title: Feature Generation and Materialization
 parent: Feathr Concepts
 ---
 
-# Feature Generation
-Feature generation is the process to create features from raw source data into a certain persisted storage.
+# Feature Generation and Materialization
 
-User could utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused(usually in offline setting). Feature generation is also useful in generating embedding features. Embedding distill information from large data and it is usually more compact.
+Feature generation (also known as feature materialization) is the process to create features from raw source data into a certain persisted storage in either offline store (for further reuse), or online store (for online inference).
+
+User can utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in offline setting). Feature generation is also useful in generating embedding features, where those embeddings distill information from large data and is usually more compact.
 
 ## Generating Features to Online Store
-When we need to serve the models online, we also need to serve the features online. We provide APIs to generate features to online storage for future consumption. For example:
+
+When the models are served in an online environment, we also need to serve the corresponding features in the same online environment as well. Feathr provides APIs to generate features to online storage for future consumption. For example:
+
 ```python
 client = FeathrClient()
 redisSink = RedisSink(table_name="nycTaxiDemoFeature")
@@ -21,12 +24,16 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
 client.materialize_features(settings)
 ```
 
-([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
-[RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink)
+More reference on the APIs:
+
+- [MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
+- [RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink)
 
 In the above example, we define a Redis table called `nycTaxiDemoFeature` and materialize two features called `f_location_avg_fare` and `f_location_max_fare` to Redis.
 
-It is also possible to backfill the features for a previous time range, like below. If the `BackfillTime` part is not specified, it's by default to `now()` (i.e. if not specified, it's equivilant to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).
+## Feature Backfill
+
+It is also possible to backfill the features for a particular time range, like below. If the `BackfillTime` part is not specified, it's by default to `now()` (i.e. if not specified, it's equivalent to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).
 
 ```python
 client = FeathrClient()
@@ -39,29 +46,34 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
 client.materialize_features(settings)
 ```
 
-([BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime),
-[client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features))
+Note that if you don't have features available in `now`, you'd better specify a `BackfillTime` range where you have features.
 
-## Consuming the online features
+Also, Feathr will submit a materialization job for each of the step for performance reasons. I.e. if you have 
+`BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))`, Feathr will submit 20 jobs to run in parallel for maximum performance.
 
-```python
-client.wait_job_to_finish(timeout_sec=600)
+More reference on the APIs:
+
+- [BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime)
+- [client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features)
 
-res = client.get_online_features('nycTaxiDemoFeature', '265', [
-                                     'f_location_avg_fare', 'f_location_max_fare'])
+
+
+## Consuming features in online environment
+
+After the materialization job is finished, we can get the online features by querying the `feature table`, corresponding `entity key` and a list of `feature names`. In the example below, we query the online features called `f_location_avg_fare` and `f_location_max_fare`, and query with a key `265` (which is the location ID).
+
+```python
+res = client.get_online_features('nycTaxiDemoFeature', '265', ['f_location_avg_fare', 'f_location_max_fare'])
 ```
 
-([client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features))
+More reference on the APIs:
+- [client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features)
 
-After we finish running the materialization job, we can get the online features by querying the feature name, with the 
-corresponding keys. In the example above, we query the online features called `f_location_avg_fare` and 
-`f_location_max_fare`, and query with a key `265` (which is the location ID).
+## Materializing Features to Offline Store
 
-## Generating Features to Offline Store
+This is useful when the feature transformation is compute intensive and features can be re-used. For example, you have a feature that needs more than 24 hours to compute and the feature can be reused by more than one model training pipeline. In this case, you should consider generating features to offline.
 
-This is a useful when the feature transformation is computation intensive and features can be re-used. For example, you 
-have a feature that needs more than 24 hours to compute and the feature can be reused by more than one model training 
-pipeline. In this case, you should consider generate features to offline. Here is an API example:
+The API call is very similar to materializing features to online store, and here is an API example:
 
 ```python
 client = FeathrClient()
@@ -73,14 +85,14 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
 client.materialize_features(settings)
 ```
 
-This will generate features on latest date(assuming it's `2022/05/21`) and output data to the following path: 
+This will generate features on latest date(assuming it's `2022/05/21`) and output data to the following path:
 `abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2022/05/21`
 
-You can also specify a BackfillTime so the features will be generated for those dates. For example:
+You can also specify a `BackfillTime` so the features will be generated only for those dates. For example:
 
 ```Python
 backfill_time = BackfillTime(start=datetime(
-    2020, 5, 20), end=datetime(2020, 5, 20), step=timedelta(days=1))
+    2020, 5, 10), end=datetime(2020, 5, 20), step=timedelta(days=1))
 offline_sink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/")
 settings = MaterializationSettings("nycTaxiTable",
                                    sinks=[offline_sink],
@@ -89,8 +101,32 @@ settings = MaterializationSettings("nycTaxiTable",
                                    backfill_time=backfill_time)
 ```
 
-This will generate features only for 2020/05/20 for me and it will be in folder:
-`abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`
+This will generate features from `2020/05/10` to `2020/05/20` and the output will have 11 folders, from
+`abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/10` to `abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`. Note that currently Feathr only supports materializing data in daily step (i.e. even if you specify an hourly step, the generated features in offline store will still be presented in a daily hierarchy).
+
+You can also specify the format of the materialized features in the offline store by using `execution_configurations` like below. Please refer to the [documentation](../how-to-guides/feathr-job-configuration.md) here for those configuration details.
+
+```python
+
+from feathr import HdfsSink
+offlineSink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_data/")
+# Materialize two features into a Offline store.
+settings = MaterializationSettings("nycTaxiMaterializationJob",
+                                   sinks=[offlineSink],
+                                   feature_names=["f_location_avg_fare", "f_location_max_fare"])
+client.materialize_features(settings, execution_configurations={ "spark.feathr.outputFormat": "parquet"})
+
+```
+
+For reading those materialized features, Feathr has a convenient helper function called `get_result_df` to help you view the data. For example, you can use the sample code below to read from the materialized result in offline store:
+
+```python
+
+path = "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20/"
+res = get_result_df(client=client, format="parquet", res_url=path)
+```
+
+More reference on the APIs:
 
-([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
-[HdfsSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSink))
+- [MaterializationSettings API](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
+- [HdfsSink API](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSource)
diff --git a/docs/how-to-guides/azure_resource_provision.json b/docs/how-to-guides/azure_resource_provision.json
@@ -40,6 +40,52 @@
         "description": "Whether or not to deploy eventhub provision script"
       }
     },
+    "databaseServerName": {
+      "type": "string",
+      "defaultValue": "[concat('server-', uniqueString(resourceGroup().id, deployment().name))]",
+      "metadata": {
+        "description": "Specifies the name for the SQL server"
+      }
+    },
+    "databaseName": {
+      "type": "string",
+      "defaultValue": "[concat('db-', uniqueString(resourceGroup().id, deployment().name), '-1')]",
+      "metadata": {
+        "description": "Specifies the name for the SQL database under the SQL server"
+      }
+    },
+    "location": {
+      "type": "string",
+      "defaultValue": "[resourceGroup().location]",
+      "metadata": {
+        "description": "Specifies the location for server and database"
+      }
+    },
+    "adminUser": {
+      "type": "string",
+      "metadata": {
+        "description": "Specifies the username for admin"
+      }
+    },
+    "adminPassword": {
+      "type": "securestring",
+      "metadata": {
+        "description": "Specifies the password for admin"
+      }
+    },
+    "storageAccountKey": {
+      "type": "string",
+      "metadata": {
+        "description": "Specifies the key of the storage account where the BACPAC file is stored."
+      }
+    },
+    "bacpacUrl": {
+      "type": "string",
+      "defaultValue": "https://azurefeathrstorage.blob.core.windows.net/public/feathr-registry-schema.bacpac",
+      "metadata": {
+        "description": "This is the pre-created BACPAC file that contains required schemas by the registry server."
+      }
+    },
     "dockerImage": {
       "defaultValue": "blrchen/feathr-sql-registry",
       "type": "String",
@@ -393,6 +439,59 @@
         "principalId": "[parameters('principalId')]",
         "scope": "[resourceGroup().id]"
       }
+    },
+    {
+      "type": "Microsoft.Sql/servers",
+      "apiVersion": "2021-11-01-preview",
+      "name": "[parameters('databaseServerName')]",
+      "location": "[parameters('location')]",
+      "properties": {
+        "administratorLogin": "[parameters('adminUser')]",
+        "administratorLoginPassword": "[parameters('adminPassword')]",
+        "version": "12.0"
+      },
+      "resources": [
+        {
+          "type": "firewallrules",
+          "apiVersion": "2021-11-01-preview",
+          "name": "AllowAllAzureIps",
+          "location": "[parameters('location')]",
+          "dependsOn": [
+            "[parameters('databaseServerName')]"
+          ],
+          "properties": {
+            "startIpAddress": "0.0.0.0",
+            "endIpAddress": "0.0.0.0"
+          }
+        }
+      ]
+    },
+    {
+      "type": "Microsoft.Sql/servers/databases",
+      "apiVersion": "2021-11-01-preview",
+      "name": "[concat(string(parameters('databaseServerName')), '/', string(parameters('databaseName')))]",
+      "location": "[parameters('location')]",
+      "dependsOn": [
+        "[concat('Microsoft.Sql/servers/', parameters('databaseServerName'))]"
+      ],
+      "resources": [
+        {
+          "type": "extensions",
+          "apiVersion": "2021-11-01-preview",
+          "name": "Import",
+          "dependsOn": [
+            "[resourceId('Microsoft.Sql/servers/databases', parameters('databaseServerName'), parameters('databaseName'))]"
+          ],
+          "properties": {
+            "storageKeyType": "StorageAccessKey",
+            "storageKey": "[parameters('storageAccountKey')]",
+            "storageUri": "[parameters('bacpacUrl')]",
+            "administratorLogin": "[parameters('adminUser')]",
+            "administratorLoginPassword": "[parameters('adminPassword')]",
+            "operationMode": "Import"
+          }
+        }
+      ]
     }
   ],
   "outputs": {}

diff --git a/docs/how-to-guides/feathr-job-configuration.md b/docs/how-to-guides/feathr-job-configuration.md
@@ -6,7 +6,7 @@ parent: Feathr How-to Guides
 
 # Feathr Job Configuration
 
-Since Feathr uses Spark as the underlying execution engine, there's a way to override Spark configuration by `FeathrClient.get_offline_features()` with `execution_configuratons` parameters. The complete list of the available spark configuration is located in [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) (though not all of those are honored for cloud hosted Spark platforms such as Databricks), and there are a few Feathr specific ones that are documented here:
+Since Feathr uses Spark as the underlying execution engine, there's a way to override Spark configuration by `FeathrClient.get_offline_features()` with `execution_configurations` parameters. The complete list of the available spark configuration is located in [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) (though not all of those are honored for cloud hosted Spark platforms such as Databricks), and there are a few Feathr specific ones that are documented here:
 
 | Property Name             | Default | Meaning                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Since Version |
 | ------------------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |