Skip to content

Commit

Permalink
merged main branch
Browse files Browse the repository at this point in the history
  • Loading branch information
shivamsanju committed Jun 26, 2022
2 parents 8b582d3 + b5033c2 commit b2ee907
Show file tree
Hide file tree
Showing 12 changed files with 221 additions and 87 deletions.
17 changes: 8 additions & 9 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,6 @@ Feathr automatically computes your feature values and joins them to your trainin
- **Native cloud integration** with simplified and scalable architecture, which is illustrated in the next section.
- **Feature sharing and reuse made easy:** Feathr has built-in feature registry so that features can be easily shared across different teams and boost team productivity.


## Running Feathr on Azure with 3 Simple Steps

Feathr has native cloud integration. To use Feathr on Azure, you only need three steps:
Expand All @@ -50,7 +49,7 @@ Feathr has native cloud integration. To use Feathr on Azure, you only need three
If you are not using the above Jupyter Notebook and want to install Feathr client locally, use this:

```bash
pip install -U feathr
pip install feathr
```

Or use the latest code from GitHub:
Expand Down Expand Up @@ -126,31 +125,30 @@ Read the [Streaming Source Ingestion Guide](https://linkedin.github.io/feathr/ho

Read [Point-in-time Correctness and Point-in-time Join in Feathr](https://linkedin.github.io/feathr/concepts/point-in-time-join.html) for more details.


## Running Feathr Examples

Follow the [quick start Jupyter Notebook](./feathr_project/feathrcli/data/feathr_user_workspace/product_recommendation_demo.ipynb) to try it out. There is also a companion [quick start guide](https://linkedin.github.io/feathr/quickstart.html) containing a bit more explanation on the notebook.

Follow the [quick start Jupyter Notebook](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/product_recommendation_demo.ipynb) to try it out.
There is also a companion [quick start guide](https://linkedin.github.io/feathr/quickstart_synapse.html) containing a bit more explanation on the notebook.

## Cloud Architecture

Feathr has native integration with Azure and other cloud services, and here's the high-level architecture to help you get started.
![Architecture](images/architecture.png)

# Next Steps
## Next Steps

## Quickstart
### Quickstart

- [Quickstart for Azure Synapse](quickstart_synapse.md)

## Concepts
### Concepts

- [Feature Definition](concepts/feature-definition.md)
- [Feature Generation](concepts/feature-generation.md)
- [Feature Join](concepts/feature-join.md)
- [Point-in-time Correctness](concepts/point-in-time-join.md)

## How-to-guides
### How-to-guides

- [Azure Deployment](how-to-guides/azure-deployment.md)
- [Local Feature Testing](how-to-guides/local-feature-testing.md)
Expand All @@ -159,4 +157,5 @@ Feathr has native integration with Azure and other cloud services, and here's th
- [Feathr Job Configuration](how-to-guides/feathr-job-configuration.md)

## API Documentation

- [Python API Documentation](https://feathr.readthedocs.io/en/latest/)
96 changes: 66 additions & 30 deletions docs/concepts/feature-generation.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,19 @@
---
layout: default
title: Feathr Feature Generation
title: Feature Generation and Materialization
parent: Feathr Concepts
---

# Feature Generation
Feature generation is the process to create features from raw source data into a certain persisted storage.
# Feature Generation and Materialization

User could utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused(usually in offline setting). Feature generation is also useful in generating embedding features. Embedding distill information from large data and it is usually more compact.
Feature generation (also known as feature materialization) is the process to create features from raw source data into a certain persisted storage in either offline store (for further reuse), or online store (for online inference).

User can utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in offline setting). Feature generation is also useful in generating embedding features, where those embeddings distill information from large data and is usually more compact.

## Generating Features to Online Store
When we need to serve the models online, we also need to serve the features online. We provide APIs to generate features to online storage for future consumption. For example:

When the models are served in an online environment, we also need to serve the corresponding features in the same online environment as well. Feathr provides APIs to generate features to online storage for future consumption. For example:

```python
client = FeathrClient()
redisSink = RedisSink(table_name="nycTaxiDemoFeature")
Expand All @@ -21,12 +24,16 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
client.materialize_features(settings)
```

([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
[RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink)
More reference on the APIs:

- [MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
- [RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink)

In the above example, we define a Redis table called `nycTaxiDemoFeature` and materialize two features called `f_location_avg_fare` and `f_location_max_fare` to Redis.

It is also possible to backfill the features for a previous time range, like below. If the `BackfillTime` part is not specified, it's by default to `now()` (i.e. if not specified, it's equivilant to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).
## Feature Backfill

It is also possible to backfill the features for a particular time range, like below. If the `BackfillTime` part is not specified, it's by default to `now()` (i.e. if not specified, it's equivalent to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).

```python
client = FeathrClient()
Expand All @@ -39,29 +46,34 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
client.materialize_features(settings)
```

([BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime),
[client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features))
Note that if you don't have features available in `now`, you'd better specify a `BackfillTime` range where you have features.

## Consuming the online features
Also, Feathr will submit a materialization job for each of the step for performance reasons. I.e. if you have
`BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))`, Feathr will submit 20 jobs to run in parallel for maximum performance.

```python
client.wait_job_to_finish(timeout_sec=600)
More reference on the APIs:

- [BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime)
- [client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features)

res = client.get_online_features('nycTaxiDemoFeature', '265', [
'f_location_avg_fare', 'f_location_max_fare'])


## Consuming features in online environment

After the materialization job is finished, we can get the online features by querying the `feature table`, corresponding `entity key` and a list of `feature names`. In the example below, we query the online features called `f_location_avg_fare` and `f_location_max_fare`, and query with a key `265` (which is the location ID).

```python
res = client.get_online_features('nycTaxiDemoFeature', '265', ['f_location_avg_fare', 'f_location_max_fare'])
```

([client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features))
More reference on the APIs:
- [client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features)

After we finish running the materialization job, we can get the online features by querying the feature name, with the
corresponding keys. In the example above, we query the online features called `f_location_avg_fare` and
`f_location_max_fare`, and query with a key `265` (which is the location ID).
## Materializing Features to Offline Store

## Generating Features to Offline Store
This is useful when the feature transformation is compute intensive and features can be re-used. For example, you have a feature that needs more than 24 hours to compute and the feature can be reused by more than one model training pipeline. In this case, you should consider generating features to offline.

This is a useful when the feature transformation is computation intensive and features can be re-used. For example, you
have a feature that needs more than 24 hours to compute and the feature can be reused by more than one model training
pipeline. In this case, you should consider generate features to offline. Here is an API example:
The API call is very similar to materializing features to online store, and here is an API example:

```python
client = FeathrClient()
Expand All @@ -73,14 +85,14 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
client.materialize_features(settings)
```

This will generate features on latest date(assuming it's `2022/05/21`) and output data to the following path:
This will generate features on latest date(assuming it's `2022/05/21`) and output data to the following path:
`abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2022/05/21`

You can also specify a BackfillTime so the features will be generated for those dates. For example:
You can also specify a `BackfillTime` so the features will be generated only for those dates. For example:

```Python
backfill_time = BackfillTime(start=datetime(
2020, 5, 20), end=datetime(2020, 5, 20), step=timedelta(days=1))
2020, 5, 10), end=datetime(2020, 5, 20), step=timedelta(days=1))
offline_sink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/")
settings = MaterializationSettings("nycTaxiTable",
sinks=[offline_sink],
Expand All @@ -89,8 +101,32 @@ settings = MaterializationSettings("nycTaxiTable",
backfill_time=backfill_time)
```

This will generate features only for 2020/05/20 for me and it will be in folder:
`abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`
This will generate features from `2020/05/10` to `2020/05/20` and the output will have 11 folders, from
`abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/10` to `abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`. Note that currently Feathr only supports materializing data in daily step (i.e. even if you specify an hourly step, the generated features in offline store will still be presented in a daily hierarchy).

You can also specify the format of the materialized features in the offline store by using `execution_configurations` like below. Please refer to the [documentation](../how-to-guides/feathr-job-configuration.md) here for those configuration details.

```python

from feathr import HdfsSink
offlineSink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_data/")
# Materialize two features into a Offline store.
settings = MaterializationSettings("nycTaxiMaterializationJob",
sinks=[offlineSink],
feature_names=["f_location_avg_fare", "f_location_max_fare"])
client.materialize_features(settings, execution_configurations={ "spark.feathr.outputFormat": "parquet"})

```

For reading those materialized features, Feathr has a convenient helper function called `get_result_df` to help you view the data. For example, you can use the sample code below to read from the materialized result in offline store:

```python

path = "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20/"
res = get_result_df(client=client, format="parquet", res_url=path)
```

More reference on the APIs:

([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
[HdfsSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSink))
- [MaterializationSettings API](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
- [HdfsSink API](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSource)
99 changes: 99 additions & 0 deletions docs/how-to-guides/azure_resource_provision.json
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,52 @@
"description": "Whether or not to deploy eventhub provision script"
}
},
"databaseServerName": {
"type": "string",
"defaultValue": "[concat('server-', uniqueString(resourceGroup().id, deployment().name))]",
"metadata": {
"description": "Specifies the name for the SQL server"
}
},
"databaseName": {
"type": "string",
"defaultValue": "[concat('db-', uniqueString(resourceGroup().id, deployment().name), '-1')]",
"metadata": {
"description": "Specifies the name for the SQL database under the SQL server"
}
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Specifies the location for server and database"
}
},
"adminUser": {
"type": "string",
"metadata": {
"description": "Specifies the username for admin"
}
},
"adminPassword": {
"type": "securestring",
"metadata": {
"description": "Specifies the password for admin"
}
},
"storageAccountKey": {
"type": "string",
"metadata": {
"description": "Specifies the key of the storage account where the BACPAC file is stored."
}
},
"bacpacUrl": {
"type": "string",
"defaultValue": "https://azurefeathrstorage.blob.core.windows.net/public/feathr-registry-schema.bacpac",
"metadata": {
"description": "This is the pre-created BACPAC file that contains required schemas by the registry server."
}
},
"dockerImage": {
"defaultValue": "blrchen/feathr-sql-registry",
"type": "String",
Expand Down Expand Up @@ -393,6 +439,59 @@
"principalId": "[parameters('principalId')]",
"scope": "[resourceGroup().id]"
}
},
{
"type": "Microsoft.Sql/servers",
"apiVersion": "2021-11-01-preview",
"name": "[parameters('databaseServerName')]",
"location": "[parameters('location')]",
"properties": {
"administratorLogin": "[parameters('adminUser')]",
"administratorLoginPassword": "[parameters('adminPassword')]",
"version": "12.0"
},
"resources": [
{
"type": "firewallrules",
"apiVersion": "2021-11-01-preview",
"name": "AllowAllAzureIps",
"location": "[parameters('location')]",
"dependsOn": [
"[parameters('databaseServerName')]"
],
"properties": {
"startIpAddress": "0.0.0.0",
"endIpAddress": "0.0.0.0"
}
}
]
},
{
"type": "Microsoft.Sql/servers/databases",
"apiVersion": "2021-11-01-preview",
"name": "[concat(string(parameters('databaseServerName')), '/', string(parameters('databaseName')))]",
"location": "[parameters('location')]",
"dependsOn": [
"[concat('Microsoft.Sql/servers/', parameters('databaseServerName'))]"
],
"resources": [
{
"type": "extensions",
"apiVersion": "2021-11-01-preview",
"name": "Import",
"dependsOn": [
"[resourceId('Microsoft.Sql/servers/databases', parameters('databaseServerName'), parameters('databaseName'))]"
],
"properties": {
"storageKeyType": "StorageAccessKey",
"storageKey": "[parameters('storageAccountKey')]",
"storageUri": "[parameters('bacpacUrl')]",
"administratorLogin": "[parameters('adminUser')]",
"administratorLoginPassword": "[parameters('adminPassword')]",
"operationMode": "Import"
}
}
]
}
],
"outputs": {}
Expand Down
2 changes: 1 addition & 1 deletion docs/how-to-guides/feathr-job-configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ parent: Feathr How-to Guides

# Feathr Job Configuration

Since Feathr uses Spark as the underlying execution engine, there's a way to override Spark configuration by `FeathrClient.get_offline_features()` with `execution_configuratons` parameters. The complete list of the available spark configuration is located in [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) (though not all of those are honored for cloud hosted Spark platforms such as Databricks), and there are a few Feathr specific ones that are documented here:
Since Feathr uses Spark as the underlying execution engine, there's a way to override Spark configuration by `FeathrClient.get_offline_features()` with `execution_configurations` parameters. The complete list of the available spark configuration is located in [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) (though not all of those are honored for cloud hosted Spark platforms such as Databricks), and there are a few Feathr specific ones that are documented here:

| Property Name | Default | Meaning | Since Version |
| ------------------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
Expand Down
Loading

0 comments on commit b2ee907

Please sign in to comment.