
Documentation update & fix #403

Merged
merged 11 commits into from
Jul 9, 2022
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ Feathr is the feature store that has been used in production at LinkedIn for many years
Feathr lets you:

- **Define features** based on raw data sources (batch and streaming) using pythonic APIs.
- **Register and get features by names** during model training and model inference.
- **Share features** across your team and company.

Feathr automatically computes your feature values and joins them to your training data, using point-in-time-correct semantics to avoid data leakage, and supports materializing and deploying your features for use online in production.
@@ -151,7 +151,7 @@ Follow the [quick start Jupyter Notebook](./feathr_project/feathrcli/data/feathr

## 🚀 Roadmap

For a complete roadmap with estimated dates, please [visit this page](https://github.com/linkedin/feathr/milestones?direction=asc&sort=title&state=open).

- [x] Private Preview release
- [x] Public Preview release
2 changes: 1 addition & 1 deletion docs/concepts/feature-definition.md
@@ -110,7 +110,7 @@ Note that the `agg_func`([API doc](https://feathr.readthedocs.io/en/latest/feath
| Aggregation Type | Input Type | Description |
| --- | --- | --- |
|SUM, COUNT, MAX, MIN, AVG |Numeric|Applies the numerical operation on the numeric inputs. |
|MAX_POOLING, MIN_POOLING, AVG_POOLING | Numeric Vector | Applies the max/min/avg operation on a per-entry basis for a given collection of numbers.|
|LATEST| Any |Returns the latest non-null values within the defined time window |


6 changes: 2 additions & 4 deletions docs/concepts/feature-generation.md
@@ -48,16 +48,13 @@ client.materialize_features(settings)

Note that if you don't have features available as of `now`, you should specify a `BackfillTime` range in which features are available.

Also, for performance reasons, Feathr will submit one materialization job per step. For example, with `BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))`, Feathr will submit 20 jobs to run in parallel for maximum performance.

More reference on the APIs:

- [BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime)
- [client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features)
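
The job count is simply the number of `step` intervals covering the backfill range, endpoints inclusive. A minimal sketch in plain Python (datetime arithmetic only, not the Feathr API — the helper name is hypothetical):

```python
from datetime import datetime, timedelta

def backfill_steps(start: datetime, end: datetime, step: timedelta):
    """Enumerate the cutoff times a backfill would materialize, one per step."""
    steps = []
    current = start
    while current <= end:
        steps.append(current)
        current += step
    return steps

steps = backfill_steps(datetime(2022, 2, 1), datetime(2022, 2, 20), timedelta(days=1))
print(len(steps))  # 20 materialization jobs, Feb 1 through Feb 20 inclusive
```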

## Consuming features in online environment

After the materialization job finishes, we can get online features by querying the `feature table` with the corresponding `entity key` and a list of `feature names`. In the example below, we query the online features `f_location_avg_fare` and `f_location_max_fare` with the key `265` (the location ID).
@@ -67,6 +64,7 @@
```python
res = client.get_online_features('nycTaxiDemoFeature', '265', ['f_location_avg_f
```

More reference on the APIs:

- [client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features)

## Materializing Features to Offline Store
61 changes: 31 additions & 30 deletions docs/concepts/feature-join.md
@@ -8,47 +8,48 @@ parent: Feathr Concepts

## Intuitions of Frame Join

The observation dataset has 2 records, as shown below. We want to use it as the 'spine' dataset and join two features onto it:

1. Feature `page_view_count` from dataset `page_view_data`

2. Feature `like_count` from dataset `like_count_data`

In this case, the Feathr feature join will use the field `id` as the join key of the observation data, and will also consider the timestamp of each row during the join, making sure the joined feature values were collected **before** the `observe_time` of each row.

| id | observe_time | Label |
| --- | ------------ | ----- |
| 1 | 2022-01-01 | Yes |
| 1 | 2022-01-02 | Yes |
| 2 | 2022-01-02 | No |

Dataset `page_view_data` contains `page_view_count` of each user at a given time:

| UserId | log_time | page_view_count |
| ------ | ---------- | --------------- |
| 1 | 2022-01-01 | 101 |
| 1 | 2022-01-02 | 102 |
| 1 | 2022-01-03 | 103 |
| 2 | 2022-01-02 | 200 |
| 3 | 2022-01-02 | 300 |

Dataset `like_count_data` contains `like_count` of each user at a given time:

| UserId | updated_time | `like_count` |
| ------ | ------------ | ------------ |
| 1 | 2022-01-01 | 11 |
| 1 | 2022-01-02 | 12 |
| 1 | 2022-01-03 | 13 |
| 2 | 2022-01-02 | 20 |
| 3 | 2022-01-02 | 30 |

The expected joined output, a.k.a. the training dataset, would be:

| id | observe_time | Label | f_page_view_count | f_like_count |
| --- | ------------ | ----- | ----------------- | ------------ |
| 1 | 2022-01-01 | Yes | 101 | 11 |
| 1 | 2022-01-02 | Yes | 102 | 12 |
| 2 | 2022-01-02 | No | 200 | 20 |

Note: In the above example, features `f_page_view_count` and `f_like_count` are defined simply as references to the fields `page_view_count` and `like_count` respectively. Timestamps in these 3 datasets are taken into account automatically.
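
The intuition above can be sketched with pandas `merge_asof`, which performs a backward as-of join per entity — similar in spirit to Feathr's point-in-time-correct join, though this is only an illustration, not how Feathr implements it:

```python
import pandas as pd

# Observation ('spine') records and the page_view_data feature dataset
# from the tables above (like_count_data would be joined the same way).
obs = pd.DataFrame({
    "id": [1, 1, 2],
    "observe_time": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-02"]),
    "Label": ["Yes", "Yes", "No"],
})
page_view_data = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 3],
    "log_time": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-02", "2022-01-02"]
    ),
    "page_view_count": [101, 102, 103, 200, 300],
})

# direction="backward" picks, per entity, the latest feature value with a
# timestamp at or before observe_time -- no future values can leak in.
joined = pd.merge_asof(
    obs.sort_values("observe_time"),
    page_view_data.sort_values("log_time"),
    left_on="observe_time",
    right_on="log_time",
    left_by="id",
    right_by="UserId",
    direction="backward",
)
print(joined[["id", "observe_time", "Label", "page_view_count"]])
```

Note how the `2022-01-03` feature rows are never picked up: they lie after every observation time.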

@@ -83,8 +84,8 @@ The path of a dataset as the 'spine' for the to-be-created training dataset. We
2. A column representing the event time of the row. By default, Feathr will make sure the feature values joined have a timestamp earlier than it, ensuring no data leakage in the resulting training dataset.

3. Other columns will simply be passed through to the output training dataset.
The key fields from the observation data, which are used to join with the feature data.
List of feature names to be joined with the observation data. They must be pre-defined in the Python APIs.

The time information of the observation data used to compare with the feature's timestamp during the join.

5 changes: 1 addition & 4 deletions docs/concepts/point-in-time-join.md
@@ -18,10 +18,7 @@ The model will perform better during training (usually), but it will not perform

Point-in-time correctness ensures that no future data is used for training.

Point-in-time correctness can be achieved via two approaches. If your observation data has a global timestamp for all observation events, you can simply time-travel your feature dataset back to that timestamp. If your observation data has a different timestamp for each observation event, you need a point-in-time join for each event. The first approach is easier to implement but has more restrictions (the global timestamp). The second approach provides better flexibility and wastes no feature data. Feathr uses the second approach and can scale to large datasets.
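
The first approach (time travel with a global timestamp) can be sketched as a simple snapshot over a hypothetical feature dataset — an illustration only, not Feathr's implementation:

```python
import pandas as pd

# Hypothetical feature dataset: one page_view_count per (UserId, log_time).
features = pd.DataFrame({
    "UserId": [1, 1, 1, 2],
    "log_time": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-02"]
    ),
    "page_view_count": [101, 102, 103, 200],
})

def time_travel(df: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Snapshot the feature dataset as of one global timestamp:
    drop future rows, then keep the latest remaining value per entity."""
    snapshot = df[df["log_time"] <= pd.Timestamp(as_of)]
    return snapshot.sort_values("log_time").groupby("UserId", as_index=False).last()

print(time_travel(features, "2022-01-02"))
```

Because a single cutoff is applied to every entity, this only works when all observation events share that one timestamp; per-event timestamps require the point-in-time join instead.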

## Point-in-time Feature Lookup in Feathr

2 changes: 1 addition & 1 deletion docs/how-to-guides/azure-deployment.md
@@ -208,7 +208,7 @@ external_ip=$(curl -s http://whatismyip.akamai.com/)
echo "External IP is: ${external_ip}. Adding it to firewall rules"
az synapse workspace firewall-rule create --name allowAll --workspace-name $synapse_workspace_name --resource-group $resoruce_group_name --start-ip-address "$external_ip" --end-ip-address "$external_ip"

# sleep for a few seconds for the change to take effect
sleep 2
az synapse role assignment create --workspace-name $synapse_workspace_name --role "Synapse Contributor" --assignee $service_principal_name

4 changes: 2 additions & 2 deletions docs/how-to-guides/expression-language.md
@@ -18,7 +18,7 @@ If the feature transformation can't be accomplished with a short line of express

# Usage Guide

Your data transformation can be composed of one or a few smaller tasks. Divide and conquer! For each individual task, check the following sections on how to achieve them, then combine them. For example, we have a trip mileage column, but it's in string form. We want to check whether it's a long trip (> 30 miles), so we need to cast it to double and then compare with 30: `cast_double(mile_column) > 30`.
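
For intuition, here is what `cast_double(mile_column) > 30` computes, sketched in plain Python (the sample mileage values are hypothetical):

```python
# Hypothetical trip mileage values, stored as strings in the raw data.
mile_column = ["12.5", "45.0", "30.0"]

# cast to double, then compare with 30 -- the same two steps the
# expression `cast_double(mile_column) > 30` combines.
is_long_trip = [float(m) > 30 for m in mile_column]
print(is_long_trip)  # [False, True, False]
```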

## Field accessing

@@ -48,7 +48,7 @@ You can concatenate strings with `concat(str1, str2)`. For example, `concat("app

## Arithmetic Operations

For data of numeric types, you can use arithmetic operators to perform operations. Here are the supported operators: `+,-,*,/`

## Logical Operators
