Feature engineering in multiple contexts examples #94

Merged
merged 6 commits into from Mar 4, 2023
docs/how-tos/use-for-feature-engineering.rst (133 additions, 2 deletions)
Use Hamilton for Feature Engineering
==========================================

Hamilton's roots are in time-series offline feature engineering, but it can be used for any type of feature
engineering: offline, streaming, or online. All our examples are oriented towards Pandas, but rest assured, you can
use Hamilton with any Python objects, e.g. NumPy, Polars, and even PySpark.
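For instance, here's a minimal sketch (function and input names are illustrative) of feature functions written
against NumPy instead of Pandas:

.. code-block:: python

    import numpy as np

    def spend_mean(spend: np.ndarray) -> float:
        """Average spend; Hamilton exposes this as the ``spend_mean`` node."""
        return float(np.mean(spend))

    def spend_zero_mean(spend: np.ndarray, spend_mean: float) -> np.ndarray:
        """Spend with the mean subtracted; depends on ``spend_mean`` above."""
        return spend - spend_mean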

Here's a 20-minute video (`slides <https://github.com/skrawcz/talks/files/9759661/FS.Summit.2022.-.Hamilton.pdf>`__),
presented at the Feature Store Summit 2022, with a brief backstory on Hamilton and an overview (starting around the
8:52 mark) of how to use it for feature engineering:

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/b9tfdNZZ-nk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

Below, we present a high-level overview and then direct you to the examples folder for more details. We suggest
reading the Offline Feature Engineering section first, since it's the most common use case and it explains the
Python module structure you should aim for with Hamilton. If you need more guidance, please reach out to us on
`slack <https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg>`__.


Offline Feature Engineering
---------------------------
To use Hamilton for offline feature engineering, a common pattern is:

1. Create data loader module(s) that load the data from the source(s) (e.g. a database, a CSV file, etc.).
2. Create feature transform module(s) that transform the data into features.
3. Create dataset module(s) that combine the data loader and feature transform modules, if you want to connect
   fitting a model with Hamilton. Alternatively, define the data set in your driver code.

Here is a sketch of the above pattern:

.. code-block:: python

    # data_loader.py
    import pandas as pd
    from hamilton.function_modifiers import extract_columns

    @extract_columns(*...)  # you can choose to expose individual columns
    def load_data(...) -> pd.DataFrame:
        return pd.read_csv(...)
    ...

    # feature_transform.py
    import pandas as pd

    def feature_a(raw_input_a: pd.Series, ...) -> pd.Series:
        return raw_input_a + ...
    ...

    # dataset.py (optional)
    import pandas as pd

    def model_set_x(feature_a: pd.Series, ...) -> pd.DataFrame:
        return pd.DataFrame({'feature_a': feature_a, ...})

    # run.py
    from hamilton import driver
    import data_loader, feature_transform, dataset

    def main():
        dr = driver.Driver(config, data_loader, feature_transform, dataset)
        # request outputs by node (i.e. function) name:
        feature_df = dr.execute(['feature_a', ...])
        ...


Hamilton Example
__________________
We do not provide a specific example here, since most of the examples in the examples folder fall under this category.
Some examples to browse:

* `Hello World <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/hello_world>`__ shows the basics of how to
use Hamilton.
* `Data Quality <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/data_quality>`__ shows how to incorporate
runtime data quality checks into your feature engineering pipeline.
* `Time-series Kaggle Example <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/model_examples/time-series>`__
shows one way to structure your code to ingest, create features, and fit a model.
* `Feature engineering in multiple contexts <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/feature_engineering_multiple_contexts>`__
shows how you can use Hamilton in multiple contexts, reusing code where possible, e.g. offline & online.
* `PySpark UDF Map Examples <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/spark/pyspark_udfs>`__
shows how to use Hamilton to encode map operations for use with PySpark.


Streaming Feature Engineering
-----------------------------
Right now, there is no streaming-specific support. Instead, we model the problem as we would for offline. The
driver's ``execute()`` function takes an ``inputs=`` argument, so you can instantiate a Hamilton Driver once and
then call ``execute()`` multiple times with different inputs. Otherwise, you'd use a similar Python module structure
as for offline feature engineering -- perhaps just dropping the data loader module, since you would provide the
inputs directly to ``execute()``.

Here's a sketch of how you might use Hamilton in conjunction with a Kafka Client:

.. code-block:: python

    # run.py
    from hamilton import driver
    import feature_transform

    def main():
        kafka_client = KafkaClient(...)  # pseudo code, but you get the idea
        dr = driver.Driver(config, feature_transform)
        for batch in kafka_client.get_batches():
            # reuse the same driver; only the inputs change per batch
            feature_df = dr.execute(['feature_a', ...], inputs=batch.to_dict())
            # do something / emit back to kafka, etc.


**Caveats to think about.** Here are some things to think about when using Hamilton for streaming feature engineering:

- Aggregation features: decide whether you want to aggregate over the entire stream or just the current batch, or \
  instead load aggregate values that were computed offline.


Hamilton Example
__________________
We don't have a streaming example yet, but we are working on one. For now, we direct users to the online example,
since from a modularity standpoint things would be set up in a similar way.

Online Feature Engineering
--------------------------
Online feature engineering can be quite simple or quite complex, depending on your situation. The good news is that
Hamilton should be able to help you in either case. Hamilton's modularity lets you easily swap out implementations
of features, override values, and even ask the Driver what raw inputs are required to compute the features you want.
We think Hamilton can help you keep things simple, and then extend to handle more complex situations.

The basic structure of your Python modules does not change. Depending on whether you want Hamilton to load data from
a feature store, or whether all the data is passed in, you just need to segment your feature transforms into modules
appropriately, or use the ``@config.*`` decorator to segment your feature computation dataflow and give yourself the
flexibility you need.
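For example, here's a minimal sketch (module, function, and config names are illustrative) of using ``@config.when``
to swap how a raw input is sourced. Both functions implement the same ``age`` node; which one is used is decided by
the ``mode`` value in the Driver config:

.. code-block:: python

    # features.py
    import pandas as pd
    from hamilton.function_modifiers import config

    @config.when(mode="offline")
    def age__offline(raw_df: pd.DataFrame) -> pd.Series:
        # batch: pull the column from the loaded dataframe
        return raw_df["age"]

    @config.when(mode="online")
    def age__online(request_age: int) -> pd.Series:
        # online: build the column from the request payload
        return pd.Series([request_age])

    # run.py / web service:
    # dr = driver.Driver({"mode": "online"}, features)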

**Caveats to think about.** Here are some things to think about when using Hamilton for online feature engineering:

- Aggregation features: you'll most likely want to load aggregate feature values that were computed offline, rather \
  than compute them live.

Beyond the brief sketch above, we invite you to look at the examples below for the full structure.

Hamilton Example
__________________
We direct users to look at `Feature engineering in multiple contexts <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/feature_engineering_multiple_contexts>`__,
which currently describes two scenarios for incorporating Hamilton into an online web service while keeping it
aligned with your batch offline processes. Note, these examples are meant to give you a high-level, first-principles
view of how to do things. Since running something in production is very context dependent, we didn't want to get too
specific.


FAQ
----

Q. Can I use Hamilton for feature engineering with Feast?
__________________________________________________________
Yes, you can use Hamilton with Feast. Typically, people use Hamilton on the offline side to compute features that
then get pushed to Feast. For the online side, how best to integrate the two varies with your setup.
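As a rough sketch of the offline side (paths, output names, and timestamps are illustrative, and we assume you've
defined a Feast ``FileSource`` pointing at the parquet file):

.. code-block:: python

    # compute features with Hamilton, then land them where Feast can see them
    from hamilton import driver
    import data_loader, feature_transform  # your Hamilton modules

    dr = driver.Driver({}, data_loader, feature_transform)
    feature_df = dr.execute(["entity_id", "event_timestamp", "feature_a", "feature_b"])
    feature_df.to_parquet("features.parquet")  # a Feast FileSource reads this

    # then load the values into Feast's online store, e.g. via the CLI:
    #   feast materialize 2023-01-01T00:00:00 2023-02-01T00:00:00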
examples/feature_engineering_multiple_contexts/README.md (138 additions)
# Feature Engineering in Multiple Contexts

What is feature engineering? It's the process of transforming data for input to a "model".

To make models better, it's common to perform and try a lot of "transforms". This is where Hamilton comes in.
Hamilton allows you to:
* write different transformations in a straightforward and formulaic manner
* keep them managed and versioned with computational lineage (if using something like git)
* test and document them easily

which allows you to sanely iterate, maintain, and determine what works best for your modeling domain.

In this series of examples, we'll skip talking about the benefits of Hamilton and instead focus on how to use it
for feature engineering. But first, some context on the challenges you're likely to face with feature engineering
in general.

# What is hard about feature engineering?
There are certain dimensions that make feature engineering hard:

1. Code: Organizing and maintaining code for reuse/collaboration/discoverability.
2. Lineage: Keeping track of what data is being used for what purpose.
3. Deployment: Offline vs online vs streaming needs.

## Code: Organizing and maintaining code for reuse/collaboration/discoverability.
> Individuals build features, but teams own them.

Have you ever dreaded taking over someone else's code? This is a common problem with feature engineering!

Why? The code for feature engineering is often spread out across many files (e.g. scripts, notebooks, libraries),
created by many individuals, and written in many styles. This makes it hard to reuse code, collaborate, and discover
what code is available, and therefore to maintain what is actually being used in "production" and what is not.

## Lineage: Keeping track of what data is being used for what purpose
With the growth of data teams, along with data governance & privacy regulations, knowing what data is being used,
and for what purpose, is something the business needs to be able to answer easily. A "modeler" often isn't the
stakeholder who needs this visibility; they just want to build models. But these concerns frequently land on their
plate to address, which slows down their ability to build and ship features, and thus models.

Not having lineage or visibility into what data is being used for what purpose can lead to a lot of problems:
- teams break data assumptions without knowing it, e.g. upstream team stops updating data used downstream.
- teams are not aware of what data is available to them, e.g. duplication of data & effort.
- teams have to spend time figuring out what data is being used for what purpose, e.g. to audit models.
- teams struggle to debug inherited feature workflows, e.g. to fix bugs or add new features.


## Deployment: Offline vs online vs streaming needs
This is a big topic. We won't do it justice here, but let's give a brief overview of two main problems:

(1) There are a lot of different deployment needs when you get something to production. For example, you might want to:
- run a batch job to generate features for a model
- hit a web service that makes predictions in real time, with features computed on the fly or retrieved from a cache (e.g. a feature store)
- run a streaming job to generate features for a model in real time
- or support all three, or some subset, of the above ways of deploying features.

So the challenge is: how do you design your processes to take into account your deployment needs?

(2) Implement features once, twice, or thrice? To enable (1), you need to ask yourself: can we share feature code, or
do we need to reimplement it for every system that we want to use it in?

With (1) and (2) in mind, you can see that there are a lot of different dimensions to consider when designing your
feature engineering processes. They have to connect with each other, and be flexible enough to support your specific
deployment needs.

# Using Hamilton for Feature Engineering for Batch/Offline
If you **only** need to deploy features for batch jobs, then stop right there. You don't need these examples,
since they focus on bridging the gap between "offline" and "online" feature engineering. You should instead
browse the other examples, like `data_quality`.

# Using Hamilton for Feature Engineering for Batch/Offline and Online/Streaming
The example scenarios here are for people who have to deal with both batch and online feature engineering.

We provide two examples for two common scenarios that occur if you have this need. Note, the example code in these
scenarios is meant to illustrate how to think about and frame your use of Hamilton. It contains minimal features so
as not to overwhelm you, and leaves out implementation details that you would need to fill in for your specific use
case, e.g. fitting a model using the features, or deciding where to store aggregate feature values.

## Scenario Context
A not-uncommon task is needing to do feature engineering in an offline setting (e.g. batch via Airflow)
as well as an online setting (e.g. a synchronous request via FastAPI). What commonly
happens is that the code for features is not shared, resulting in two implementations
that breed subtle bugs and hard-to-maintain code.

With this example series, we show how you can use Hamilton to:

1. write a feature once (scenarios 1 and 2).
2. leverage that feature code anywhere that Python runs, e.g. in batch and online (scenarios 1 and 2).
3. modularize components, so that if you have values cached in a feature store, you can inject those values into
   your feature computation (scenario 2).

The task that we're modeling here isn't that important, but if you must know: we're trying to predict the number of
hours of absence that an employee will have, given some information about them. This is based on the `data_quality`
example, which is in turn based on the [Metaflow+Hamilton example](https://outerbounds.com/blog/developing-scalable-feature-engineering-dags/),
where Hamilton was used for the feature engineering process; in that example, only offline feature engineering was modeled.

Assumptions we're using:
1. You have a fixed set of features that you want to compute for a model that you have determined as being useful a priori.
2. We are agnostic of the actual model -- and skip any details of that in the examples.
3. We use Pandas as the data structure in our example here because it's easy to reuse in batch and online contexts.
However, you need not use Pandas if you don't want to.

Let's explain the context of the two scenarios a bit more.

## Scenario 1: the simple case - ETL + Online API
In this scenario, we assume that at prediction time we can get the same raw inputs that would be provided at training time.

This is a straightforward process if all your feature transforms are [map operations](https://en.wikipedia.org/wiki/Map_(higher-order_function)).
If, however, some of your transforms are aggregations, then you need to be careful about how you connect your offline
ETL with your online service.

In this example, there are two features, `age_mean` and `age_std_dev`, that we avoid recomputing in the online setting.
Instead, we "store" their values when we compute features in the offline ETL, and then use those "stored" values at
prediction time so that the same feature computation happens.
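Here's a rough sketch of how that can look with Hamilton (the function names are illustrative, and we assume you
persist the aggregates somewhere at ETL time). The driver's `overrides=` argument substitutes stored values for
nodes instead of recomputing them:

```python
import pandas as pd
from hamilton import driver
import features  # hypothetical module defining age_mean, age_std_dev, and features that use them

dr = driver.Driver({}, features)

# offline ETL: compute the aggregates alongside the features, then persist them
train_df = dr.execute(
    ["age_mean", "age_std_dev", "age_zero_mean_unit_variance"],
    inputs={"age": pd.Series([25, 32, 41])},
)
# ... save the age_mean / age_std_dev values to a store of your choosing ...

# online: load the stored aggregates and override those nodes instead of recomputing them
stored = {"age_mean": 32.7, "age_std_dev": 6.5}  # loaded from your store
online_df = dr.execute(
    ["age_zero_mean_unit_variance"],
    inputs={"age": pd.Series([31])},
    overrides=stored,  # Hamilton skips computing overridden nodes
)
```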

## Scenario 2: the more complex case - request doesn't have all the raw data - ETL + Online API
In this scenario, we assume we are not passed all the data in the request, but instead need to fetch some of it
ourselves as part of handling the online API request.

We will pretend to hit a feature store that provides the required data to compute the features for
input to the model. This example shows one way to modularize your Hamilton code so that you can swap out the "source"
of the data. To simplify the example, we assume that we can get all the input data we need from the feature store,
rather than it also coming in via the request. Note: if you use a feature store, which is effectively a cache, you
might not need Hamilton on the online side, if, and only if, you can get all the data you need from the feature store
without needing to perform any computation. In that situation, you would push computed features to the feature store
from the offline ETL process that creates them.
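One way to modularize this (module, function, and client names below are illustrative): define the same node names
in two different "source" modules, and choose which module to build the Driver with based on the context:

```python
# offline_loader.py -- batch: read raw inputs from disk / a warehouse
import pandas as pd

def age(raw_df: pd.DataFrame) -> pd.Series:
    return raw_df["age"]


# online_loader.py -- online: fetch the same named inputs from the feature store
from typing import Any
import pandas as pd

def age(feature_store_client: Any, entity_id: str) -> pd.Series:
    # hypothetical feature store client
    return pd.Series([feature_store_client.get(entity_id, "age")])


# run.py / app startup: swap the source module; the feature transform module is reused as-is
# dr = driver.Driver({}, offline_loader, feature_transforms)  # batch ETL
# dr = driver.Driver({}, online_loader, feature_transforms)   # web service
```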

A good exercise would be to make note of the differences with this scenario (2) and scenario (1) in how they structure
the code with Hamilton.

# What's next?
Jump into each directory and read the README, it'll explain how the example is set up and how things should work.

# What are extensions/uses not shown here but we know you can do them
Here are two ideas that come to mind:

1. Streaming settings. Given the examples, it should be clear how to make it possible to use Hamilton in a streaming setting.
2. Asking Hamilton what features are needed as input, to know what to request from the feature store. With tags, and by
querying the DAG at the start of the app, you could dynamically ask Hamilton what's required and then only go to the
feature store for that data (see the sketch below). If this type of example would be of interest, let us know.
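For instance, here's a rough sketch of idea (2), assuming you tag the functions whose values come from the feature
store (the tag name, module, and client are illustrative):

```python
# features.py
from typing import Any
import pandas as pd
from hamilton.function_modifiers import tag

@tag(source="feature_store")
def age(feature_store_client: Any, entity_id: str) -> pd.Series:
    # hypothetical feature store client
    return pd.Series([feature_store_client.get(entity_id, "age")])


# app startup: introspect the DAG to find everything sourced from the feature store
from hamilton import driver
import features

dr = driver.Driver({}, features)
feature_store_fields = [
    var.name
    for var in dr.list_available_variables()
    if var.tags.get("source") == "feature_store"
]
# now only request `feature_store_fields` from the feature store
```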