Feature engineering in multiple contexts examples #94

Merged
merged 6 commits into main from feature-engineering-examples
Mar 4, 2023

Conversation

skrawcz
Collaborator

@skrawcz skrawcz commented Mar 4, 2023

Adds examples of feature engineering in multiple contexts.

This adds two scenarios:

  1. one where there is no feature store, and you want to recompute features in two places.
  2. one where there is a feature store, however the feature store only stores "raw" data -- so you want to use that
    as the place to get data from to compute features for input to the model.

In the process, this fixes #93, as this issue was found while building out this example.

Changes

  1. adds examples
  2. makes extract* work with async functions.

How I tested this

  1. Runs locally.

Notes

This isn't an exhaustive set of examples on this topic and how you could use Hamilton. For example this omits talking about:

  1. Streaming settings. Though arguably reusing Hamilton functions would be simpler in that context.
  2. How to ask Hamilton what features are needed as input to know what to request from the feature store. With tags, and querying the DAG at the start of the app, you could dynamically ask Hamilton what's required and then only go to the
    feature store for that data. But I thought that might be a little too complex, so I leave it on the TODO list.
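As a rough illustration of the idea in item 2, the sketch below walks function parameter names to find which raw inputs a requested feature depends on. It is plain Python, not Hamilton's API: the `required_inputs` helper and the feature functions (`is_senior`, `age`) are hypothetical names invented for this example, and it only assumes that features are functions whose parameters name their dependencies.

```python
import inspect


# Hypothetical feature functions: parameter names refer to upstream nodes.
def age(birth_year: int, current_year: int) -> int:
    return current_year - birth_year


def is_senior(age: int) -> bool:
    return age >= 65


def required_inputs(targets, functions):
    """Return the names that no function produces, i.e. the raw inputs to fetch."""
    needed, seen, stack = set(), set(), list(targets)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        fn = functions.get(name)
        if fn is None:
            # Nothing computes this name, so it must come from outside
            # (for example, from the feature store).
            needed.add(name)
        else:
            stack.extend(inspect.signature(fn).parameters)
    return needed


functions = {"age": age, "is_senior": is_senior}
raw_columns = required_inputs(["is_senior"], functions)
```

With a set like `raw_columns` computed once at application start, the app could request only those fields from the feature store.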

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@skrawcz skrawcz marked this pull request as ready for review March 4, 2023 05:28
@skrawcz skrawcz requested a review from elijahbenizzy March 4, 2023 05:28
@elijahbenizzy elijahbenizzy mentioned this pull request Mar 4, 2023
Collaborator

@elijahbenizzy elijahbenizzy left a comment


The first question everyone is going to be asking is "is it slow to use dataframes when it's a single item?", and the answer is "Not even close to as slow as that API call we're making".

So, let's add:

  1. A section on why we're using pd.Series/pd.DataFrame and how that helps us
  2. Alternatives (see idea at @map decorator #95)

skrawcz added 5 commits March 4, 2023 13:08
Assumptions:
 - the API request can provide the same raw data that training provides.
 - if you have aggregation features, you need to store the training result
for them, and provide them to the online side.

This example shows how one might use Hamilton to compute
features in an offline and online fashion. The assumption here
is that the request passed into the API has all the raw data
required to compute features.

This example also shows how one might "override" some values
that are required for computing features, in this example they
are `age_mean` and `age_std_dev`. This can be required when
computing aggregation features does not make sense at
inference time.
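
A minimal sketch of the override idea, assuming plain Python lists and the `statistics` module instead of Hamilton and pandas so that it stays self-contained. The names `age_mean` and `age_std_dev` come from the text above; `age_zero_mean_unit_variance`, `stored`, and the sample data are hypothetical.

```python
import statistics


def age_mean(age):
    # Aggregation feature: only meaningful over the training set.
    return statistics.fmean(age)


def age_std_dev(age):
    return statistics.pstdev(age)


def age_zero_mean_unit_variance(age, age_mean, age_std_dev):
    # Same transform offline and online; the aggregates are passed in.
    return [(a - age_mean) / age_std_dev for a in age]


# Offline: compute the aggregates from the training data and store them.
training_age = [20.0, 30.0, 40.0]
stored = {
    "age_mean": age_mean(training_age),
    "age_std_dev": age_std_dev(training_age),
}
offline_feature = age_zero_mean_unit_variance(training_age, **stored)

# Online: a single request cannot yield a useful mean or standard deviation,
# so the stored training values are supplied instead of being recomputed.
online_feature = age_zero_mean_unit_variance([30.0], **stored)
```

The point of the sketch is that the feature function is shared between both contexts, and only the source of the aggregate values changes.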
This is a less than ideal solution, but basically if the function being wrapped is a coroutine,
then make the wrapper async as well. This duplicates a lot of code. I'm sure there's
a more succinct way to do this, but because I'm time pressed I'm doing the more
verbose solution.

Note: this doesn't fix it for all decorators, just the extract* ones.

Adds async tests to double check and ensure that things work as expected.
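
The approach described above (return an async wrapper when the wrapped function is a coroutine function, and a regular wrapper otherwise) can be sketched as follows. This is not Hamilton's implementation: `extract_fields_sketch` and the two loader functions are hypothetical stand-ins, and the dictionary filtering is a placeholder for whatever the real decorator does.

```python
import asyncio
import functools
import inspect


def extract_fields_sketch(*keys):
    """Hypothetical extract*-style decorator that keeps only the given keys."""

    def decorator(fn):
        if inspect.iscoroutinefunction(fn):
            # Wrapped function is a coroutine function: the wrapper must be
            # async too, so callers can still await it.
            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                result = await fn(*args, **kwargs)
                return {k: result[k] for k in keys}

            return async_wrapper

        # Otherwise fall back to a regular wrapper with the same logic.
        @functools.wraps(fn)
        def sync_wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            return {k: result[k] for k in keys}

        return sync_wrapper

    return decorator


@extract_fields_sketch("age")
def load_sync():
    return {"age": 30, "name": "a"}


@extract_fields_sketch("age")
async def load_async():
    return {"age": 31, "name": "b"}


sync_result = load_sync()
async_result = asyncio.run(load_async())
```

The two wrappers duplicate the post-processing, which mirrors the verbosity noted above; factoring the shared step into a helper would be one way to reduce it.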
To make it clear that this isn't a generic feature engineering example, but
one about doing it in multiple contexts.
This example here is contrived again. However it should illustrate how you can replace
getting data with Hamilton quite easily. The example won't fit the needs of everyone, since
people's needs will likely fall in between scenario 1 and scenario 2, but hopefully it provides
them enough context to get going with feature engineering and Hamilton.
This updates the sphinx docs with a high level overview of feature engineering,
for various contexts, with links to examples.
@skrawcz skrawcz force-pushed the feature-engineering-examples branch from 07b7000 to 6263e3e Compare March 4, 2023 21:32
So that it's clear there are caveats that people should think about
when doing feature engineering.

Adds FAQ section, with single Feast question.
@skrawcz skrawcz force-pushed the feature-engineering-examples branch from 6263e3e to 4944623 Compare March 4, 2023 21:54
@skrawcz skrawcz merged commit 0254a4e into main Mar 4, 2023
@skrawcz skrawcz deleted the feature-engineering-examples branch March 4, 2023 22:38
Development

Successfully merging this pull request may close these issues.

Extract* doesn't work with async function
2 participants