This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Decorator improvements #316

Merged 5 commits on Feb 18, 2023
181 changes: 147 additions & 34 deletions decorators.md

### @tag

Allows you to attach metadata to an output(s), i.e. all nodes generated by a function and its decorators (note: by default, this only applies to "final" nodes --
not any intermediate nodes that are generated). See "A mental model for decorators" below for how/why this works.

A common use of this is to enable marking nodes as part of some data product, or for GDPR/privacy purposes.

For instance (the tag names here are illustrative):

```python
import pandas as pd
from hamilton.function_modifiers import tag

@tag(pii="true")
def intermediate_column() -> pd.Series:
    ...

@tag(pii="false", data_product="final")
def final_column(intermediate_column: pd.Series) -> pd.Series:
    pass
```

`@tag` also allows you to specify a target with the `target_` parameter. See "A mental model for decorators" below for more details.


### @tag_outputs (deprecated)

`tag_outputs` enables you to attach metadata to a function that outputs multiple nodes,
and give different tag values to different outputs:
The `check_output` validator takes in arguments that each correspond to one of the default validators.
These arguments tell it to add the default validator to the list. The above thus creates
two validators, one that checks the datatype of the series, and one that checks whether the data is in a certain range.


Note that you can also specify custom validators using the `@check_output_custom` decorator.
`check_output` and `check_output_custom` accept the `target_` parameter, which specifies *which* output to check.

See [data_quality](data_quality.md) for more information on available validators and how to build custom ones.
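To make the "one validator per argument" idea concrete, here is a minimal plain-Python sketch. These helper names are illustrative, not Hamilton's actual validator classes (the real ones live in the data quality module):

```python
# Conceptual sketch only -- illustrative helpers, not Hamilton's API.
def make_type_validator(expected_type):
    """Checks that every element of the output has the expected type."""
    def validate(values):
        return all(isinstance(v, expected_type) for v in values)
    return validate

def make_range_validator(low, high):
    """Checks that every element of the output falls within [low, high]."""
    def validate(values):
        return all(low <= v <= high for v in values)
    return validate

# check_output(data_type=float, range=(0.0, 1.0)) conceptually builds one
# validator per argument and runs each against the decorated function's output:
validators = [make_type_validator(float), make_range_validator(0.0, 1.0)]
output = [0.1, 0.5, 0.9]
print([v(output) for v in validators])  # [True, True]
```

Each argument you pass adds an independent check; an output must satisfy all of them to pass.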

## @subdag

Currently located under `experimental` -- looking for feedback!

The `@subdag` decorator enables you to rerun components of your DAG with varying parameters. Note that this is immensely powerful -- if we
draw analogies from Hamilton to standard procedural programming paradigms, we might have the following correspondence:

- `config.when` + friends -- `if/else` statements
- `parameterize`/`extract_columns` -- `for` loop
- `does` -- effectively macros
- `@subdag` -- subroutine definition

And so on. `@subdag` takes this one step further: take a certain set of nodes, and run them with specified parameters.

Why might you want to use this? Let's take a look at some examples:

1. You have a feature engineering pipeline that you want to run on multiple datasets. If it's exactly the same, this is perfect. If not, this
works as well -- you just utilize different functions in each, or use `config.when` + the `config` parameter to vary the runs.
2. You want to train multiple models in the same DAG that share some logic (in features or training) -- this allows you to reuse and continually add more.
3. You want to combine multiple similar DAGs (e.g. one for each business line) into one so you can build a cross-business line model.

This basically bridges the gap between the flexibility of non-declarative pipelining frameworks and the readability/maintainability of declarative ones.

Let's take a look at a simplified example (in [examples/](examples/reusing_functions/reusable_subdags.py)).

```python
def unique_users(filtered_interactions: pd.DataFrame, grain: str) -> pd.Series:
    """Gives the number of unique users at the specified grain"""
    return ...

@subdag(
    unique_users, interactions_filtered,
    inputs={"grain": value("day")},
    config={"region": "US"},
)
def daily_users_US(unique_users: pd.Series) -> pd.Series:
    """Calculates daily data for just US users.
    Note that this just returns the output. You can modify it if you'd like!
    """
    return unique_users


@subdag(
    unique_users, interactions_filtered,
    inputs={"grain": value("day")},
    config={"region": "CA"},
)
def daily_users_CA(unique_users: pd.Series) -> pd.Series:
    """Calculates daily data for just Canada users"""
    return unique_users
```

This example tracks users on a per-value user data. Specifically, we track the following:
Expand All @@ -452,14 +464,33 @@ This example tracks users on a per-value user data. Specifically, we track the f
2. Quarterly user data for the US

These each live under a separate namespace -- this exists solely so the two sets of similar nodes can coexist.
Note this set is contrived to demonstrate functionality -- it should be easy to imagine how we could add more variations.

The `subdag` decorator takes a variety of inputs that determine _which_ possible set of functions will constitute the subDAG, and then specifically _what_ the subDAG is and _how_ it is connected.
- _Which_ functions constitute the subdag is specified by the `*args` input, a collection of modules/functions. These are used to determine the functions (i.e. nodes) that could end up in the produced subDAG.
- _What_ the subDAG is, and _how_ it is connected, is specified by two parameters: `config` provides configuration used to generate the subDAG (in this case the region), and `inputs` provides inputs that connect the subdag with the enclosing DAG (similar in spirit to `@parameterize`).
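If it helps, the "one definition, many instantiations" idea can be sketched as a plain-Python analogy (ordinary functions, not Hamilton's API -- the names here are illustrative):

```python
# A conceptual analogy in plain Python, not Hamilton's API.
def unique_users_pipeline(interactions, region, grain):
    """One sub-pipeline definition: a config-driven filter, then an input-driven aggregation."""
    filtered = [i for i in interactions if i["region"] == region]  # driven by `config`
    users = {(i["user"], i[grain]) for i in filtered}              # driven by `inputs`
    return len(users)

interactions = [
    {"user": "a", "region": "US", "day": 1},
    {"user": "a", "region": "US", "day": 2},
    {"user": "b", "region": "CA", "day": 1},
]

# @subdag conceptually instantiates the same definition twice, varying config/inputs:
daily_users_US = unique_users_pipeline(interactions, region="US", grain="day")
daily_users_CA = unique_users_pipeline(interactions, region="CA", grain="day")
print(daily_users_US, daily_users_CA)  # 2 1
```

The difference is that `@subdag` does this expansion declaratively, at DAG-construction time, so Hamilton can see and validate every node.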


Note that, if you wanted this functionality without the decorator, you'd have two options:
1. Rewrite every function for each scenario -- this is repetitive and doesn't scale.
2. Utilize the `driver` within the functions, e.g.:

```python
def daily_users_CA() -> pd.Series:
    """Calculates daily data for just Canada users"""
    dr = hamilton.driver.Driver({"region": "CA"}, unique_users, interactions_filtered)
    return dr.execute(["unique_users"], inputs={"grain": "day"})["unique_users"]
```

While this is a clever approach (and you can use it if you want), there are a few drawbacks:

1. You have no visibility into the subdag -- Hamilton has no knowledge of what's being run.
2. Any execution parameters have to be reproduced for the inner driver -- e.g. running on dask is unknown to the outer driver.
3. DAG compilation is done at runtime -- you would not catch any type errors until the function is running.

That said, this *is* a good mental model for explaining subdag functionality. Think about
it as a driver within a function, but in a way that Hamilton can understand and operate on.
Best of both worlds!

## @parameterize_extract_columns

Models (optionally) accept an `output_column` parameter -- this is specifically if the name of the model function differs
from the output column that it should represent, e.g. if you use the model result as an intermediate object and manipulate
it all later. At Stitch Fix this is necessary because various dependent columns that a model queries
(e.g. `MULTIPLIER_...` and `OFFSET_...`) are derived from the model's name.

## A mental model for decorators

Decorators fall under a few broad categories:

1. Decorators that decide whether a function should resolve into a node (e.g. `@config.when()`)
2. Decorators that *turn* a function into a set of nodes (e.g. `@does`)
3. Decorators that modify a set of nodes, turning them into another set of nodes (e.g. `@parameterize`, `@extract_columns`, and `@check_output`)

Note the distinction between "functions" and "nodes". "functions" are what the user writes, and in the plain case,
correspond 1:1 to nodes. Nodes are then generated from functions -- decorators control how that happens. Thus, one
function can correspond to 0, 1 or many nodes.
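A toy sketch of that expansion (plain Python, not Hamilton internals) shows how a decorator like `@extract_columns` turns one function into a base node plus one node per column; `expand_to_nodes` is a hypothetical helper invented for illustration:

```python
# Hypothetical sketch of one function expanding into several nodes.
def expand_to_nodes(func, extract=None):
    """One function -> a dict of named nodes: the base node plus one per column."""
    nodes = {func.__name__: func}  # the base node keeps the function's name
    for col in extract or []:
        # each extracted column becomes its own node that consumes the base output
        nodes[col] = (lambda c: lambda df: df[c])(col)
    return nodes

def dataframe():
    return {"col1": [1, 2], "col2": [3, 4]}

nodes = expand_to_nodes(dataframe, extract=["col1", "col2"])
print(sorted(nodes))  # ['col1', 'col2', 'dataframe'] -- one function, three nodes
```

With no decorator, the dict would hold a single entry: the 1:1 function-to-node case.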

The decorators that do this are broken up into subclasses that each have various nuances, one of which is whether or not they allow layering.
To layer effectively, we need to know *which* nodes a decorator will modify from the result of another decorator. The ones
that allow this type of layering take in a `target_` parameter (with a trailing `_` because they often accept `**kwargs`) that tells them *which*
nodes they should modify.

There are 4 options for `target_`:

1. `None` -- the default behavior. The transformation only applies to nodes generated by the function that are not
depended on by any other nodes generated by the function (the "final" nodes).
2. `...` (the literal python Ellipsis) -- *all* nodes generated by the function and its decorators
have the transformation applied to them.
3. A string -- only the node with the specified name has the transformation applied.
4. A list of strings -- all nodes with the specified names have the transformation applied.
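The default (`None`) case can be pinned down with a small sketch: given the nodes one function produced and the dependencies among them, the "final" nodes are the ones no sibling node consumes. `final_nodes` is a hypothetical helper, not Hamilton's code:

```python
# Hypothetical helper pinning down the default target_ behavior.
def final_nodes(nodes, depends_on):
    """Return the nodes that no sibling node (from the same function) depends on."""
    consumed = set()
    for upstream in depends_on.values():
        consumed |= upstream
    return {n for n in nodes if n not in consumed}

# @extract_columns("col1", "col2") on `dataframe` yields three nodes; col1/col2
# consume dataframe, so only they are "final" and get tagged when target_ is None.
nodes = {"dataframe", "col1", "col2"}
deps = {"col1": {"dataframe"}, "col2": {"dataframe"}}
print(sorted(final_nodes(nodes, deps)))  # ['col1', 'col2']
```

This is why, in the examples below, the default `@tag` lands on the extracted columns rather than on the base dataframe node.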

This is very powerful, and we will likely be applying it to more decorators. Currently `tag` and `check_output` are the only
decorators that accept this parameter.

To dive into this, let's take a look at the `tag` decorator with the default `target_`:

This will only decorate `col1`, `col2`, and `col3`:

```python
import pandas as pd
from hamilton.function_modifiers import tag, extract_columns

@tag(type="dataframe")
@extract_columns("col1", "col2", "col3")
def dataframe() -> pd.DataFrame:
pass
```

Once we decorate it with `target_="original_dataframe"`, we are now *just* decorating
the dataframe that gets extracted from (that is a node after all).

```python
import pandas as pd
from hamilton.function_modifiers import tag, extract_columns

@tag(type="dataframe", target_='original_dataframe')
@extract_columns("col1", "col2", "col3")
def dataframe() -> pd.DataFrame:
pass
```

Whereas the following would tag all the columns we extract with one value:

```python
import pandas as pd
from hamilton.function_modifiers import tag, extract_columns

@tag(type="extracted_column", target_=["col1", "col2", "col3"])
@extract_columns("col1", "col2", "col3")
def dataframe() -> pd.DataFrame:
pass
```

The following would tag *everything*:

```python
import pandas as pd
from hamilton.function_modifiers import tag, extract_columns

@tag(type="any_node_created", target_=...)
@extract_columns("col1", "col2", "col3")
def dataframe() -> pd.DataFrame:
pass
```

Passing `None` to `target_` is the same as not passing it at all -- this decorates the "final" nodes as we saw in the first example.
12 changes: 12 additions & 0 deletions examples/reusing_functions/README.md
# subdag operator

This README demonstrates the use of the subdag operator.

The subdag operator allows you to effectively run a driver within a node.
In this case, we are calculating unique website visitors from the following set of parameters:

1. Region = CA (canada) or US (United States)
2. Granularity of data = (day, week, month)

You can find the code in [unique_users.py](unique_users.py) and [reusable_subdags.py](reusable_subdags.py)
and look at how we run it in [main.py](main.py).
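As a quick sanity check, the two parameters above multiply out to six output nodes. The name pattern below is assumed from the `execute` call in `main.py`:

```python
from itertools import product

grains = ["daily", "weekly", "monthly"]
regions = ["US", "CA"]

# one subdag instantiation per (grain, region) pair -> six output nodes
outputs = [f"{g}_unique_users_{r}" for g, r in product(grains, regions)]
print(len(outputs))  # 6
```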
41 changes: 23 additions & 18 deletions examples/reusing_functions/main.py
    def build_result(self, **outputs: Dict[str, Any]) -> Any:
        ...
        return pd.DataFrame(resampled_results).bfill()


def main():
    dr = Driver(
        {},
        reusable_subdags,
        adapter=SimplePythonGraphAdapter(
            result_builder=TimeSeriesJoinResultsBuilder(upsample_frequency="D")
        ),
    )

    result = dr.execute(
        [
            "daily_unique_users_US",
            "daily_unique_users_CA",
            "weekly_unique_users_US",
            "weekly_unique_users_CA",
            "monthly_unique_users_US",
            "monthly_unique_users_CA",
        ]
    )

    print(result)


if __name__ == "__main__":
    main()