This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Decorator improvements #316

Merged 5 commits on Feb 18, 2023
181 changes: 147 additions & 34 deletions decorators.md

### @tag

Allows you to attach metadata to an output(s), i.e. all nodes generated by a function and its decorators (note: by default, this only applies to "final" nodes --
not any intermediate nodes that are generated). See "A mental model for decorators" below for how/why this works.

A common use of this is to enable marking nodes as part of some data product, or for GDPR/privacy purposes.

For instance (the tag names here are illustrative):

```python
import pandas as pd
from hamilton.function_modifiers import tag

@tag(pii="true")
def intermediate_column() -> pd.Series:
    ...

@tag(pii="false", data_product="final")
def final_column(intermediate_column: pd.Series) -> pd.Series:
    pass
```

`@tag` also allows you to specify a target with the `target_` parameter. See "A mental model for decorators" below for more details.


### @tag_outputs (deprecated)

`tag_outputs` enables you to attach metadata to a function that outputs multiple nodes,
and give different tag values to different outputs:
The `check_output` validator takes in arguments that each correspond to one of the default validators.
These arguments tell it to add the default validator to the list. The above thus creates
two validators, one that checks the datatype of the series, and one that checks whether the data is in a certain range.


Note that you can also specify custom validators using the `@check_output_custom` decorator.
`check_output` and `check_output_custom` accept the `target_` parameter, which specifies *which* output to check.

See [data_quality](data_quality.md) for more information on available validators and how to build custom ones.
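To make the "one validator per argument" idea concrete, here is a minimal plain-Python sketch. These helper names are illustrative, not Hamilton's actual validator classes (the real ones live in the data quality module):

```python
# Conceptual sketch only -- illustrative helpers, not Hamilton's API.
def make_type_validator(expected_type):
    """Checks that every element of the output has the expected type."""
    def validate(values):
        return all(isinstance(v, expected_type) for v in values)
    return validate

def make_range_validator(low, high):
    """Checks that every element of the output falls within [low, high]."""
    def validate(values):
        return all(low <= v <= high for v in values)
    return validate

# check_output(data_type=float, range=(0.0, 1.0)) conceptually builds one
# validator per argument and runs each against the decorated function's output:
validators = [make_type_validator(float), make_range_validator(0.0, 1.0)]
output = [0.1, 0.5, 0.9]
print([v(output) for v in validators])  # [True, True]
```

Each argument you pass adds an independent check; an output must satisfy all of them to pass.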

## @subdag

Currently located under `experimental` -- looking for feedback!

The `@subdag` decorator enables you to rerun components of your DAG with varying parameters. Note that this is immensely powerful -- if we
draw analogies from Hamilton to standard procedural programming paradigms, we might have the following correspondence:

- `config.when` + friends -- `if/else` statements
- `parameterize`/`extract_columns` -- `for` loop
- `does` -- effectively macros
- `@subdag` -- subroutine definition

And so on. `@subdag` takes this one step further: take a certain set of nodes, and run them with specified parameters.

Why might you want to use this? Let's take a look at some examples:

1. You have a feature engineering pipeline that you want to run on multiple datasets. If it's exactly the same, this is perfect. If not, this
works as well -- you just utilize different functions in each, or use `config.when` + the `config` parameter to vary the runs.
2. You want to train multiple models in the same DAG that share some logic (in features or training) -- this allows you to reuse and continually add more.
3. You want to combine multiple similar DAGs (e.g. one for each business line) into one so you can build a cross-business line model.

This basically bridges the gap between the flexibility of non-declarative pipelining frameworks and the readability/maintainability of declarative ones.

Let's take a look at a simplified example (in [examples/](examples/reusing_functions/reusable_subdags.py)).

```python
def unique_users(filtered_interactions: pd.DataFrame, grain: str) -> pd.Series:
    """Gives the number of unique users at the specified grain"""
    return ...

@subdag(
    unique_users, interactions_filtered,
    inputs={"grain": value("day")},
    config={"region": "US"},
)
def daily_users_US(unique_users: pd.Series) -> pd.Series:
    """Calculates daily data for just US users.
    Note that this just returns the output. You can modify it if you'd like!
    """
    return unique_users


@subdag(
    unique_users, interactions_filtered,
    inputs={"grain": value("day")},
    config={"region": "CA"},
)
def daily_users_CA(unique_users: pd.Series) -> pd.Series:
    """Calculates daily data for just Canada users"""
    return unique_users
```

This example tracks users on a per-value user data. Specifically, we track the following:
Expand All @@ -452,14 +464,33 @@ This example tracks users on a per-value user data. Specifically, we track the f
2. Quarterly user data for the US

These each live under a separate namespace -- this exists solely so the two sets of similar nodes can coexist.
Note this set is contrived to demonstrate functionality -- it should be easy to imagine how we could add more variations.

The `subdag` decorator takes a variety of inputs that determine _which_ possible set of functions will constitute the subDAG, and then specifically _what_ the subDAG is and _how_ it is connected.
- _Which_ functions constitute the subdag is specified by the `*args` input, a collection of modules/functions. These are used to determine the functions (i.e. nodes) that could end up in the produced subDAG.
- _What_ the subDAG is, and _how_ it is connected, is specified by two parameters: `config` provides configuration used to generate the subDAG (in this case the region), and `inputs` provides inputs that connect the subdag with the enclosing DAG (similar in spirit to `@parameterize`).
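If it helps, the "one definition, many instantiations" idea can be sketched as a plain-Python analogy (ordinary functions, not Hamilton's API -- the names here are illustrative):

```python
# A conceptual analogy in plain Python, not Hamilton's API.
def unique_users_pipeline(interactions, region, grain):
    """One sub-pipeline definition: a config-driven filter, then an input-driven aggregation."""
    filtered = [i for i in interactions if i["region"] == region]  # driven by `config`
    users = {(i["user"], i[grain]) for i in filtered}              # driven by `inputs`
    return len(users)

interactions = [
    {"user": "a", "region": "US", "day": 1},
    {"user": "a", "region": "US", "day": 2},
    {"user": "b", "region": "CA", "day": 1},
]

# @subdag conceptually instantiates the same definition twice, varying config/inputs:
daily_users_US = unique_users_pipeline(interactions, region="US", grain="day")
daily_users_CA = unique_users_pipeline(interactions, region="CA", grain="day")
print(daily_users_US, daily_users_CA)  # 2 1
```

The difference is that `@subdag` does this expansion declaratively, at DAG-construction time, so Hamilton can see and validate every node.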


Note that, if you wanted this functionality without the decorator, you'd have two options:
1. Rewrite every function for each scenario -- this is repetitive and doesn't scale.
2. Utilize the `driver` within the functions, e.g.:

```python
def daily_users_CA() -> pd.Series:
    """Calculates daily data for just Canada users"""
    dr = hamilton.driver.Driver({"region": "CA"}, unique_users, interactions_filtered)
    return dr.execute(["unique_users"], inputs={"grain": "day"})["unique_users"]
```

While this is a clever approach (and you can use it if you want), there are a few drawbacks:

1. You have no visibility into the subdag -- Hamilton has no knowledge of what's being run.
2. Any execution parameters have to be reproduced for the inner driver -- e.g. running on dask is unknown to the outer driver.
3. DAG compilation is done at runtime -- you would not catch any type errors until the function is running.

That said, this *is* a good mental model for explaining subdag functionality. Think about
it as a driver within a function, but in a way that Hamilton can understand and operate on.
Best of both worlds!

## @parameterize_extract_columns

Models (optionally) accept an `output_column` parameter -- this is specifically if the name of the model function differs
from the output column that it should represent, e.g. if you use the model result as an intermediate object and manipulate
it all later. At Stitch Fix this is necessary because various dependent columns that a model queries
(e.g. `MULTIPLIER_...` and `OFFSET_...`) are derived from the model's name.

## A mental model for decorators

Decorators fall under a few broad categories:

1. Decorators that decide whether a function should resolve into a node (e.g. `@config.when()`)
2. Decorators that *turn* a function into a set of nodes (e.g. `@does`)
3. Decorators that modify a set of nodes, turning them into another set of nodes (e.g. `@parameterize`, `@extract_columns`, and `@check_output`)

Note the distinction between "functions" and "nodes". "functions" are what the user writes, and in the plain case,
correspond 1:1 to nodes. Nodes are then generated from functions -- decorators control how that happens. Thus, one
function can correspond to 0, 1 or many nodes.
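A toy sketch of that expansion (plain Python, not Hamilton internals) shows how a decorator like `@extract_columns` turns one function into a base node plus one node per column; `expand_to_nodes` is a hypothetical helper invented for illustration:

```python
# Hypothetical sketch of one function expanding into several nodes.
def expand_to_nodes(func, extract=None):
    """One function -> a dict of named nodes: the base node plus one per column."""
    nodes = {func.__name__: func}  # the base node keeps the function's name
    for col in extract or []:
        # each extracted column becomes its own node that consumes the base output
        nodes[col] = (lambda c: lambda df: df[c])(col)
    return nodes

def dataframe():
    return {"col1": [1, 2], "col2": [3, 4]}

nodes = expand_to_nodes(dataframe, extract=["col1", "col2"])
print(sorted(nodes))  # ['col1', 'col2', 'dataframe'] -- one function, three nodes
```

With no decorator, the dict would hold a single entry: the 1:1 function-to-node case.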

The decorators that do this are broken up into subclasses that each have various nuances, one of which is whether or not they allow layering.
To layer effectively, we need to know *which* nodes a decorator will modify from the result of another decorator. The ones
that allow this type of layering take in a `target_` parameter (with a trailing `_` because they often accept `**kwargs`) that tells them *which*
nodes they should modify.

There are 4 options for `target_`:

1. `None` -- the default behavior. The transformation only applies to nodes generated by the function that are not
depended on by any other nodes generated by the function (the "final" nodes).
2. `...` (the literal python Ellipsis) -- *all* nodes generated by the function and its decorators
have the transformation applied to them.
3. A string -- only the node with the specified name has the transformation applied.
4. A list of strings -- all nodes with the specified names have the transformation applied.
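The default (`None`) case can be pinned down with a small sketch: given the nodes one function produced and the dependencies among them, the "final" nodes are the ones no sibling node consumes. `final_nodes` is a hypothetical helper, not Hamilton's code:

```python
# Hypothetical helper pinning down the default target_ behavior.
def final_nodes(nodes, depends_on):
    """Return the nodes that no sibling node (from the same function) depends on."""
    consumed = set()
    for upstream in depends_on.values():
        consumed |= upstream
    return {n for n in nodes if n not in consumed}

# @extract_columns("col1", "col2") on `dataframe` yields three nodes; col1/col2
# consume dataframe, so only they are "final" and get tagged when target_ is None.
nodes = {"dataframe", "col1", "col2"}
deps = {"col1": {"dataframe"}, "col2": {"dataframe"}}
print(sorted(final_nodes(nodes, deps)))  # ['col1', 'col2']
```

This is why, in the examples below, the default `@tag` lands on the extracted columns rather than on the base dataframe node.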

This is very powerful, and we will likely be applying it to more decorators. Currently `tag` and `check_output` are the only
decorators that accept this parameter.

To dive into this, let's take a look at the `tag` decorator with the default `target_`:

This will only decorate `col1`, `col2`, and `col3`:

```python
import pandas as pd
from hamilton.function_modifiers import tag, extract_columns

@tag(type="dataframe")
@extract_columns("col1", "col2", "col3")
def dataframe() -> pd.DataFrame:
pass
```

Once we decorate it with `target_="original_dataframe"`, we are now *just* decorating
the dataframe that gets extracted from (that is a node after all).

```python
import pandas as pd
from hamilton.function_modifiers import tag, extract_columns

@tag(type="dataframe", target_='original_dataframe')
@extract_columns("col1", "col2", "col3")
def dataframe() -> pd.DataFrame:
pass
```

Whereas the following would tag all the columns we extract with one value:

```python
import pandas as pd
from hamilton.function_modifiers import tag, extract_columns

@tag(type="extracted_column", target_=["col1", "col2", "col3"])
@extract_columns("col1", "col2", "col3")
def dataframe() -> pd.DataFrame:
pass
```

The following would tag *everything*:

```python
import pandas as pd
from hamilton.function_modifiers import tag, extract_columns

@tag(type="any_node_created", target_=...)
@extract_columns("col1", "col2", "col3")
def dataframe() -> pd.DataFrame:
pass
```

Passing `None` to `target_` is the same as not passing it at all -- this decorates the "final" nodes as we saw in the first example.
12 changes: 12 additions & 0 deletions examples/reusing_functions/README.md
# subdag operator

This README demonstrates the use of the subdag operator.

The subdag operator allows you to effectively run a driver within a node.
In this case, we are calculating unique website visitors from the following set of parameters:

1. Region = CA (canada) or US (United States)
2. Granularity of data = (day, week, month)

You can find the code in [unique_users.py](unique_users.py) and [reusable_subdags.py](reusable_subdags.py)
and look at how we run it in [main.py](main.py).
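As a quick sanity check, the two parameters above multiply out to six output nodes. The name pattern below is assumed from the `execute` call in `main.py`:

```python
from itertools import product

grains = ["daily", "weekly", "monthly"]
regions = ["US", "CA"]

# one subdag instantiation per (grain, region) pair -> six output nodes
outputs = [f"{g}_unique_users_{r}" for g, r in product(grains, regions)]
print(len(outputs))  # 6
```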
41 changes: 23 additions & 18 deletions examples/reusing_functions/main.py
    def build_result(self, **outputs: Dict[str, Any]) -> Any:
        ...
        return pd.DataFrame(resampled_results).bfill()


def main():
    dr = Driver(
        {},
        reusable_subdags,
        adapter=SimplePythonGraphAdapter(
            result_builder=TimeSeriesJoinResultsBuilder(upsample_frequency="D")
        ),
    )

    result = dr.execute(
        [
            "daily_unique_users_US",
            "daily_unique_users_CA",
            "weekly_unique_users_US",
            "weekly_unique_users_CA",
            "monthly_unique_users_US",
            "monthly_unique_users_CA",
        ]
    )

    print(result)


if __name__ == "__main__":
    main()