Adds/edits/updates some documentation

stitchfix · May 13, 2021 · 4386518
1 parent 7b336b1
commit 4386518

Showing 11 changed files with 540 additions and 331 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,6 +1,6 @@
 # Guidance on how to contribute
 
-> All contributions to this project will be released under the Affero General Public License v3 (AGPLv3). 
+> All contributions to this project will be released under the Affero General Public License v3 (AGPLv3).
 > By submitting a pull request or filing a bug, issue, or
 > feature request, you are agreeing to comply with this waiver of copyright interest.
 > Details can be found in our [CLA](CLA.md) and [LICENSE](LICENSE).
@@ -29,5 +29,7 @@ Generally speaking, you should fork this repository, make changes in your
 own fork, and then submit a pull request. All new code should have associated
 unit tests that validate implemented features and the presence or lack of defects.
 Additionally, the code should follow any stylistic and architectural guidelines
-prescribed by the project. In the absence of such guidelines, mimic the styles
-and patterns in the existing code-base.
+prescribed by the project. For Hamilton, this means installing the pre-commit hook and using
+the provided style files. In short, mimic the styles and patterns in the Hamilton code-base.
+
+To get set up for development, read our [developer setup guide](developer_setup.md).
diff --git a/README.md b/README.md
diff --git a/basics.md b/basics.md
@@ -0,0 +1,138 @@
+# Hamilton Basics
+
+There are two parts to Hamilton:
+
+1. Hamilton Functions.
+
+   Hamilton Functions are what you, the end user, write.
+
+2. Hamilton Driver.
+
+   Once you've written your functions, you will need to use the Hamilton Driver to build the DAG and orchestrate
+   execution.
+
+Let's dive deeper into these parts below, but first a word on terminology.
+
+We use the following terms interchangeably, e.g. a ____ in Hamilton is ... :
+
+* column
+* variable
+* node
+* function
+
+That's because we represent columns as functions, which are parts of a directed acyclic graph (DAG). That is,
+a column is part of a dataframe. To compute a column, we write a function that has input variables. From these functions
+we create a DAG and represent each function as a node, linking each input variable by an edge to its respective node.
+
+## Hamilton Functions
+Using Hamilton is all about writing functions. From these functions a dataframe is constructed for you at execution time.
+
+A simple (but rather contrived) example of a Hamilton function that adds two numbers is as follows:
+
+```python
+def _sum(*vars):
+    """Helper function to sum numbers.
+    This is here to demonstrate that functions starting with _ do not get processed by hamilton.
+    """
+    return sum(vars)
+
+def sum_a_b(a: int, b: int) -> int:
+    """Adds a and b together
+    :param a: The first number to add
+    :param b: The second number to add
+    :return: The sum of a and b
+    """
+    return _sum(a, b)  # Delegates to a helper function
+```
+
+While this looks like a simple python function, there are a few components to note:
+1. The function name `sum_a_b` is a globally unique key. In the DAG there can only be one function named `sum_a_b`.
+   While this is not optimal for functionality reuse, it makes it extremely easy to learn exactly how a node in the DAG is generated,
+   and separate out that logic for debugging/iterating.
+2. The function `sum_a_b` depends on two upstream nodes -- `a` and `b`. This means that these values must either be:
+    * Defined by another function
+    * Passed in by the user as a configuration variable (see `Hamilton Driver Code` below)
+3. The function `sum_a_b` makes full use of the python type-hint system. This is required in Hamilton,
+   as it allows us to type-check the inputs and outputs to match with upstream producers and downstream consumers. In this case,
+   we know that the input `a` has to be an integer, the input `b` has to also be an integer, and anything that declares `sum_a_b` as an input
+   has to declare it as an integer.
+4. Standard python documentation is a first-class citizen. As we have a 1:1 relationship between python functions and
+   nodes, each function documentation also describes a piece of business logic.
+5. Functions that start with `_` are ignored, and not included in the DAG. Hamilton tries to make use of every function
+   in a module, so this allows us to easily indicate helper functions that won't become part of the DAG.
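+
+The points above can be illustrated with a minimal sketch of how parameter names might be turned into DAG edges.
+This is not Hamilton's actual implementation; `build_dependencies` and `doubled_sum` are hypothetical names used
+only for illustration, and the sketch relies on nothing but the standard library's `inspect` module.
+
+```python
+import inspect
+from typing import Callable, Dict, List
+
+
+def sum_a_b(a: int, b: int) -> int:
+    """Adds a and b together."""
+    return a + b
+
+
+def doubled_sum(sum_a_b: int) -> int:
+    """Depends on the node named sum_a_b (hypothetical example node)."""
+    return 2 * sum_a_b
+
+
+def _helper() -> int:
+    """Skipped below because the name starts with an underscore."""
+    return 0
+
+
+def build_dependencies(functions: List[Callable]) -> Dict[str, List[str]]:
+    """Map each node name to the names of the nodes it depends on.
+
+    Assumption: a node is named after its function, and each parameter
+    name refers to an upstream node or a user-provided input.
+    """
+    return {
+        fn.__name__: list(inspect.signature(fn).parameters)
+        for fn in functions
+        if not fn.__name__.startswith('_')
+    }
+
+
+deps = build_dependencies([sum_a_b, doubled_sum, _helper])
+print(deps)  # {'sum_a_b': ['a', 'b'], 'doubled_sum': ['sum_a_b']}
+```
+
+Here `a` and `b` have no defining function, so in this sketch they would have to be supplied by the user.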
+
+
+### Python Types & Hamilton
+
+Hamilton makes use of python's type-hinting feature to check compatibility between function outputs and function inputs. However,
+this is not particularly sophisticated, largely due to the lack of available tooling in python. Thus, generic types do not function correctly.
+The following will not work:
+
+```python
+def some_func() -> Dict[str, int]:
+    return {'a': 2}
+```
+
+The following will both work:
+```python
+def some_func() -> Dict:
+    return {1: 2}
+```
+
+```python
+def some_func() -> dict:
+    return {1: 2}
+```
+
+While this is unfortunate, the typing API in python is not yet sophisticated enough to rely on accurate subclass validation.
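+
+As a rough illustration of the limitation, the sketch below compares type hints with `issubclass`, which is one
+plausible way to do such a check (an assumption; Hamilton's own check may differ). The standard library refuses
+subscripted generics in `issubclass`, while plain classes work. `producer` and `consumer` are hypothetical names.
+
+```python
+from typing import Dict, get_type_hints
+
+
+def producer() -> dict:
+    return {1: 2}
+
+
+def consumer(producer: dict) -> int:
+    return len(producer)
+
+
+# Compare the producer's return hint with the consumer's parameter hint.
+produced = get_type_hints(producer)['return']
+expected = get_type_hints(consumer)['producer']
+plain_ok = issubclass(produced, expected)
+print(plain_ok)  # True
+
+# The same check against a subscripted generic raises TypeError.
+try:
+    issubclass(dict, Dict[str, int])
+    generic_check_failed = False
+except TypeError:
+    generic_check_failed = True
+print(generic_check_failed)  # True
+```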
+
+## Hamilton Driver Code
+For documentation on the actual Hamilton Driver code, we invite the reader to [read the Driver class source code](/hamilton/driver.py) directly.
+
+At a high level, the driver code does two things:
+
+1. Creates a Directed Acyclic Graph (DAG) from the functions you define.
+   ```python
+   from hamilton import driver
+   dr = driver.Driver(config, *modules_to_load)  # this creates the DAG from the modules you pass in.
+   ```
+2. Orchestrates execution given the expected outputs and provided inputs.
+   ```python
+   df = dr.execute(final_vars, overrides, display_graph)  # this executes the DAG appropriately to create the dataframe.
+   ```
+
+The driver object also has a few other methods, e.g. `display_all_functions()`, `list_available_variables()`, but they're
+really only used for debugging purposes.
+
+Let's dive into the driver constructor call, and the execute method.
+
+### Constructor Call to Driver()
+The constructor call is pretty simple. Each constructor call sets up a DAG for execution given some configuration.
+So if you want to change something about the DAG, very likely you'll need to create a new Driver() object.
+
+#### config: Dict[str, Any], i.e. Configuration
+The configuration is used not just to feed data to the DAG, but also to determine the structure of the DAG.
+As such, it is passed in to the constructor, and used during DAG creation. This enables decorators such as `@config.when`.
+
+Otherwise, the contents of the _config_ dictionary should include all the inputs required for whatever final output you
+want to create. The configuration dictionary should not be used for overriding what Hamilton will compute.
+To do this, use the `overrides` parameter of `execute()` -- see below.
+
+#### *modules: ModuleType
+This can be any number of modules. We traverse the modules in the order they are provided.
+
+### Driver.execute()
+The execute function determines the DAG walk required to get the requisite final variables (aka columns) that you want
+in the dataframe. It also ensures that you have provided everything to execute properly.
+
+While executing, it uses a dictionary to memoize results, so that everything is computed only once. It executes the DAG
+via a recursive depth-first traversal, which leads to the possibility (although highly unlikely) of hitting python
+recursion depth errors. If that happens, the culprit is almost always a circular reference in the graph. We suggest
+displaying the DAG to verify this.
+
+To help speed up development of new or existing Hamilton Functions, we enable you to _override_ parts of the DAG. What
+this means is that before calling `execute()`, you have computed some result that you want to use instead of what Hamilton
+would produce. To do so, you just pass in a dictionary of `{'col_name': YOUR_VALUE}` as the overrides argument to the
+execute function.
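+
+The traversal, memoization, and overrides described above can be sketched as follows. This is a simplified
+illustration under stated assumptions, not the Driver's actual code: the `execute` helper and the example nodes
+`a_plus_b` and `result` are hypothetical, and the sketch returns a single value rather than a dataframe.
+
+```python
+import inspect
+from typing import Any, Callable, Dict, Optional
+
+
+def a_plus_b(a: int, b: int) -> int:
+    return a + b
+
+
+def result(a_plus_b: int, c: int) -> int:
+    return a_plus_b * c
+
+
+def execute(node: str,
+            functions: Dict[str, Callable],
+            inputs: Dict[str, Any],
+            overrides: Optional[Dict[str, Any]] = None) -> Any:
+    """Compute `node` via a recursive, memoized depth-first traversal."""
+    memo = dict(inputs)
+    # Overrides are seeded into the memo, so those nodes are never computed.
+    memo.update(overrides or {})
+
+    def compute(name: str) -> Any:
+        if name in memo:
+            return memo[name]
+        if name not in functions:
+            raise KeyError(f'No function or input provides {name!r}')
+        fn = functions[name]
+        # Resolve every upstream dependency first (depth-first).
+        kwargs = {dep: compute(dep) for dep in inspect.signature(fn).parameters}
+        memo[name] = fn(**kwargs)
+        return memo[name]
+
+    return compute(node)
+
+
+functions = {fn.__name__: fn for fn in (a_plus_b, result)}
+normal = execute('result', functions, {'a': 1, 'b': 2, 'c': 10})
+overridden = execute('result', functions, {'a': 1, 'b': 2, 'c': 10},
+                     overrides={'a_plus_b': 5})
+print(normal, overridden)  # 30 50
+```
+
+In this sketch a circular reference would recurse until Python raises a `RecursionError`, which matches the
+failure mode described above.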
+
+To visualize the DAG that would be executed, pass the flag `display_graph=True` to `execute()`. It will render an image of the graph in PDF format.
diff --git a/decorators.md b/decorators.md
@@ -0,0 +1,209 @@
+# Decorators
+
+While the 1:1 mapping of column -> function implementation is powerful, we've implemented a few decorators to promote
+business-logic reuse. The decorators we've defined are as follows
+(source can be found in [function_modifiers](hamilton/function_modifiers.py)):
+
+## @parametrized
+Expands a single function into n, each of which corresponds to a function in which the parameter value is replaced by
+that specific value.
+```python
+import pandas as pd
+from hamilton.function_modifiers import parametrized
+import internal_package_with_logic
+
+ONE_OFF_DATES = {
+     #output name        # doc string               # input value to function
+    ('D_ELECTION_2016', 'US Election 2016 Dummy'): '2016-11-12',
+    ('SOME_OUTPUT_NAME', 'Doc string for this thing'): 'value to pass to function',
+}
+            # parameter matches the name of the argument in the function below
+@parametrized(parameter='one_off_date', assigned_output=ONE_OFF_DATES)
+def create_one_off_dates(date_index: pd.Series, one_off_date: str) -> pd.Series:
+    """Given a date index, produces a series where a 1 is placed at the date index that would contain that event."""
+    one_off_dates = internal_package_with_logic.get_business_week(one_off_date)
+    return internal_package_with_logic.bool_to_int(date_index.isin([one_off_dates]))
+```
+We see here that `parametrized` allows you to keep your code DRY by reusing the same function to create multiple
+distinct outputs. The _parameter_ key word argument has to match one of the arguments in the function. The rest of
+the arguments are pulled from outside the DAG. The _assigned_output_ key word argument takes in a dictionary of
+tuple(Output Name, Documentation string) -> value.
+
+## @parametrized_input
+Expands a single function into n, each of which corresponds to a function in which the parameter value is fed
+the input from a specific column.
+```python
+import pandas as pd
+from hamilton.function_modifiers import parametrized_input
+import internal_package_with_logic
+
+ONE_OFF_DATES = {
+     # input var        # (output var,               description of new output)
+     'D_ELECTION_2016': ('D_ELECTION_2016_shifted', 'US election 2016 shifted by 1'),
+     'SOME_INPUT_NAME': ('SOME_OUTPUT_NAME', 'Doc string for this thing'),
+}
+            # parameter matches the name of the argument in the function below
+@parametrized_input(parameter='one_off_date', assigned_inputs=ONE_OFF_DATES)
+def date_shifter(one_off_date: pd.Series) -> pd.Series:
+    return one_off_date.shift(1)
+
+```
+We see here that `parametrized_input` allows you to keep your code DRY by reusing the same function to create multiple
+distinct outputs. The _parameter_ key word argument has to match one of the arguments in the function. The rest of
+the arguments are pulled from items inside the DAG. The _assigned_inputs_ key word argument takes in a
+dictionary of input_column -> tuple(Output Name, Documentation string).
+
+Note that this is equivalent to writing the following two function definitions:
+
+```python
+def D_ELECTION_2016_shifted(D_ELECTION_2016: pd.Series) -> pd.Series:
+    return D_ELECTION_2016.shift(1)
+
+def SOME_OUTPUT_NAME(SOME_INPUT_NAME: pd.Series) -> pd.Series:
+    return SOME_INPUT_NAME.shift(1)
+```
+
+Note also that the different input variables must all have compatible types with the original decorated input variable.
+
+## @extract_columns
+This works on a function that outputs a dataframe whose columns we want to extract and make individually
+available for consumption. It expands a single function into _n functions_, each of which takes in the output dataframe
+and outputs a specific column as named in the `extract_columns` decorator.
+```python
+import pandas as pd
+from hamilton.function_modifiers import extract_columns
+
+@extract_columns('fiscal_date', 'fiscal_week_name', 'fiscal_month', 'fiscal_quarter', 'fiscal_year')
+def fiscal_columns(date_index: pd.Series, fiscal_dates: pd.DataFrame) -> pd.DataFrame:
+    """Extracts the fiscal column data.
+    We want to ensure that it has the same spine as date_index.
+    :param fiscal_dates: the input dataframe to extract.
+    :return:
+    """
+    df = pd.DataFrame({'date_index': date_index}, index=date_index.index)
+    merged = df.join(fiscal_dates, how='inner')
+    return merged
+```
+Note: if you have a list of columns to extract, then when you call `@extract_columns` you should call it with an
+asterisk like this:
+```python
+import pandas as pd
+from hamilton.function_modifiers import extract_columns
+
+@extract_columns(*my_list_of_column_names)
+def my_func(...) -> pd.DataFrame:
+   """..."""
+```
+
+## @does
+`@does` is a decorator that essentially allows you to run a function over all the input parameters. You can't pass
+just any function to `@does`: it has to accept any number of inputs and process them all in the same way.
+```python
+import pandas as pd
+from hamilton.function_modifiers import does
+import internal_package_with_logic
+
+def sum_series(**series: pd.Series) -> pd.Series:
+    ...
+
+@does(sum_series)
+def D_XMAS_GC_WEIGHTED_BY_DAY(D_XMAS_GC_WEIGHTED_BY_DAY_1: pd.Series,
+                              D_XMAS_GC_WEIGHTED_BY_DAY_2: pd.Series) -> pd.Series:
+    """Adds D_XMAS_GC_WEIGHTED_BY_DAY_1 and D_XMAS_GC_WEIGHTED_BY_DAY_2"""
+    pass
+
+@does(internal_package_with_logic.identity_function)
+def copy_of_x(x: pd.Series) -> pd.Series:
+    """Just returns x"""
+    pass
+```
+The example here is a function that does nothing but sum all of its parameters together. So we can annotate it with
+the `@does` decorator and pass it the `sum_series` function.
+The `@does` decorator is currently limited to functions that take only one argument, a generic `**kwargs`.
+
+## @model
+`@model` allows you to abstract a function that is a model. You will need to implement models that make sense for
+your business case. Reach out if you need examples.
+
+Under the hood, they're just DAG nodes whose inputs are determined by a configuration parameter. A model takes in
+two required parameters:
+1. The class it uses to run the model. If external to Stitch Fix you will need to write your own, else internally
+   see the internal docs for this. Basically the class defined determines what the function actually does.
+2. The configuration key that determines how the model functions. This is just the name of a configuration parameter
+   that stores the way the model is run.
+
+The following is an example usage of `@model`:
+
+```python
+import pandas as pd
+from hamilton.function_modifiers import model
+import internal_package_with_logic
+
+@model(internal_package_with_logic.GLM, 'model_p_cancel_manual_res')
+# This runs a GLM (Generalized Linear Model)
+# The associated configuration parameter is 'model_p_cancel_manual_res',
+# which points to the results of loading the model_p_cancel_manual_res table
+def prob_cancel_manual_res() -> pd.Series:
+    pass
+```
+
+`GLM` here is not part of the hamilton framework, but rather a user-defined model.
+
+Models (optionally) accept an `output_column` parameter -- this is specifically for when the name of the function differs
+from the output column that it writes to, e.g. if you use the model result as an intermediate object and manipulate
+it later. This is necessary because various dependent columns that a model queries
+(e.g. `MULTIPLIER_...` and `OFFSET_...`) are derived from the model's name.
+
+## @config.when*
+
+`@config.when` allows you to specify different implementations depending on configuration parameters.
+
+The following use cases are supported:
+1. A column is present for only one value of a config parameter -- in this case, we define a function only once,
+   with a `@config.when`
+```python
+    import pandas as pd
+    from hamilton.function_modifiers import config
+
+    # signups_parent_before_launch is only present in the kids business line
+    @config.when(business_line='kids')
+    def signups_parent_before_launch(signups_from_existing_womens_tf: pd.Series) -> pd.Series:
+        """TODO:
+        :param signups_from_existing_womens_tf:
+        :return:
+        """
+        return signups_from_existing_womens_tf
+```
+2. A column is implemented differently for different business inputs, e.g. in the case of Stitch Fix gender intent.
+```python
+    import pandas as pd
+    from hamilton.function_modifiers import config, model
+    import internal_package_with_logic
+
+    # Some 21 day autoship cadence does not exist for kids, so we just return 0s
+    @config.when(gender_intent='kids')
+    def percent_clients_something__kids(date_index: pd.Series) -> pd.Series:
+        return pd.Series(index=date_index.index, data=0.0)
+
+    # In other business lines, we have a model for it
+    @config.when_not(gender_intent='kids')
+    @model(internal_package_with_logic.GLM, 'some_model_name', output_column='percent_clients_something')
+    def percent_clients_something_model() -> pd.Series:
+        pass
+```
+Note the following:
+- The function cannot have the same name in the same file (or python gets unhappy), so we append a
+  `__` (dunderscore) followed by a suffix. The dunderscore and suffix are removed from the name of the resulting node.
+- There is currently no `@config.otherwise(...)` decorator, so make sure your `config.when` decorators cover the full set of
+  configuration possibilities.
+Any missing cases will not have that output column (and subsequent downstream nodes may error out if they ask for it).
+To make this easier, we have a few more `@config` decorators:
+
+    - `@config.when_not(param=value)` Will be included if the parameter is _not_ equal to the value specified.
+    - `@config.when_in(param=[value1, value2, ...])` Will be included if the parameter is equal to one of the specified
+      values.
+    - `@config.when_not_in(param=[value1, value2, ...])` Will be included if the parameter is not equal to any of the
+      specified values.
+    - `@config` If you're feeling adventurous, you can pass in a lambda function that takes in the entire
+      configuration and resolves to `True` or `False`.
+      You probably don't want to do this.
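+
+To make the resolution rules above concrete, here is a toy sketch of config-based function selection. It is not
+Hamilton's implementation: the `when` decorator, the `resolve` helper, and the `cadence__*` functions are all
+hypothetical stand-ins, and the suffix handling is assumed to be a simple split on the dunderscore.
+
+```python
+from typing import Any, Callable, Dict, List
+
+
+def when(**conditions: Any) -> Callable:
+    """Toy stand-in for @config.when: record the required config values."""
+    def decorator(fn: Callable) -> Callable:
+        fn._when = conditions
+        return fn
+    return decorator
+
+
+@when(business_line='kids')
+def cadence__kids() -> float:
+    return 0.0
+
+
+@when(business_line='womens')
+def cadence__womens() -> float:
+    return 21.0
+
+
+def resolve(functions: List[Callable], config: Dict[str, Any]) -> Dict[str, Callable]:
+    """Keep the functions whose conditions match the config, keyed by node name."""
+    nodes = {}
+    for fn in functions:
+        conditions = getattr(fn, '_when', {})
+        if all(config.get(key) == value for key, value in conditions.items()):
+            # Assumption: the node name is everything before the dunderscore.
+            nodes[fn.__name__.split('__')[0]] = fn
+    return nodes
+
+
+candidates = [cadence__kids, cadence__womens]
+nodes = resolve(candidates, {'business_line': 'kids'})
+print(list(nodes), nodes['cadence']())  # ['cadence'] 0.0
+
+# A configuration that no decorator covers leaves the node undefined.
+missing = resolve(candidates, {'business_line': 'mens'})
+print(missing)  # {}
+```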