-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable mutation of the output of nodes in a linear fashion via decorators #922
Comments
This would be a really cool functionality to have! I'm considering the following scenario: We can even add something like turning mutations on or off at building time to test how mutations affect the result. --
Let me know if I misunderstood or missed something from the original idea. |
Do you have a code example of what it would look like and the updates you'd need to make? I think the key here is just determining what workflows we want to enable this to work best for, along with showing what the friction points we currently have are. E.g. is it notebook dev? or is it reusing transforms across multiple data sets? or? What does the current code look like to achieve that outcome? This will then help guide how one registers these additional functions. Now that I think about it, this also seems also pretty close the ideas with that |
Here is a very rough sketch of what I had in mind: # sample_module.py
import pandas as pd
from dev import mutate
def my_function(input: pd.Series) -> pd.Series:
return input
def other_function(input: pd.Series) -> pd.Series:
return input
def no_mutations(input: pd.Series) -> pd.Series:
return input
@mutate(functions=["my_function", "other_function"])
def _transform_1(input_function: pd.Series) -> pd.Series:
return input_function + 1
@mutate(functions=["other_function"])
def _transform_2(input_function: pd.Series) -> pd.Series:
return input_function * 2 Here is a very crude implementation, but the essential idea is that we find the right functions and build a # I think we need to touch only two places
from hamilton.graph_utils import find_functions
import inspect
from types import ModuleType
from typing import Callable, List, Tuple
import collections
import functools
from hamilton.function_modifiers.macros import pipe_output, step
import sample_module
# the actual decorator #############################################################################
def mutate(functions):
def decorator(fn):
@functools.wraps(fn)
def wrapped_fn(*args, **kwargs):
return fn(*args, **kwargs)
# We make all mutable functions hidden
if not fn.__name__.startswith("_"):
# Probably change __qualname__ as well?
wrapped_fn.__name__ = f"_mutate_{fn.__name__}"
else:
wrapped_fn.__name__ = f"_mutate_{fn.__name__[1:]}"
wrapped_fn.functions_to_mutate = functions
return wrapped_fn
return decorator
# somewhere in graph.py ############################################################################
def find_mutations(function_module: ModuleType) -> List[Tuple[str, Callable]]:
"""Function to determine the set of functions we want to build a graph from.
This iterates through the function module and grabs all function definitions.
:return: list of tuples of (func_name, function).
"""
def valid_fn(fn):
return (
inspect.isfunction(fn)
and "_mutate_" in fn.__name__
and is_submodule(inspect.getmodule(fn), function_module)
)
return [f for f in inspect.getmembers(function_module, predicate=valid_fn)]
def create_mutation_pipeline(functions, mutations):
pipelines = collections.defaultdict(list)
for mutation in mutations:
f = mutation[1]
for target_function in f.__dict__["functions_to_mutate"]:
pipelines[target_function].append((f))
return pipelines
def attach_pipelines(functions,pipelines):
new_functions = []
for f in functions:
if not f[0] in pipelines:
new_functions.append((f[0],f[1]))
continue
steps = []
for transform in pipelines[f[0]]:
# place to bind additional args and stuff you can put into step
steps.append(step(transform))
new_functions.append((f[0],pipe_output(*steps)(f[1])))
if __name__ == "__main__":
functions = sum([find_functions(module) for module in [sample_module]], [])
mutations = sum([find_mutations(module) for module in [sample_module]], [])
print(functions)
print(mutations)
pipelines = create_mutation_pipeline(functions=functions, mutations=mutations)
print(pipelines)
new_functions = attach_pipelines(functions=functions, pipelines=pipelines)
print(new_functions) |
re: |
Another use case sketch - I want to pull data but keep it the same name... def cust_data(...) -> pd.DataFrame:
... pull data
@mutate(cust_data)
@check_output(...)
def _filter(cust_data: pd.DataFrame) -> pd.DataFrame:
... filter it
@mutate(cust_data) # do we assume target is first argument?
def _join(cust_data: pd.DataFrame, foo_dep: pd.DataFrame) -> pd.DataFrame:
... join ; we would allow other dag dependencies I think
def aggregate_data(cust_data: pd.DataFrame) -> pd.DataFrame:
... this depends on final cust_data node -- so would be a caveat for users that the functions below still apply.
... (maybe we would have to have a linter to sort files with @mutate)
@mutate(cust_data, some_other_data) # some_other_data is another function; doesn't have to be function pointers, but could be strings.
def _sort(generic_df: pd.DataFrame) -> pd.DataFrame:
... this is a generic helper that is applied to two functions...
@mutate(cust_data, target_="some_data") # required for combined decorator usage?
@with_columns(...)
def _features(some_data: pd.DataFrame) -> pd.DataFrame:
... with columns add features... Things to think about:
|
Nice, that makes a strong point for the use case that we shouldn't limit
Will work on it more tomorrow to get something concrete out. |
Some random thoughts: Mutate should only be able to work with a select few other decorators:
Other decorators I don't think make much sense. Things to think through: tricky part is that we have "step" that turns into a "node" and that's what we want to potentially apply the check_output and subdag operators on... We could constraint this initial feature to not enable this to not make this too big of a project -- but this should be taken into account. |
The other way to think about this, is what code is it replacing: def raw_cust_data(...) -> pd.DataFrame:
... pull data
@check_output(...)
def cust_data_filtered(raw_cust_data: pd.DataFrame) -> pd.DataFrame:
... filter it
def cust_data_joined(cust_data_filtered: pd.DataFrame, foo_dep: pd.DataFrame) -> pd.DataFrame:
... join
def aggregate_data(cust_data_features: pd.DataFrame) -> pd.DataFrame:
.... aggregate
def cust_data_sorted(cust_data_filtered: pd.DataFrame) -> pd.DataFrame:
... sorted
def some_data_sorted(some_data: pd.DataFrame) -> pd.DataFrame:
...
@with_columns(...)
def cust_data_features(cust_data_sorted: pd.DataFrame) -> pd.DataFrame:
... with columns add features... So it replaces:
|
Closing since #1160 was merged |
Is your feature request related to a problem? Please describe.
This is similar to #701 -- but a distributed version.
People don't want to change column/dataframe/artifact names. This conflicts with Hamilton. Is there some API we can support?
Describe the solution you'd like
One idea is that you pipe transforms together and have the framework group things so that there isn't a renaming issue.
E.g.
Notes:
_
which is reserved for private functions. Which is fine I think because these transform functions aren't useful by themselves -- and shouldn't be exposed directly. It also gets around the naming issue of (1) --- we can have the decorator register and capture these. Open decision as to what "declares" the connection to the function -- the first argument name? or the name of the function? or?data_set
and then checks what was registered against it via@mutate
. One initial constraint we can have is that@mutate
has to be in the same module; We should design for multi-module, but as a first pass constrain to same module...data_set
as the result of applying all those transforms.Describe alternatives you've considered
Alternative / additional decorator use could be:
To enable one to have functions names that don't have to match.
This could then help one to write mutations like this -- which I think is a potential vote for allowing multi-module registration --and using module order then to imply transform order application.
Additional context
Here's some code that proves you can intercept and register the functions in mutate decorator.
Then:
The text was updated successfully, but these errors were encountered: